Benchmarked nanoGPT training costs across A100, H100, and RTX6000. A100 was ~2x more cost-efficient than H100

I recently ran a series of experiments to determine the actual cost-efficiency of different GPU architectures for small-scale training and fine-tuning tasks. There is often a default assumption that “newer = better,” but I wanted to see if the premium is justified for non-LLM-scale workloads.

I trained Andrej Karpathy’s nanoGPT on the Shakespeare corpus using four distinct 1x GPU nodes (H100 SXM, H100 PCIe, A100 40GB, and RTX6000). I timed the active training loop (excluding package downloads) over twenty trials, then normalized each result by the node’s current spot price.
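
If you want to reproduce the timing, the harness boils down to something like the sketch below. The config path and flags are illustrative assumptions (standard nanoGPT defaults), not necessarily identical to what I ran, and the clock starts only after environment setup is done.

```python
# Illustrative timing harness (assumes the nanoGPT repo is the working
# directory and that package installs / data prep already happened).
import statistics
import subprocess
import time

TRIALS = 20
CMD = [
    "python", "train.py", "config/train_shakespeare_char.py",
    "--compile=False",  # illustrative flag choice; tune for your node
]

durations = []
for _ in range(TRIALS):
    start = time.perf_counter()        # clock starts after setup is finished
    subprocess.run(CMD, check=True)    # one full training run
    durations.append(time.perf_counter() - start)

print(f"mean {statistics.mean(durations):.1f}s, "
      f"stdev {statistics.stdev(durations):.1f}s over {TRIALS} runs")
```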

Key Findings:

  • The Winner (A100 40GB): For a workload this size, the A100 was nearly 2x more cost-efficient per finished run than the H100 (0.8 cents per run vs. 1.6 cents). The H100 is obviously faster, but a model this small never saturates it, so you end up paying a premium for idle compute capacity.

  • Form Factor Matters (SXM vs PCIe): The H100 SXM trained significantly faster than the H100 PCIe (the SXM part has a higher power limit and faster memory). Even though the SXM node rented at a higher hourly rate, the speed gap was large enough that total cost per run came out lower on SXM (1.6 cents per run vs. 2.0 cents for PCIe).

  • False Economy (RTX6000): This card had the lowest hourly price tag, yet it came out nearly 5x worse than the A100 in cost per finished run (3.8 cents per run; the cost arithmetic is sketched below). The much longer training time, driven by memory bandwidth and compute bottlenecks, completely wiped out the hourly savings.
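
The per-run figures above are simply mean wall-clock training time multiplied by the hourly spot rate, converted to cents. Here is that arithmetic as a small sketch; the times and prices in the example are placeholders, not the measured values or the rates I actually paid.

```python
# Cost-per-run and efficiency-ratio arithmetic (placeholder inputs only).
def cents_per_run(train_seconds: float, hourly_usd: float) -> float:
    """Cost of one finished training run, in US cents."""
    return train_seconds / 3600 * hourly_usd * 100

def efficiency_ratio(a_cents: float, b_cents: float) -> float:
    """How many times cheaper per run option A is than option B."""
    return b_cents / a_cents

# Purely illustrative example values:
a = cents_per_run(train_seconds=120, hourly_usd=0.40)  # ~1.3 cents
b = cents_per_run(train_seconds=60, hourly_usd=2.00)   # ~3.3 cents
print(f"A is {efficiency_ratio(a, b):.1f}x cheaper per run than B")
```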

If you are doing massive pre-training (e.g., Llama 70B), the H100’s FP8 throughput is necessary. But for hobby-scale tasks, inference, or light fine-tuning, the A100 might be the price-performance sweet spot.

I’ve published the full data breakdown and an interactive tool to calculate these efficiency ratios based on your own market prices here:

Full Write-up
Interactive Calculator
