Replies: 3 comments 1 reply
- OMG! You are actually doing this. Huge props!
- Big kudos for giving attention to compile times. This is something that is often overlooked and can get way out of hand! What insane progress this project has made in the last 2 weeks!! 👏👏👏
- I recall your post about large language models for space. Are there any plans to develop that in parallel during this project?

[May 3, 2024]
It is day 24 of the llm.c project. We can now do multi-GPU training, in bfloat16, with flash attention, and it is FAST! 🚀
Single GPU training. We are now training GPT-2 (124M) faster than PyTorch nightly by ~7%, with no asterisks. That is, this is the fastest PyTorch run that I am aware can be configured for single-GPU training on Ampere, including all the modern & standard bells and whistles: mixed precision training, torch.compile, and flash attention. Compared to the current PyTorch stable release 2.3.0, we are actually ~46% faster, but the folks at PyTorch have been busy and merged a number of changes over the last ~month that happen to greatly speed up the GPT-2 training setting (very nice!). Lastly, compared to the last State of the Union on April 22 (10 days ago), this is a ~3X speedup. A lot of improvements landed over the last ~week to get us here; the major ones include:
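(Aside: a minimal sketch of what that kind of single-GPU PyTorch configuration can look like. This is a hypothetical helper, not the actual benchmark script; `model` is assumed to be a GPT-2-style nn.Module whose attention block uses F.scaled_dot_product_attention so that flash attention kernels are dispatched, and `loader` is assumed to yield (inputs, targets) batches of token ids on the GPU.)

```python
import torch
import torch.nn.functional as F

def train_steps(model, loader, lr=3e-4):
    """Single-GPU training with the standard bells and whistles: bf16 mixed
    precision, torch.compile, and (inside the model) flash attention via
    F.scaled_dot_product_attention."""
    model = model.to("cuda")
    model = torch.compile(model)                       # torch.compile
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for inputs, targets in loader:                     # token id tensors on the GPU
        optimizer.zero_grad(set_to_none=True)
        # mixed precision: run the forward/backward math in bfloat16
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)                     # (B, T, vocab_size)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        optimizer.step()
```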
Multi-GPU training. Achieved a solid version 1:
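As a rough illustration of the general data-parallel pattern involved (each GPU holds a full model replica, and gradients are averaged across processes after the backward pass), here is a minimal sketch using torch.distributed rather than the project's C code; the helper is an assumption for illustration, not llm.c's actual implementation:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Hypothetical helper: average gradients across all ranks (one process
    per GPU). This is the core synchronization step of data-parallel training."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over ranks
            p.grad /= world_size                           # then average
```

Each rank runs forward/backward on its own slice of the batch (after dist.init_process_group), calls the helper, and then steps its optimizer, so all replicas stay in sync.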
Functional. Beyond training efficiency alone, we are gearing up for a proper reproduction of the GPT-2 miniseries of model sizes, from 124M all the way to the actual 1.6B model. For this we will need additional changes, including gradient accumulation, gradient clipping, init from random weights directly in C, learning rate warmup and schedule, evaluation (WikiText 103?), and a modern pretraining dataset (e.g. fineweb?). A lot of these components are pending and currently being worked on.
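(To make a few of those pending pieces concrete, here is a hedged PyTorch-style sketch of gradient accumulation, gradient clipping, and a warmup-plus-cosine learning-rate schedule. The function names and all constants are placeholders, not the project's settings, and llm.c will implement these directly in C/CUDA.)

```python
import math
import torch

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=1000, max_steps=10000):
    """Linear warmup then cosine decay (all constants are placeholders)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def train_step(model, optimizer, micro_batches, step, grad_accum_steps=8, clip=1.0):
    """One optimizer step accumulated over grad_accum_steps micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:            # grad_accum_steps micro-batches
        loss = model(inputs, targets)                # assumes the model returns a mean loss
        (loss / grad_accum_steps).backward()         # gradient accumulation
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)   # gradient clipping
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)                    # LR warmup + schedule
    optimizer.step()
```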
Goal. The current goal is to create a reliable, stable, clean, tested, minimal, hardened and sufficiently optimized LLM stack that reproduces the GPT-2 miniseries of all model sizes, from 124M to 1.6B, directly in C/CUDA. At current pace this feels like somewhere around ~2 weeks out.
Lines of code
👎 With more features and optimizations come more lines of code. The main code file train_gpt2.cu is now at around 3,000 lines of code (LOC). In addition, we split off two new files, common.h (300 LOC) and tokenizer.h (100 LOC), which we now include. This is up from ~2,000 LOC on April 22.
Latency
👎 Sad to report some less upbeat developments in the compile latency of the project:
time make train_gpt2cu: 4.3s (up from 2.4s before). So, sadly, this is now about as bad as import torch, and we are very interested in how we could decrease this latency.
time make train_gpt2cu USE_CUDNN=1: includes the cudnn flash attention and gives great speedups and memory savings, but sadly bloats the compile latency up to ~1m24s 🤦♂️. This is a major and previously unexpected slowdown coming from our use of cudnn, and we are very interested in how we could delete this dependency as a result.
Peak memory
👍 Our peak memory usage has improved quite a bit recently, by being very careful with what memory we allocate and how we use it, especially with our Fused Classifier. Measured while training with batch size 32 and sequence length 1024; example invocations for llm.c and PyTorch:
llm.c: 16.6 GiB
PyTorch: 37.2 GiB
(err, honestly the PyTorch number feels a bit suspiciously high in this comparison, todo investigate more and edit)
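(On the PyTorch side, one common way to obtain such a number, though not necessarily how the figure above was measured, is to query the CUDA caching allocator; a minimal sketch:)

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations here ...
peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"peak allocated: {peak_gib:.1f} GiB")
```

Note that max_memory_allocated() only tracks tensors owned by PyTorch's caching allocator, so it can differ from what nvidia-smi reports (which also includes the CUDA context and allocator overhead).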
Runtime, DRAM traffic, instructions:
👍 run of profile_gpt2cu.py (batch size 24):
kernels eye candy, stale by 1 day:
Nsight Systems timeline eye candy, stale by 1 day:
Contributors
It is worth especially distinguishing @ngc92 and @ademeure, who are both very active and have contributed a great amount of code, ideas, and expertise to the llm.c project.
Notable forks
Three new notable forks:
Featured discussions
LLM.c Speed of Light & Beyond (A100 Performance Analysis) by @ademeure, covering a recent profiling run of llm.c and ideas on further steps for optimization.
For more llm.c discussions, join us in #llmc on the nn zero to hero Discord, or in the currently more active #llmdotc on the CUDA MODE Discord.
fp32 CUDA version plans
We also split off the fp32 CUDA code into its own file, train_gpt2fp32.cu, which will become pure CUDA kernels only (no cublas, cudnn, etc.), and which we think would make a really nice endpoint for a CUDA course. You start with the gpt2.c pure CPU implementation and see how fast you can make it by the end of the course on GPU, with kernels only and no dependencies.
Fine print
All measurements done on:
llm.c: ~167K tok/s (on the SOTA PR from this morning that is merging imminently); ~160K tok/s on master
PyTorch code as-is on master runs at ~150K tok/s (i.e. we are 167/150 ~= 11% faster)
If you manually pad the vocab size to 50304, tok/s improves from ~150K to ~156K, reducing the llm.c speed improvement to ~7%.
Note that padding the vocab is not a trivial matter for GPT-2 in PyTorch. You have to know that having a vocab size of 50257 is bad, and that it should be e.g. 50304 (which is divisible by 64). Then, because the token embedding table shares weights with the classifier, you have to be very careful to mask out (or somehow set to -inf) the padded dimensions, and to never use them during sampling. And you have to do model surgery if you init with OpenAI weights. The original OpenAI GPT-2 code also did not pad the vocab in this way.
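A hedged sketch of what that padding and masking can look like in PyTorch; the helper names are hypothetical, and `openai_wte` stands for the original 50257-row token embedding from the OpenAI checkpoint:

```python
import torch

VOCAB = 50257          # GPT-2's true vocab size
PADDED_VOCAB = 50304   # next multiple of 64; friendlier shapes on the GPU

def pad_wte(openai_wte: torch.Tensor) -> torch.Tensor:
    """Model surgery when initializing from OpenAI weights: copy the real
    50257 rows into a 50304-row table and leave the padded rows at zero.
    With weight tying this same matrix is also the classifier head, so the
    model will now emit PADDED_VOCAB logits per position."""
    wte = torch.zeros(PADDED_VOCAB, openai_wte.size(1), dtype=openai_wte.dtype)
    wte[:VOCAB] = openai_wte
    return wte

def sample_next_token(logits: torch.Tensor) -> torch.Tensor:
    """Mask out the padded logits so the fake token ids can never be sampled.
    logits: (B, PADDED_VOCAB) for the last position of each sequence."""
    logits = logits.clone()
    logits[..., VOCAB:] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

During training, the cross-entropy targets never reference the padded ids, so only sampling needs the explicit mask.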
Acknowledgements
Thank you to the excellent Lambda labs for sponsoring this project with GPUs. Lambda labs is our favorite, go-to place for cloud GPUs 🙏.