Replies: 3 comments 1 reply
- OMG! You are actually doing this. Huge props!
- Big kudos for giving attention to compile times. This is something that is often overlooked and can get way out of hand! What insane progress this project has made in the last 2 weeks!! 👏👏👏
- I recall your post about large language models for space. Are there any plans to develop that in parallel during this project?

[May 3, 2024]
It is day 24 of the llm.c project. We can now do multi-GPU training, in bfloat16, with flash attention, and it is FAST! 🚀
Single GPU training. We are now training GPT-2 (124M) faster than PyTorch nightly by ~7%, with no asterisks. That is, this is the fastest PyTorch run that I am aware can be configured for single-GPU training on Ampere, including all the modern & standard bells and whistles: mixed precision training, torch.compile, and flash attention. Compared to the current PyTorch stable release 2.3.0, we are actually ~46% faster, but the folks at PyTorch have been busy and merged a number of changes over the last ~month that happen to greatly speed up the GPT-2 training setting (very nice!). Lastly, compared to the last State of the Union on April 22 (10 days ago), this is a ~3X speedup. A lot of improvements landed over the last ~week to get us here; the major ones include:
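(Aside: a minimal sketch of what that kind of single-GPU PyTorch configuration can look like. This is a hypothetical helper, not the actual benchmark script; `model` is assumed to be a GPT-2-style nn.Module whose attention block uses F.scaled_dot_product_attention so that flash attention kernels are dispatched, and `loader` is assumed to yield (inputs, targets) batches of token ids on the GPU.)

```python
import torch
import torch.nn.functional as F

def train_steps(model, loader, lr=3e-4):
    """Single-GPU training with the standard bells and whistles: bf16 mixed
    precision, torch.compile, and (inside the model) flash attention via
    F.scaled_dot_product_attention."""
    model = model.to("cuda")
    model = torch.compile(model)                       # torch.compile
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for inputs, targets in loader:                     # token id tensors on the GPU
        optimizer.zero_grad(set_to_none=True)
        # mixed precision: run the forward/backward math in bfloat16
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)                     # (B, T, vocab_size)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        optimizer.step()
```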
Multi-GPU training. Achieved a solid version 1:
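As a rough illustration of the general data-parallel pattern involved (each GPU holds a full model replica, and gradients are averaged across processes after the backward pass), here is a minimal sketch using torch.distributed rather than the project's C code; the helper is an assumption for illustration, not llm.c's actual implementation:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Hypothetical helper: average gradients across all ranks (one process
    per GPU). This is the core synchronization step of data-parallel training."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over ranks
            p.grad /= world_size                           # then average
```

Each rank runs forward/backward on its own slice of the batch (after dist.init_process_group), calls the helper, and then steps its optimizer, so all replicas stay in sync.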
Functional. Beyond training efficiency alone, we are gearing up for a proper reproduction of the GPT-2 miniseries of model sizes, from 124M all the way to the actual 1.6B model. For this we will need additional changes, including gradient accumulation, gradient clipping, init from random weights directly in C, learning rate warmup and schedule, evaluation (WikiText 103?), and a modern pretraining dataset (e.g. fineweb?). A lot of these components are pending and currently being worked on.
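(To make a few of those pending pieces concrete, here is a hedged PyTorch-style sketch of gradient accumulation, gradient clipping, and a warmup-plus-cosine learning-rate schedule. The function names and all constants are placeholders, not the project's settings, and llm.c will implement these directly in C/CUDA.)

```python
import math
import torch

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=1000, max_steps=10000):
    """Linear warmup then cosine decay (all constants are placeholders)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def train_step(model, optimizer, micro_batches, step, grad_accum_steps=8, clip=1.0):
    """One optimizer step accumulated over grad_accum_steps micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:            # grad_accum_steps micro-batches
        loss = model(inputs, targets)                # assumes the model returns a mean loss
        (loss / grad_accum_steps).backward()         # gradient accumulation
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)   # gradient clipping
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)                    # LR warmup + schedule
    optimizer.step()
```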
Goal. The current goal is to create a reliable, stable, clean, tested, minimal, hardened and sufficiently optimized LLM stack that reproduces the GPT-2 miniseries of all model sizes, from 124M to 1.6B, directly in C/CUDA. At current pace this feels like somewhere around ~2 weeks out.
Lines of code
👎 With more features and optimizations come more lines of code. The main code file train_gpt2.cu is now at around 3,000 lines of code (LOC). In addition, we split off two new files, common.h (300 LOC) and tokenizer.h (100 LOC), which we now include. This is up from ~2,000 LOC on April 22.
Latency
👎 Sad to report some less upbeat developments in the compile latency of the project:
time make train_gpt2cu: 4.3s (up from 2.4s before). So, sadly, this is now about as bad as import torch, and we are very interested in how we could decrease this latency.
time make train_gpt2cu USE_CUDNN=1: includes the cudnn flash attention and gives great speedups and memory savings, but sadly bloats the compile latency up to ~1m24s 🤦♂️. This is a major and previously unexpected slowdown coming from our use of cudnn, and we are very interested in how we could delete this dependency as a result.
Peak memory
👍 Our peak memory usage has improved quite a bit recently, by being very careful with what memory we allocate and how we use it, especially with our Fused Classifier. Measured while training with batch size 32 and sequence length 1024; example invocations for llm.c and PyTorch:
llm.c: 16.6 GiB
PyTorch: 37.2 GiB
(err, honestly the PyTorch number feels a bit suspiciously high in this comparison, todo investigate more and edit)
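(On the PyTorch side, one common way to obtain such a number, though not necessarily how the figure above was measured, is to query the CUDA caching allocator; a minimal sketch:)

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations here ...
peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"peak allocated: {peak_gib:.1f} GiB")
```

Note that max_memory_allocated() only tracks tensors owned by PyTorch's caching allocator, so it can differ from what nvidia-smi reports (which also includes the CUDA context and allocator overhead).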
Runtime, DRAM traffic, instructions:
👍 run of profile_gpt2cu.py (batch size 24):
kernels eye candy, stale by 1 day:
Nsight Systems timeline eye candy, stale by 1 day:
Contributors
It is worth especially distinguishing @ngc92 and @ademeure, who are both very active and have contributed a great amount of code, ideas, and expertise to the llm.c project.
Notable forks
Three new notable forks:
Featured discussions
LLM.c Speed of Light & Beyond (A100 Performance Analysis) by @ademeure, covering a recent profiling run of llm.c and ideas on further steps for optimization.
For more llm.c discussions, join us in #llmc on the nn zero to hero Discord, or in the currently more active #llmdotc on the CUDA MODE Discord.
fp32 CUDA version plans
We also split off the fp32 CUDA code into its own file, train_gpt2fp32.cu, which will become pure CUDA kernels only (no cublas, cudnn, etc.), and which we think would make a really nice endpoint for a CUDA course. You start with the gpt2.c pure CPU implementation and see how fast you can make it by the end of the course on GPU, with kernels only and no dependencies.
Fine print
All measurements done on:
llm.c: ~167K tok/s (on the SOTA PR from this morning that is merging imminently); ~160K tok/s on master
PyTorch code as-is on master runs at ~150K tok/s (i.e. we are 167/150 ~= 11% faster)
If you manually pad the vocab size to 50304, tok/s improves from ~150K to ~156K, reducing the llm.c speed improvement to ~7%.
Note that padding the vocab is not a trivial matter for GPT-2 in PyTorch. You have to know that having a vocab size of 50257 is bad, and that it should be e.g. 50304 (which is divisible by 64). Then, because the token embedding table shares weights with the classifier, you have to be very careful to mask out (or somehow set to -inf) the padded dimensions, and to never use them during sampling. And you have to do model surgery if you init with OpenAI weights. The original OpenAI GPT-2 code also did not pad the vocab in this way.
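A hedged sketch of what that padding and masking can look like in PyTorch; the helper names are hypothetical, and `openai_wte` stands for the original 50257-row token embedding from the OpenAI checkpoint:

```python
import torch

VOCAB = 50257          # GPT-2's true vocab size
PADDED_VOCAB = 50304   # next multiple of 64; friendlier shapes on the GPU

def pad_wte(openai_wte: torch.Tensor) -> torch.Tensor:
    """Model surgery when initializing from OpenAI weights: copy the real
    50257 rows into a 50304-row table and leave the padded rows at zero.
    With weight tying this same matrix is also the classifier head, so the
    model will now emit PADDED_VOCAB logits per position."""
    wte = torch.zeros(PADDED_VOCAB, openai_wte.size(1), dtype=openai_wte.dtype)
    wte[:VOCAB] = openai_wte
    return wte

def sample_next_token(logits: torch.Tensor) -> torch.Tensor:
    """Mask out the padded logits so the fake token ids can never be sampled.
    logits: (B, PADDED_VOCAB) for the last position of each sequence."""
    logits = logits.clone()
    logits[..., VOCAB:] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

During training, the cross-entropy targets never reference the padded ids, so only sampling needs the explicit mask.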
Acknowledgements
Thank you to the excellent Lambda labs for sponsoring this project with GPUs. Lambda labs is our favorite, go-to place for cloud GPUs 🙏.