-
I have converted this checkpoint to the Hugging Face format for inference. It is available here: https://huggingface.co/mdouglas/llmc-gpt2-774M-150B
Edit: Initially I had mistakenly uploaded a 124M checkpoint, but that's corrected now.
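For anyone who wants to try the converted checkpoint, here is a minimal loading/generation sketch. The use of the transformers library and of the stock `gpt2` tokenizer are my assumptions (llm.c trains with the standard GPT-2 BPE tokenizer); the prompt and sampling settings are arbitrary:

```bash
# Minimal sketch for trying the converted checkpoint; assumes the transformers
# and torch packages are installed. The stock "gpt2" tokenizer is loaded here,
# which is an assumption about how the upload is meant to be used.
python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "mdouglas/llmc-gpt2-774M-150B"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=50)
print(tok.decode(out[0], skip_special_tokens=True))
PY
```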
-
I decided to run a few benchmarks on this model on my own machine, alongside GPT-2-124M and GPT-2-774M. I ran all of these 0-shot; the GPT-3-Large numbers are reported directly from the paper. On most of these benchmarks the model is close to or on par with GPT-3-Large (774M), which is surprising. I suspect the FineWeb dataset and its filtering make it a high-quality dataset to train on, since the model gets very close to GPT-3-Large while being trained on only half as many tokens. Maybe a bit more training would get it fully on par, though that's only with respect to these benchmarks. It is interesting that GPT-2-124M-OpenAI has the highest BoolQ score, which might come from the nature of WebText compared with the format of BoolQ.
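The comment doesn't say which harness was used; if you reproduce a comparable 0-shot run with EleutherAI's lm-evaluation-harness (an assumption on my part, as are the task list and arguments), it might look roughly like this:

```bash
# Hypothetical 0-shot eval sketch using lm-evaluation-harness (pip install lm-eval).
# The task list and batch size are illustrative, not the commenter's actual setup.
lm_eval --model hf \
    --model_args pretrained=mdouglas/llmc-gpt2-774M-150B \
    --tasks hellaswag,boolq,piqa,winogrande \
    --num_fewshot 0 \
    --batch_size 8
```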
-
I left the GPT-2 774M model running for ~6 days on my 8X A100 80GB node (150B tokens, 1.5 epochs over the 100B FineWeb sample dataset). Training finished a few hours ago and went well, with no major issues or incidents.
As with our previous run, we somewhat unexpectedly outperform both GPT-2 and GPT-3 very quickly. I'm still not 100% sure whether this has more to do with our evaluations or with something else (e.g. the model could be much worse on multilingual, math, and code), or possibly with the quality of FineWeb (?).
The run configuration and the training script used are in scripts/run_gpt2_774M.sh.
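For readers who don't open the script, here is a rough sketch of what such a launch looks like. The flags are train_gpt2cu flags, but the specific values below are illustrative placeholders rather than the run's actual settings:

```bash
# Illustrative sketch only -- the real flag values live in scripts/run_gpt2_774M.sh.
# -e d36 selects the depth-36 (774M) model; -b/-t are micro-batch size and sequence
# length; -d is total tokens per optimizer step; -l/-u/-q are peak LR, warmup
# iterations, and final LR fraction; -y 1 resumes from the latest checkpoint if present.
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb100B/fineweb_train_*.bin" \
    -j "dev/data/fineweb100B/fineweb_val_*.bin" \
    -o log_gpt2_774M \
    -e d36 \
    -b 16 -t 1024 \
    -d 1048576 \
    -l 0.00025 -u 700 -q 0.0 \
    -y 1
```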
Sometimes the training run stalls and hangs with a weird MPI error, so in addition to the training script I ran a watcher.sh in a second screen session.
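The watcher itself isn't reproduced above; as a minimal sketch of what such a script could do (the stale-log heuristic, file names, and thresholds are my assumptions, not the contents of the actual watcher.sh), it could kill the run whenever the log stops advancing and relaunch the training script, which then resumes from the latest checkpoint:

```bash
#!/bin/bash
# Hypothetical watcher sketch: if the training log hasn't been touched for a while,
# assume the MPI run is hung, kill it, and relaunch the training script (which
# resumes from the latest checkpoint via -y 1).
LOGFILE=log_gpt2_774M/main.log   # assumed log location
STALE_SECONDS=600                # treat the run as hung after 10 minutes of silence

while true; do
    sleep 60
    if [ -f "$LOGFILE" ]; then
        last_update=$(stat -c %Y "$LOGFILE")
        now=$(date +%s)
        if [ $((now - last_update)) -gt "$STALE_SECONDS" ]; then
            echo "$(date): log stale, killing hung run"
            pkill -f train_gpt2cu || true
        fi
    fi
    # relaunch if nothing is running anymore
    if ! pgrep -f train_gpt2cu > /dev/null; then
        echo "$(date): restarting training"
        bash scripts/run_gpt2_774M.sh &
    fi
done
```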
Reflections:
- I am very curious why we are performing better than expected, so before the next big run I'll convert this checkpoint to Hugging Face format and run a wider suite of evals.
- We shouldn't decay the LR to 0%. It looks like at the tail end of the optimization we are going way too slow. For future runs I will probably try to follow the GPT-3 paper and only decay to 10%, i.e. `-q 0.1` instead of `-q 0.0`.
- There is an unnerving downward "kink" in the validation loss exactly at the 1 epoch boundary. This makes me nervous that the validation set and training set of FineWeb have an unexpectedly large overlap.
s3 paths
huggingface paths
huggingface repo: thank you mdouglas for the upload.

Next up