-
I have converted this checkpoint to the Hugging Face format for inference. It is available here: https://huggingface.co/mdouglas/llmc-gpt2-774M-150B
Edit: Initially I had mistakenly uploaded a 124M checkpoint, but that's corrected now.
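For anyone who wants to try the converted checkpoint, here is a minimal loading/generation sketch. The use of the transformers library and of the stock `gpt2` tokenizer are my assumptions (llm.c trains with the standard GPT-2 BPE tokenizer); the prompt and sampling settings are arbitrary:

```bash
# Minimal sketch for trying the converted checkpoint; assumes the transformers
# and torch packages are installed. The stock "gpt2" tokenizer is loaded here,
# which is an assumption about how the upload is meant to be used.
python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "mdouglas/llmc-gpt2-774M-150B"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=50)
print(tok.decode(out[0], skip_special_tokens=True))
PY
```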
-
I decided to run a few benchmarks on this model on my own machine, alongside GPT-2-124M and GPT-2-774M. I ran all of these 0-shot; the GPT-3-Large numbers are reported directly from the paper. On most of these benchmarks the model is close to or on par with GPT-3-Large (774M), which is surprising. I suspect the FineWeb dataset and its filtering make it a high-quality dataset to train on, since the model gets very close to GPT-3-Large while being trained on only half as many tokens. Maybe a bit more training would get it fully on par, though that's only with respect to these benchmarks. It is interesting that GPT-2-124M-OpenAI has the highest BoolQ score, which might come from the nature of WebText compared with the format of BoolQ.
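The comment doesn't say which harness was used; if you reproduce a comparable 0-shot run with EleutherAI's lm-evaluation-harness (an assumption on my part, as are the task list and arguments), it might look roughly like this:

```bash
# Hypothetical 0-shot eval sketch using lm-evaluation-harness (pip install lm-eval).
# The task list and batch size are illustrative, not the commenter's actual setup.
lm_eval --model hf \
    --model_args pretrained=mdouglas/llmc-gpt2-774M-150B \
    --tasks hellaswag,boolq,piqa,winogrande \
    --num_fewshot 0 \
    --batch_size 8
```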
-
I left the GPT-2 774M model running for ~6 days on my 8X A100 80GB node (150B tokens, 1.5 epochs over the 100B FineWeb sample dataset). Training finished a few hours ago and went well, with no major issues or incidents.
As with our previous run, we somewhat unexpectedly outperform both GPT-2 and GPT-3 very quickly. I'm still not 100% sure whether this has more to do with our evaluations or with something else (e.g. the model could be much worse on multilingual, math, and code), or possibly with the quality of FineWeb (?).
The run configuration and the training script used are in scripts/run_gpt2_774M.sh.
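For readers who don't open the script, here is a rough sketch of what such a launch looks like. The flags are train_gpt2cu flags, but the specific values below are illustrative placeholders rather than the run's actual settings:

```bash
# Illustrative sketch only -- the real flag values live in scripts/run_gpt2_774M.sh.
# -e d36 selects the depth-36 (774M) model; -b/-t are micro-batch size and sequence
# length; -d is total tokens per optimizer step; -l/-u/-q are peak LR, warmup
# iterations, and final LR fraction; -y 1 resumes from the latest checkpoint if present.
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/fineweb100B/fineweb_train_*.bin" \
    -j "dev/data/fineweb100B/fineweb_val_*.bin" \
    -o log_gpt2_774M \
    -e d36 \
    -b 16 -t 1024 \
    -d 1048576 \
    -l 0.00025 -u 700 -q 0.0 \
    -y 1
```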
Sometimes the training run stalls and hangs with a weird MPI error, so in addition to the training script I ran a watcher.sh in a second screen session.
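The watcher itself isn't reproduced above; as a minimal sketch of what such a script could do (the stale-log heuristic, file names, and thresholds are my assumptions, not the contents of the actual watcher.sh), it could kill the run whenever the log stops advancing and relaunch the training script, which then resumes from the latest checkpoint:

```bash
#!/bin/bash
# Hypothetical watcher sketch: if the training log hasn't been touched for a while,
# assume the MPI run is hung, kill it, and relaunch the training script (which
# resumes from the latest checkpoint via -y 1).
LOGFILE=log_gpt2_774M/main.log   # assumed log location
STALE_SECONDS=600                # treat the run as hung after 10 minutes of silence

while true; do
    sleep 60
    if [ -f "$LOGFILE" ]; then
        last_update=$(stat -c %Y "$LOGFILE")
        now=$(date +%s)
        if [ $((now - last_update)) -gt "$STALE_SECONDS" ]; then
            echo "$(date): log stale, killing hung run"
            pkill -f train_gpt2cu || true
        fi
    fi
    # relaunch if nothing is running anymore
    if ! pgrep -f train_gpt2cu > /dev/null; then
        echo "$(date): restarting training"
        bash scripts/run_gpt2_774M.sh &
    fi
done
```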
Reflections:
- I am very curious why we are performing better than expected, so before the next big run I'll convert this checkpoint to Hugging Face format and run a wider suite of evals.
- We shouldn't decay the LR to 0%. It looks like at the tail end of the optimization we are going way too slow. For future runs I will probably try to follow the GPT-3 paper and only decay to 10%, i.e. `-q 0.1` instead of `-q 0.0`.
- There is an unnerving downward "kink" in the validation loss exactly at the 1 epoch boundary. This makes me nervous that the validation set and training set of FineWeb have an unexpectedly large overlap.
s3 paths
huggingface paths
huggingface repo: thank you mdouglas for the upload.

Next up