From cdd928d4596c142c15a7d86b2eeadbac718c8da2 Mon Sep 17 00:00:00 2001
From: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Date: Wed, 14 Aug 2024 17:18:13 -0400
Subject: [PATCH] Remove gs://mlperf-llm-public2/ dependency and make
 reproducibility instructions clear (#761)

---
 large_language_model/megatron-lm/README.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/large_language_model/megatron-lm/README.md b/large_language_model/megatron-lm/README.md
index f5c0253f4..515768120 100755
--- a/large_language_model/megatron-lm/README.md
+++ b/large_language_model/megatron-lm/README.md
@@ -193,9 +193,6 @@ rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoin
 
 ### Model conversion from Paxml checkpoints
 Alternatively to downloading the checkpoint in Megatron ready format, it can be obtained by converting a Paxml checkpoint.
-Paxml Checkpoint is available at: `gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000`
-To resume training from the above checkpoint on Megatron, it should be converted into a format suitable for Megatron (this step only needs to be done once).
-
 To convert Paxml checkpoint to the Megatron's format, a [script](scripts/convert_paxml_to_megatron_distributed.py) has been provided:
 ```bash
 # Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
@@ -206,7 +203,7 @@ python json_to_torch.py -i common_fp32.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/co
 This should result in the same checkpoint as described in the "Checkpoint download" section above.
 
 ### Dataset preprocessing
-Here are the instructions to prepare the preprocessed dataset from scratch.
+Here are the instructions to prepare the preprocessed dataset from scratch. Data preprocessing has already been done, and the final dataset can be accessed by following the instructions in the [S3 artifacts download](#s3-artifacts-download) section.
 
 #### Data Download
 Training dataset -
@@ -220,7 +217,7 @@ git lfs pull --include "en/c4-train.009*.json.gz"
 git lfs pull --include "en/c4-train.01*.json.gz"
 ```
 
-Validation dataset needs to be downloaded from `gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json` to ${C4_PATH}.
+The validation data subset can be downloaded by following the instructions in the [S3 artifacts download](#s3-artifacts-download) section and should be placed in ${C4_PATH}.
 
 #### Data Preprocessing for Megatron-LM
 
@@ -247,7 +244,7 @@ for shard in {6..7}; do
 done
 ```
 
-After preparing the data folder, download tokenizer model. The tokenizer model should be downloaded from `gs://mlperf-llm-public2/vocab/c4_en_301_5Mexp2_spm.model` and renamed as `${C4_PATH}/tokenizers/c4_spm/sentencepiece.model`. Make sure an output directory `${C4_PATH}/preprocessed_c4_spm` exists before the next step.
+After preparing the data folder, download the tokenizer model. The tokenizer model `c4_en_301_5Mexp2_spm.model` can be downloaded by following the instructions in the [S3 artifacts download](#s3-artifacts-download) section and should be saved as `${C4_PATH}/tokenizers/c4_spm/sentencepiece.model`. Make sure the output directory `${C4_PATH}/preprocessed_c4_spm` exists before the next step.
 
 Modify `C4_PATH` in `preprocess.sh` and `preprocess_val.sh` to specify the correct input/output paths and run preprocessing as follows
 
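
For readers reproducing the data setup this patch describes, here is a minimal bash sketch of the `${C4_PATH}` layout it expects. It reuses the `mlc-training` rclone remote visible in the checkpoint-download command quoted in the first hunk; the exact object paths under `mlcommons-training-wg-public` are assumptions for illustration only, so take the authoritative paths from the README's "S3 artifacts download" section.

```bash
# Sketch only: the S3 object paths below are illustrative assumptions,
# not the authoritative ones from the "S3 artifacts download" section.
export C4_PATH=/data/c4   # adjust to your storage location

# Validation subset (filename from the README; bucket path assumed):
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/c4-validation_24567exp.json ${C4_PATH}/

# Tokenizer model, saved under the name the preprocessing scripts expect:
mkdir -p ${C4_PATH}/tokenizers/c4_spm
rclone copyto mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/c4_en_301_5Mexp2_spm.model \
  ${C4_PATH}/tokenizers/c4_spm/sentencepiece.model

# Output directory must exist before running preprocess.sh / preprocess_val.sh:
mkdir -p ${C4_PATH}/preprocessed_c4_spm
```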