#!/bin/bash
#SBATCH --job-name=spec_dec_skill
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:hgx:8
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=6G
#SBATCH -p pos
#SBATCH -o ./log/spec_training.log
#SBATCH -w hgx030
# On AWS, the EFA and OFI paths enable NCCL to use optimized networking.
export LD_LIBRARY_PATH=/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:$LD_LIBRARY_PATH
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
MODEL_ARGS="
--model_variant=13b
--use_dummy_dataset=False
--ckpt_load_path=/cognitive_comp/chenyun/vllm_q/model_path/switch/0701_skill_summary/global_step3100_hf_bf16
--ckpt_save_path=/cognitive_comp/chenyun/fms-fsdp/model_path/switch/0701_skill_summary/spec_dec_skill
--data_path=/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo
--datasets=/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo/selected.json
--fsdp_activation_checkpointing=False
--selective_checkpointing=1
--sharding_strategy=hsdp
--low_cpu_fsdp=False
--batch_size=2
--report_interval=200
--checkpoint_interval=20000
--use_torch_compile=False
--use_profiler=False
"
torchrun \
    --nnodes=$SLURM_NTASKS \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=8 \
    --master_addr=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) \
    --master_port="12234" \
    main_training.py \
    ${MODEL_ARGS}
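For reference, if I read the #SBATCH directives right, --cpus-per-task=32 with --mem-per-cpu=6G caps the job's cgroup at roughly 192 GB of host RAM. The snippet below is only a rough way to confirm that limit and watch peak usage while the job runs (the cgroup path assumes cgroup v1; v2 exposes memory.max instead):

# Rough memory-budget check: 32 cpus * 6G mem-per-cpu => ~192 GB cgroup limit for this job
squeue -j "$SLURM_JOB_ID" -o "%i %m"                                   # memory Slurm granted
cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes
sstat -j "${SLURM_JOB_ID}.batch" --format=JobID,MaxRSS,AveRSS          # peak RSS so far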
I submitted the above train.sh to the HGX node with sbatch, but it failed with the following error:
--> running with these configs train_config(model_variant='13b', ckpt_load_path='/cognitive_comp/chenyun/vllm_q/model_path/switch/0701_skill_summary/global_step3100_hf_bf16', ckpt_save_path='/cognitive_comp/chenyun/fms-fsdp/model_path/switch/0701_skill_summary/spec_dec_skill', use_dummy_dataset=False, data_path='/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo', seq_length=8192, sep_token=1, datasets='/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo/selected.json', weights='7700,500,550,28,17,22,25,8,100,500,175,250,100,25', logical_shards=800, mixed_precision=True, fsdp_activation_checkpointing=False, selective_checkpointing=1, sharding_strategy='hsdp', low_cpu_fsdp=False, seed=2023, batch_size=2, num_steps=2000000, learning_rate=0.0003, grad_clip_thresh=1.0, use_profiler=False, profiler_rank0_only=True, report_interval=200, checkpoint_interval=20000, tracker=None, tracker_dir='/lustre/lchu/fms-fsdp', tracker_project_name='llama', tracker_run_id=None, use_torch_compile=False, model_path='/lustre/llama_weights/8B-llama3-hf', n_speculator_heads=3, speculator_width=4096, stage2_start_step=15000, stage2_prompt_length=64, stage2_batch_size=12, stage2_seq_length=256)
bFloat16 enabled for mixed precision - using bfSixteen policy
Sharding strategy = hsdp
[2024-08-28 18:57:32,902] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103185 closing signal SIGTERM
[2024-08-28 18:57:32,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103186 closing signal SIGTERM
[2024-08-28 18:57:32,929] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103187 closing signal SIGTERM
[2024-08-28 18:57:32,963] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103188 closing signal SIGTERM
[2024-08-28 18:57:32,998] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103190 closing signal SIGTERM
[2024-08-28 18:57:33,017] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103191 closing signal SIGTERM
[2024-08-28 18:57:33,050] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103192 closing signal SIGTERM
[2024-08-28 18:57:36,013] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 4 (pid: 3103189) of binary: /home/chenyun/miniconda3/envs/fms_fsdp/bin/python
Traceback (most recent call last):
File "/home/chenyun/miniconda3/envs/fms_fsdp/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_training.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-08-28_18:57:32
host : hgx030.scc.idea
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 3103189)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3103189
slurmstepd-hgx030: error: Detected 1 oom-kill event(s) in step 78214.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
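The exitcode -9 / SIGKILL together with the slurmstepd oom-kill line points at host RAM rather than GPU memory: with low_cpu_fsdp=False, each of the 8 ranks presumably materializes the full 13B checkpoint in CPU memory before FSDP shards it, which at bf16 is on the order of 26 GB per rank (~208 GB total), just above the ~192 GB the job requested. Two tweaks that might help, as a sketch only (the --mem value and the exact low_cpu_fsdp behaviour in fms-fsdp are assumptions, not verified on this cluster):

# Option 1: request more host memory from Slurm (--mem=0 means "all memory on the node")
#SBATCH --mem=0

# Option 2: flip the flag in MODEL_ARGS so only one rank loads the checkpoint on CPU
#           (assumes low_cpu_fsdp in fms-fsdp does rank0-only loading, as the name suggests)
    --low_cpu_fsdp=True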