#!/bin/bash
#SBATCH --job-name=spec_dec_skill
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:hgx:8
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=6G
#SBATCH -p pos
#SBATCH -o ./log/spec_training.log
#SBATCH -w hgx030
# On AWS, the EFA and OFI paths enable NCCL to use optimized networking.
export LD_LIBRARY_PATH=/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:$LD_LIBRARY_PATH
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
MODEL_ARGS="
--model_variant=13b
--use_dummy_dataset=False
--ckpt_load_path=/cognitive_comp/chenyun/vllm_q/model_path/switch/0701_skill_summary/global_step3100_hf_bf16
--ckpt_save_path=/cognitive_comp/chenyun/fms-fsdp/model_path/switch/0701_skill_summary/spec_dec_skill
--data_path=/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo
--datasets=/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo/selected.json
--fsdp_activation_checkpointing=False
--selective_checkpointing=1
--sharding_strategy=hsdp
--low_cpu_fsdp=False
--batch_size=2
--report_interval=200
--checkpoint_interval=20000
--use_torch_compile=False
--use_profiler=False
"
torchrun \
    --nnodes=$SLURM_NTASKS \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=8 \
    --master_addr=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) \
    --master_port="12234" \
    main_training.py \
    ${MODEL_ARGS}
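For reference, if I read the #SBATCH directives right, --cpus-per-task=32 with --mem-per-cpu=6G caps the job's cgroup at roughly 192 GB of host RAM. The snippet below is only a rough way to confirm that limit and watch peak usage while the job runs (the cgroup path assumes cgroup v1; v2 exposes memory.max instead):

# Rough memory-budget check: 32 cpus * 6G mem-per-cpu => ~192 GB cgroup limit for this job
squeue -j "$SLURM_JOB_ID" -o "%i %m"                                   # memory Slurm granted
cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes
sstat -j "${SLURM_JOB_ID}.batch" --format=JobID,MaxRSS,AveRSS          # peak RSS so far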
I submitted the above train.sh to the HGX node with sbatch, but it failed with the following error:
--> running with these configs train_config(model_variant='13b', ckpt_load_path='/cognitive_comp/chenyun/vllm_q/model_path/switch/0701_skill_summary/global_step3100_hf_bf16', ckpt_save_path='/cognitive_comp/chenyun/fms-fsdp/model_path/switch/0701_skill_summary/spec_dec_skill', use_dummy_dataset=False, data_path='/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo', seq_length=8192, sep_token=1, datasets='/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo/selected.json', weights='7700,500,550,28,17,22,25,8,100,500,175,250,100,25', logical_shards=800, mixed_precision=True, fsdp_activation_checkpointing=False, selective_checkpointing=1, sharding_strategy='hsdp', low_cpu_fsdp=False, seed=2023, batch_size=2, num_steps=2000000, learning_rate=0.0003, grad_clip_thresh=1.0, use_profiler=False, profiler_rank0_only=True, report_interval=200, checkpoint_interval=20000, tracker=None, tracker_dir='/lustre/lchu/fms-fsdp', tracker_project_name='llama', tracker_run_id=None, use_torch_compile=False, model_path='/lustre/llama_weights/8B-llama3-hf', n_speculator_heads=3, speculator_width=4096, stage2_start_step=15000, stage2_prompt_length=64, stage2_batch_size=12, stage2_seq_length=256)
bFloat16 enabled for mixed precision - using bfSixteen policy
Sharding strategy = hsdp
[2024-08-28 18:57:32,902] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103185 closing signal SIGTERM
[2024-08-28 18:57:32,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103186 closing signal SIGTERM
[2024-08-28 18:57:32,929] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103187 closing signal SIGTERM
[2024-08-28 18:57:32,963] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103188 closing signal SIGTERM
[2024-08-28 18:57:32,998] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103190 closing signal SIGTERM
[2024-08-28 18:57:33,017] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103191 closing signal SIGTERM
[2024-08-28 18:57:33,050] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103192 closing signal SIGTERM
[2024-08-28 18:57:36,013] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 4 (pid: 3103189) of binary: /home/chenyun/miniconda3/envs/fms_fsdp/bin/python
Traceback (most recent call last):
File "/home/chenyun/miniconda3/envs/fms_fsdp/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_training.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-08-28_18:57:32
host : hgx030.scc.idea
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 3103189)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3103189
slurmstepd-hgx030: error: Detected 1 oom-kill event(s) in step 78214.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
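The exitcode -9 / SIGKILL together with the slurmstepd oom-kill line points at host RAM rather than GPU memory: with low_cpu_fsdp=False, each of the 8 ranks presumably materializes the full 13B checkpoint in CPU memory before FSDP shards it, which at bf16 is on the order of 26 GB per rank (~208 GB total), just above the ~192 GB the job requested. Two tweaks that might help, as a sketch only (the --mem value and the exact low_cpu_fsdp behaviour in fms-fsdp are assumptions, not verified on this cluster):

# Option 1: request more host memory from Slurm (--mem=0 means "all memory on the node")
#SBATCH --mem=0

# Option 2: flip the flag in MODEL_ARGS so only one rank loads the checkpoint on CPU
#           (assumes low_cpu_fsdp in fms-fsdp does rank0-only loading, as the name suggests)
    --low_cpu_fsdp=True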