
out-of-memory when specu-training a llama2-13b model #113

Open
YunChen1227 opened this issue Aug 28, 2024 · 1 comment

#!/bin/bash
#SBATCH --job-name=spec_dec_skill
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:hgx:8
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=6G
#SBATCH -p pos
#SBATCH -o ./log/spec_training.log
#SBATCH -w hgx030

# On AWS, the EFA and OFI paths enable NCCL to use optimized networking.

export LD_LIBRARY_PATH=/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:$LD_LIBRARY_PATH

export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

MODEL_ARGS="
--model_variant=13b
--use_dummy_dataset=False
--ckpt_load_path=/cognitive_comp/chenyun/vllm_q/model_path/switch/0701_skill_summary/global_step3100_hf_bf16
--ckpt_save_path=/cognitive_comp/chenyun/fms-fsdp/model_path/switch/0701_skill_summary/spec_dec_skill
--data_path=/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo
--datasets=/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo/selected.json
--fsdp_activation_checkpointing=False
--selective_checkpointing=1
--sharding_strategy=hsdp
--low_cpu_fsdp=False
--batch_size=2
--report_interval=200
--checkpoint_interval=20000
--use_torch_compile=False
--use_profiler=False
"

torchrun \
    --nnodes=$SLURM_NTASKS \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=8 \
    --master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) \
    --master_port="12234" \
    main_training.py \
    ${MODEL_ARGS}
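
For reference, a quick back-of-envelope check of the host-memory budget this script requests (my own arithmetic, assuming the cgroup limit for the job step is --cpus-per-task × --mem-per-cpu and that all 8 ranks share it):

# Host-RAM budget implied by the SBATCH directives above (assumption: the
# step's cgroup limit is cpus-per-task * mem-per-cpu, shared by all ranks).
cpus_per_task = 32
mem_per_cpu_gib = 6
ranks_per_node = 8  # --nproc_per_node=8

budget_gib = cpus_per_task * mem_per_cpu_gib
print(f"host-RAM budget per node: {budget_gib} GiB")                       # 192 GiB
print(f"budget per rank:          {budget_gib / ranks_per_node:.0f} GiB")  # 24 GiB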

I submitted the train.sh above to the HGX node with sbatch, but the job failed with the following error:

--> running with these configs train_config(model_variant='13b', ckpt_load_path='/cognitive_comp/chenyun/vllm_q/model_path/switch/0701_skill_summary/global_step3100_hf_bf16', ckpt_save_path='/cognitive_comp/chenyun/fms-fsdp/model_path/switch/0701_skill_summary/spec_dec_skill', use_dummy_dataset=False, data_path='/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo', seq_length=8192, sep_token=1, datasets='/cognitive_comp/chenyun/vllm_q/sftdata/0701_summary_emo/selected.json', weights='7700,500,550,28,17,22,25,8,100,500,175,250,100,25', logical_shards=800, mixed_precision=True, fsdp_activation_checkpointing=False, selective_checkpointing=1, sharding_strategy='hsdp', low_cpu_fsdp=False, seed=2023, batch_size=2, num_steps=2000000, learning_rate=0.0003, grad_clip_thresh=1.0, use_profiler=False, profiler_rank0_only=True, report_interval=200, checkpoint_interval=20000, tracker=None, tracker_dir='/lustre/lchu/fms-fsdp', tracker_project_name='llama', tracker_run_id=None, use_torch_compile=False, model_path='/lustre/llama_weights/8B-llama3-hf', n_speculator_heads=3, speculator_width=4096, stage2_start_step=15000, stage2_prompt_length=64, stage2_batch_size=12, stage2_seq_length=256)
bFloat16 enabled for mixed precision - using bfSixteen policy
Sharding strategy = hsdp
[2024-08-28 18:57:32,902] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103185 closing signal SIGTERM
[2024-08-28 18:57:32,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103186 closing signal SIGTERM
[2024-08-28 18:57:32,929] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103187 closing signal SIGTERM
[2024-08-28 18:57:32,963] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103188 closing signal SIGTERM
[2024-08-28 18:57:32,998] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103190 closing signal SIGTERM
[2024-08-28 18:57:33,017] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103191 closing signal SIGTERM
[2024-08-28 18:57:33,050] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3103192 closing signal SIGTERM
[2024-08-28 18:57:36,013] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 4 (pid: 3103189) of binary: /home/chenyun/miniconda3/envs/fms_fsdp/bin/python
Traceback (most recent call last):
File "/home/chenyun/miniconda3/envs/fms_fsdp/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/chenyun/miniconda3/envs/fms_fsdp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_training.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-28_18:57:32
host : hgx030.scc.idea
rank : 4 (local_rank: 4)
exitcode : -9 (pid: 3103189)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3103189

slurmstepd-hgx030: error: Detected 1 oom-kill event(s) in step 78214.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
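
Exit code -9 (SIGKILL) from the cgroup out-of-memory handler points at host (CPU) RAM rather than GPU memory. A rough sketch of why the requested budget may not be enough, under the assumption that with low_cpu_fsdp=False every rank materializes a full bf16 copy of the 13B checkpoint in host memory before FSDP shards it:

# Rough CPU-memory estimate for checkpoint loading (assumption: no
# meta-device / rank-0-only loading when low_cpu_fsdp=False).
params = 13e9            # llama2-13b parameter count, approximate
bytes_per_param = 2      # bf16
ranks = 8

per_rank_gib = params * bytes_per_param / 1024**3
total_gib = per_rank_gib * ranks
print(f"bf16 weights per rank: {per_rank_gib:.0f} GiB")  # ~24 GiB
print(f"all 8 ranks together:  {total_gib:.0f} GiB")     # ~194 GiB vs. ~192 GiB cgroup limit

If that assumption holds, the bf16 weights alone roughly saturate the 192 GiB cgroup limit before per-process CUDA context, speculator, and dataloader overhead are counted, so raising --mem-per-cpu or setting --low_cpu_fsdp=True might be worth trying.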
