[BUG] Full-parameter fine-tuning of the VL model fails: terminate called after throwing an instance of 'c10::Error'; what(): CUDA error: an illegal memory access was encountered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
#349 · Open · 2 tasks done
chuangzhidan opened this issue on Apr 7, 2024 · 1 comment
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
The error occurs mainly with --fix_vit False; with --fix_vit True, training runs normally.
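For context before the full log: a minimal sketch of what --fix_vit presumably gates in finetune.py (the attribute path model.transformer.visual is an assumption, not taken from this repository's exact source). With --fix_vit True the visual tower is frozen and its parameters never reach the DeepSpeed FusedAdam optimizer; with --fix_vit False they stay trainable, which is exactly where the crash appears.

def maybe_freeze_vit(model, fix_vit: bool) -> None:
    # Hypothetical helper mirroring the presumed --fix_vit handling: freezing
    # the visual tower keeps its parameters out of the optimizer state that
    # DeepSpeed builds in initialize_optimizer_states().
    if fix_vit and hasattr(model, "transformer") and hasattr(model.transformer, "visual"):
        model.transformer.visual.requires_grad_(False)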
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb49b474617 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb49b42f98d in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb49b530128 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x16e76 (0x7fb49b4f8e76 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x19bad (0x7fb49b4fbbad in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x19fcd (0x7fb49b4fbfcd in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x510d36 (0x7fb4de26dd36 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x55ca7 (0x7fb49b459ca7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fb49b451cb3 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb49b451e49 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #10: + 0x7c18f8 (0x7fb4de51e8f8 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb4de51eca5 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x1586a7 (0x557bd830d6a7 in /root/miniconda3/bin/python)
frame #13: _PyModule_ClearDict + 0x714 (0x557bd8363364 in /root/miniconda3/bin/python)
frame #14: PyImport_Cleanup + 0x537 (0x557bd838af47 in /root/miniconda3/bin/python)
frame #15: Py_FinalizeEx + 0x79 (0x557bd83bca49 in /root/miniconda3/bin/python)
frame #16: Py_RunMain + 0x183 (0x557bd83be893 in /root/miniconda3/bin/python)
frame #17: Py_BytesMain + 0x39 (0x557bd83beca9 in /root/miniconda3/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fb51cd12083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: + 0x1e21c7 (0x557bd83971c7 in /root/miniconda3/bin/python)
Traceback (most recent call last):
  File "finetune.py", line 367, in <module>
    train()
  File "finetune.py", line 360, in train
    trainer.train()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1675, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1234, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1497, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
    self.initialize_optimizer_states()
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 647, in initialize_optimizer_states
    self.optimizer.step()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
    multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-04-02 22:34:36,014] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70491 closing signal SIGTERM
[2024-04-02 22:34:36,015] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70493 closing signal SIGTERM
[2024-04-02 22:34:36,016] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70494 closing signal SIGTERM
[2024-04-02 22:34:40,441] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: xxxxx) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : xxxxxx
  host      : xxxxxx
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: xxxxx)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID xxxxx
File "finetune.py", line 366, in train
trainer.train()
File "/usr/local/python/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/python/lib/python3.8/site-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/usr/local/python/lib/python3.8/site-packages/accelerate/accelerator.py", line 1296, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/python/lib/python3.8/site-packages/accelerate/accelerator.py", line 1771, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/init.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 380, in init
self.create_reduce_and_remove_grad_hooks()
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1120, in create_reduce_and_remove_grad_hooks
param.all_gather()
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1116, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1452, in _all_gather
ret_value = self._allgather_params(all_gather_list, hierarchy=hierarchy)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1846, in allgather_params
partitions[i].narrow(0, offset, param_numel).copy(param.ds_tensor.data)
RuntimeError: CUDA error: unspecified launch failure
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
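One way to narrow this down (a minimal diagnostic sketch; model here is assumed to be the loaded Qwen-VL model just before the Trainer/DeepSpeed setup in finetune.py): dump every trainable parameter's device and dtype, since the fused multi-tensor Adam kernel typically raises illegal memory accesses when, once the ViT is unfrozen, some trainable parameter sits on an unexpected device or in an unexpected dtype.

import torch

def dump_trainable_params(model: torch.nn.Module) -> None:
    # Print each trainable parameter with its device, dtype and shape so that
    # anything left on the CPU or in an unexpected dtype stands out before
    # FusedAdam's multi-tensor kernel touches it.
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(name, param.device, param.dtype, tuple(param.shape))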
期望行为 | Expected Behavior
No response
复现方法 | Steps To Reproduce
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

GPUS_PER_NODE=3
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=XXXX
MODEL=xxxxx
DATA="xxxxxx.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NNODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
"

nohup torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --fix_vit False \
    --output_dir output/XXXXXX \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 40 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero2.json &
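To localize the faulting kernel more precisely, the run can be repeated with synchronous CUDA launches. A minimal sketch, assuming these lines are added at the very top of finetune.py before torch is imported (setting the variable later has no effect):

import os

# Force synchronous kernel launches so the Python stack trace points at the
# kernel that actually faulted, instead of a later, unrelated call such as
# the optimizer step. Must run before any CUDA context is created.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"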
运行环境 | Environment
备注 | Anything else?
No response