[BUG] Fine-tuning all parameters of the VL model fails: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: an illegal memory access was encountered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. #349

Open
chuangzhidan opened this issue Apr 7, 2024 · 1 comment


chuangzhidan commented Apr 7, 2024

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

Current Behavior

**The error comes mainly from --fix_vit False**; with --fix_vit True, training runs normally.
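For context on what this flag changes, here is a minimal sketch of what a `fix_vit` switch typically toggles in a Qwen-VL-style `finetune.py` (the attribute path `model.transformer.visual` and the helper name are assumptions for illustration, not taken from this repository):

```python
# Hypothetical sketch of how a --fix_vit flag is usually wired up.
# `model.transformer.visual` (the ViT tower) is an assumed attribute path.
def apply_fix_vit(model, fix_vit: bool):
    visual = model.transformer.visual
    # fix_vit=True freezes the ViT; fix_vit=False makes all of its parameters trainable.
    visual.requires_grad_(not fix_vit)
    return model
```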

RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb49b474617 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb49b42f98d in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb49b530128 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x16e76 (0x7fb49b4f8e76 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x19bad (0x7fb49b4fbbad in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x19fcd (0x7fb49b4fbfcd in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x510d36 (0x7fb4de26dd36 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x55ca7 (0x7fb49b459ca7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fb49b451cb3 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb49b451e49 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #10: + 0x7c18f8 (0x7fb4de51e8f8 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb4de51eca5 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x1586a7 (0x557bd830d6a7 in /root/miniconda3/bin/python)
frame #13: _PyModule_ClearDict + 0x714 (0x557bd8363364 in /root/miniconda3/bin/python)
frame #14: PyImport_Cleanup + 0x537 (0x557bd838af47 in /root/miniconda3/bin/python)
frame #15: Py_FinalizeEx + 0x79 (0x557bd83bca49 in /root/miniconda3/bin/python)
frame #16: Py_RunMain + 0x183 (0x557bd83be893 in /root/miniconda3/bin/python)
frame #17: Py_BytesMain + 0x39 (0x557bd83beca9 in /root/miniconda3/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fb51cd12083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: + 0x1e21c7 (0x557bd83971c7 in /root/miniconda3/bin/python)

Traceback (most recent call last):
File "finetune.py", line 367, in <module>
train()
File "finetune.py", line 360, in train
trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1675, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
result = self._prepare_deepspeed(*args)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1234, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1497, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
self.initialize_optimizer_states()
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 647, in initialize_optimizer_states
self.optimizer.step()
File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2024-04-02 22:34:36,014] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70491 closing signal SIGTERM
[2024-04-02 22:34:36,015] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70493 closing signal SIGTERM
[2024-04-02 22:34:36,016] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70494 closing signal SIGTERM
[2024-04-02 22:34:40,441] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: xxxxx) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : xxxxxx
host : xxxxxx
rank : 1 (local_rank: 1)
exitcode : -6 (pid: xxxxx)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID xxxxx
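
The Python stack above points at DeepSpeed's dummy optimizer step during ZeRO initialization, but CUDA errors are reported asynchronously, so the faulting kernel may be elsewhere. A hedged debugging sketch (the environment variable is standard PyTorch/CUDA behavior; the parameter check is only illustrative):

```python
import os
# Force synchronous kernel launches so the traceback points at the kernel
# that actually faults; must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def check_trainable_params(model):
    """Illustrative sanity check before handing the model to DeepSpeed:
    trainable parameters should sit on the local GPU and share one dtype."""
    devices = {p.device for p in model.parameters() if p.requires_grad}
    dtypes = {p.dtype for p in model.parameters() if p.requires_grad}
    print(f"trainable param devices: {devices}, dtypes: {dtypes}")
```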

Expected Behavior

No response

Steps To Reproduce

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

GPUS_PER_NODE=3
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=XXXX

MODEL=xxxxx
DATA="xxxxxx.json"

DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
nohup torchrun $DISTRIBUTED_ARGS finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --fix_vit False \
  --output_dir output/XXXXXX \
  --num_train_epochs 4 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 40 \
  --save_total_limit 1 \
  --learning_rate 1e-5 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 2048 \
  --gradient_checkpointing True \
  --lazy_preprocess True \
  --deepspeed finetune/ds_config_zero2.json &
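
The referenced finetune/ds_config_zero2.json is not included in this report. For readers, an illustrative ZeRO stage 2 configuration expressed as a Python dict (typical fields with HF-Trainer-style "auto" values; not the actual file used here):

```python
# Illustrative ZeRO stage 2 config; NOT the contents of the
# ds_config_zero2.json used in this report.
ds_config_zero2 = {
    "bf16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```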

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

@shuxjweb

I have encountered the same issue:

File "finetune.py", line 366, in train
trainer.train()
File "/usr/local/python/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/python/lib/python3.8/site-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/usr/local/python/lib/python3.8/site-packages/accelerate/accelerator.py", line 1296, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/python/lib/python3.8/site-packages/accelerate/accelerator.py", line 1771, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 380, in __init__
self.create_reduce_and_remove_grad_hooks()
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1120, in create_reduce_and_remove_grad_hooks
param.all_gather()
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1116, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1452, in _all_gather
ret_value = self._allgather_params(all_gather_list, hierarchy=hierarchy)
File "/usr/local/python/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1846, in _allgather_params
partitions[i].narrow(0, offset, param_numel).copy_(param.ds_tensor.data)
RuntimeError: CUDA error: unspecified launch failure
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
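
A note for debugging: both traces crash inside DeepSpeed CUDA ops before any training step runs. A hedged isolation test (assumes DeepSpeed's fused Adam extension builds in the environment, which the first traceback suggests) is to run FusedAdam on dummy parameters on each visible GPU, to separate a driver/CUDA/op-build problem from a model-specific one:

```python
# Minimal isolation test for DeepSpeed's FusedAdam, independent of finetune.py.
# If this also raises an illegal memory access, the environment (driver, CUDA,
# op build) is the more likely culprit than the fine-tuning script.
import torch
from deepspeed.ops.adam import FusedAdam

for dev in range(torch.cuda.device_count()):
    params = [torch.nn.Parameter(torch.randn(1024, 1024, device=f"cuda:{dev}"))
              for _ in range(4)]
    opt = FusedAdam(params, lr=1e-5)
    for p in params:
        p.grad = torch.zeros_like(p)
    opt.step()
    torch.cuda.synchronize(dev)
    print(f"cuda:{dev}: FusedAdam step OK")
```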
