train errors #1379

Closed

Lsz-20 opened this issue Mar 15, 2022 · 2 comments
Lsz-20 commented Mar 15, 2022

train errors

When I train a custom model, training terminates after one round of training and validation with the following error:
```
Traceback (most recent call last):
  File "tools/train.py", line 180, in <module>
    main()
  File "tools/train.py", line 169, in main
    train_segmentor(
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/apis/train.py", line 167, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/models/segmentors/base.py", line 139, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/models/segmentors/base.py", line 208, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe1c7a858b2 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fe1c7cd7982 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe1c7a70b7d in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5f65b2 (0x7fe211dd05b2 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5f6666 (0x7fe211dd0666 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #23: __libc_start_main + 0xe7 (0x7fe242166bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
What might have caused this?
Thank you for your reply.

MengzhangLI (Contributor) commented:

Perhaps it is a conflict between num_classes in the config and the actual number of classes in the dataset. num_classes in the config should be the number of foreground classes + 1 (background).

If that does not work, you can search the related issues for more help.
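As a minimal sketch of where num_classes is set in an MMSegmentation config (the base config path and the class count of 5, i.e. 4 foreground classes + background, are illustrative assumptions, not from this thread):

```python
# Hypothetical override in a custom config file.
# Assumes 4 foreground classes + 1 background class => num_classes = 5.
_base_ = './my_base_config.py'  # placeholder for the base config you inherit from

model = dict(
    decode_head=dict(num_classes=5),
    # If the model also has an auxiliary head, keep it consistent as well.
    auxiliary_head=dict(num_classes=5),
)
```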

MengzhangLI self-assigned this Mar 15, 2022
Lsz-20 (Author) commented Mar 16, 2022

> Perhaps it is a conflict between num_classes in the config and the actual number of classes in the dataset. num_classes in the config should be the number of foreground classes + 1 (background).
>
> If that does not work, you can search the related issues for more help.

Thanks for your advice ~ I had indeed set the wrong num_classes, and it works well now.
Perhaps this question could be answered as well? #1295
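For anyone hitting the same error, a small sketch (not from this thread) for checking how many label IDs the annotation masks actually contain, so num_classes can be set to match; the dataset path and the use of 255 as the ignore index are assumptions for a typical custom dataset:

```python
# Count the distinct label values appearing in the ground-truth masks.
import glob

import numpy as np
from PIL import Image

label_values = set()
for path in glob.glob('data/my_dataset/ann_dir/train/*.png'):  # hypothetical path
    mask = np.array(Image.open(path))           # single-channel label map
    label_values.update(np.unique(mask).tolist())

label_values.discard(255)  # 255 is commonly used as the ignore index
print(f'labels found: {sorted(label_values)} -> num_classes = {len(label_values)}')
```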
