When I train a custom model, training terminates after one round of training and validation with the following error:
```
Traceback (most recent call last):
  File "tools/train.py", line 180, in <module>
    main()
  File "tools/train.py", line 169, in main
    train_segmentor(
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/apis/train.py", line 167, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/models/segmentors/base.py", line 139, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/models/segmentors/base.py", line 208, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe1c7a858b2 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fe1c7cd7982 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe1c7a70b7d in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5f65b2 (0x7fe211dd05b2 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5f6666 (0x7fe211dd0666 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #23: __libc_start_main + 0xe7 (0x7fe242166bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
What might have caused this?
Thank you for your reply.
Perhaps it is a conflict between num_classes in the config and the actual number of classes in your dataset. num_classes in the config should be the number of foreground classes + 1 (background).
If that does not work, you can search the related issues for more help.
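This mismatch typically shows up when a ground-truth label value ends up greater than or equal to num_classes, and the GPU cross-entropy kernel then indexes out of bounds, which can surface as exactly this kind of illegal memory access. Below is a minimal sketch of the override in an mmsegmentation-style config; the base config path and the foreground class count are placeholders for illustration, not values from this issue:

```python
# Minimal sketch: set num_classes in a custom mmsegmentation config.
# The base config path and the class count are hypothetical placeholders.
_base_ = './pspnet_r50-d8_512x512_80k_ade20k.py'

num_foreground = 4                 # hypothetical number of foreground classes
num_classes = num_foreground + 1   # + 1 for the background class

model = dict(
    # Label values in the masks must fall in 0..num_classes-1
    # (plus the ignore index, 255 by default).
    decode_head=dict(num_classes=num_classes),
    # Keep the auxiliary head consistent with the decode head.
    auxiliary_head=dict(num_classes=num_classes),
)
```

If the error persists, checking that the mask pixel values really fall in [0, num_classes - 1] (apart from the ignore index) usually pins down the problem.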
Thanks for your advice. I had indeed set the wrong num_classes, and it works well now.
Perhaps this question is answered by #1295?