train errors #1379

Closed

Lsz-20 opened this issue Mar 15, 2022 · 2 comments
Lsz-20 commented Mar 15, 2022

train errors

When I train a custom model, training terminates after one round of training and validation with the following error:
```
Traceback (most recent call last):
  File "tools/train.py", line 180, in <module>
    main()
  File "tools/train.py", line 169, in main
    train_segmentor(
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/apis/train.py", line 167, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/models/segmentors/base.py", line 139, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/nfs/my/lsz/mmsegmentation-0.20.0/mmseg/models/segmentors/base.py", line 208, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe1c7a858b2 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fe1c7cd7982 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe1c7a70b7d in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5f65b2 (0x7fe211dd05b2 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5f6666 (0x7fe211dd0666 in /opt/conda/envs/py38_torch1.7/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #23: __libc_start_main + 0xe7 (0x7fe242166bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
What might have caused this?
Thank you for your reply.

MengzhangLI (Contributor) commented:

Perhaps it is a conflict between num_classes in the config and the actual number of classes in the dataset. num_classes in the config should be the number of foreground classes + 1 (background).

If that does not work, you can search the related issues for more help.
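As a minimal sketch of where num_classes is set in an MMSegmentation config (the base config path and the class count of 5, i.e. 4 foreground classes + background, are illustrative assumptions, not from this thread):

```python
# Hypothetical override in a custom config file.
# Assumes 4 foreground classes + 1 background class => num_classes = 5.
_base_ = './my_base_config.py'  # placeholder for the base config you inherit from

model = dict(
    decode_head=dict(num_classes=5),
    # If the model also has an auxiliary head, keep it consistent as well.
    auxiliary_head=dict(num_classes=5),
)
```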

MengzhangLI self-assigned this Mar 15, 2022
Lsz-20 (Author) commented Mar 16, 2022

> Perhaps it is a conflict between num_classes in the config and the actual number of classes in the dataset. num_classes in the config should be the number of foreground classes + 1 (background).
>
> If that does not work, you can search the related issues for more help.

Thanks for your advice ~ I had indeed set the wrong num_classes, and it works well now.
Perhaps this question could be answered as well? #1295
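For anyone hitting the same error, a small sketch (not from this thread) for checking how many label IDs the annotation masks actually contain, so num_classes can be set to match; the dataset path and the use of 255 as the ignore index are assumptions for a typical custom dataset:

```python
# Count the distinct label values appearing in the ground-truth masks.
import glob

import numpy as np
from PIL import Image

label_values = set()
for path in glob.glob('data/my_dataset/ann_dir/train/*.png'):  # hypothetical path
    mask = np.array(Image.open(path))           # single-channel label map
    label_values.update(np.unique(mask).tolist())

label_values.discard(255)  # 255 is commonly used as the ignore index
print(f'labels found: {sorted(label_values)} -> num_classes = {len(label_values)}')
```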
