ImportError: And RuntimeError: Broken pipe #1077

Closed
chenhaiwen opened this issue Nov 25, 2021 · 1 comment

@chenhaiwen

I am training OCRNet on my own dataset.
With a single GPU, training only gets through one epoch (I set workflow = [('train', 1), ('val', 1)]). After the first round of training and validation, an out-of-memory error is reported (error 1). When I try to train with multiple GPUs, I get a second error (error 2). Thanks for your help.

error 1:

2021-11-25 20:07:21,165 - mmseg - INFO -
+------+-------+------+
| aAcc | mIoU | mAcc |
+------+-------+------+
| 77.4 | 44.57 | 65.8 |
+------+-------+------+
2021-11-25 20:07:21,169 - mmseg - INFO - Exp name: ocrnet_hr18_512x512_gaofen.py
2021-11-25 20:07:21,170 - mmseg - INFO - Epoch(val) [1][1497] aAcc: 0.7740, mIoU: 0.4457, mAcc: 0.6580, IoU.building land: 0.6174, IoU.farmland: 0.623IoU.forest: 0.6644, IoU.grassland: 0.1251, IoU.water: 0.2242, IoU.background: 0.4196, Acc.building land: 0.7475, Acc.farmland: 0.8020, Acc.forest: 0.852Acc.grassland: 0.1387, Acc.water: 0.5086, Acc.background: 0.8990
2021-11-25 20:08:30,407 - mmseg - INFO - Exp name: ocrnet_hr18_512x512_gaofen.py
2021-11-25 20:08:30,408 - mmseg - INFO - Epoch(val) [1][187] decode_0.loss_ce: 0.2673, decode_0.acc_seg: 57.0989, decode_1.loss_ce: 0.6680, decode_1._seg: 57.4639, loss: 0.9352
Traceback (most recent call last):
File "tools/train.py", line 185, in
main()
File "tools/train.py", line 181, in main
meta=meta)
File "/home/chenhaiwen/project/mmsegmentation/mmseg/apis/train.py", line 120, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], *kwargs)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
runner.outputs['loss'].backward()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/autograd/init.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 10.92 GiB total capacity; 8.39 GiB already allocated; 313.00 MiB free; 10.01 GiB erved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fec8874c77d in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-pages/torch/lib/libc10.so)
frame #1: + 0x20626 (0x7fec889a4626 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.s
frame #2: + 0x214f4 (0x7fec889a54f4 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.s
frame #3: + 0x21b81 (0x7fec889a5b81 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.s
frame #4: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optionalc10::MemoryFormat) + 0x249 (0x7fec8b8b4c79 in /home/chenwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xd25dc9 (0x7fec898d7dc9 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cudo)
frame #6: + 0xd3fbf7 (0x7fec898f1bf7 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cudo)
frame #7: + 0xe450dd (0x7fecbba0b0dd in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu)
frame #8: + 0xe453f7 (0x7fecbba0b3f7 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu)
frame #9: at::empty(c10::ArrayRef, c10::TensorOptions const&, c10::optionalc10::MemoryFormat) + 0xfa (0x7fecbbb15e7a in /home/chenhaiwen/anacondenvs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0xcabd93 (0x7fec8985dd93 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #11: at::native::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef<lo, c10::ArrayRef, long, bool, bool) + 0xb2 (0x7fec8985e5d2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libth_cuda.so)
frame #12: + 0xd117db (0x7fec898c37db in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #13: + 0xd415f8 (0x7fec898f35f8 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #14: at::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10rrayRef, long, bool, bool) + 0x1ad (0x7fecbbb18ced in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpo)
frame #15: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10rrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x223 (0x7fec8985cca3 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packageorch/lib/libtorch_cuda.so)
frame #16: + 0xd118c5 (0x7fec898c38c5 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #17: + 0xd41654 (0x7fec898f3654 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #18: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRlong>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7fecbbb276a2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch//libtorch_cpu.so)
frame #19: + 0x2c250c2 (0x7fecbd7eb0c2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cso)
frame #20: + 0x2c39684 (0x7fecbd7ff684 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cso)
frame #21: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRlong>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7fecbbb276a2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch//libtorch_cpu.so)
frame #22: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocatorat::Tensor >&&) + 0x258 (0x7fecbd672098 ihome/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0x30d1017 (0x7fecbdc97017 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cso)
frame #24: torch::autograd::Engine::evaluate_function(std::shared_ptrtorch::autograd::GraphTask&, torch::autograd::Node, torch::autograd::InputBufferstd::shared_ptrtorch::autograd::ReadyQueue const&) + 0x1400 (0x7fecbdc92860 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/ch/lib/libtorch_cpu.so)
frame #25: torch::autograd::Engine::thread_main(std::shared_ptrtorch::autograd::GraphTask const&) + 0x451 (0x7fecbdc93401 in /home/chenhaiwen/anacondanvs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #26: torch::autograd::Engine::thread_init(int, std::shared_ptrtorch::autograd::ReadyQueue const&, bool) + 0x89 (0x7fecbdc8b579 in /home/chenhaiwanaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #27: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptrtorch::autograd::ReadyQueue const&, bool) + 0x4a (0x7fecc1fba99a in me/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #28: + 0xc9039 (0x7fecc4af2039 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/../../../../ibstdc++.so.6)
frame #29: + 0x9669 (0x7fece72ff669 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #30: clone + 0x43 (0x7fece7227323 in /lib/x86_64-linux-gnu/libc.so.6)

error 2:
Traceback (most recent call last):
File "tools/train.py", line 21, in
from mmseg.apis import set_random_seed, train_segmentor
File "/home/chenhaiwen/project/mmsegmentation/mmseg/apis/init.py", line 2, in
from .inference import inference_segmentor, init_segmentor, show_result_pyplot
File "/home/chenhaiwen/project/mmsegmentation/mmseg/apis/inference.py", line 2, in
import matplotlib.pyplot as plt
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/matplotlib/pyplot.py", line 2500, in
switch_backend(rcParams["backend"])
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/matplotlib/pyplot.py", line 288, in switch_backend
newbackend, required_framework, current_framework))
ImportError: Cannot load backend 'TkAgg' which requires the 'tk' interactive framework, as 'headless' is currently running
Traceback (most recent call last):
File "tools/train.py", line 185, in
main()
File "tools/train.py", line 103, in main
init_dist(args.launcher, **cfg.dist_params)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 20, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 34, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
KeyboardInterrupt
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
Traceback (most recent call last):
File "tools/train.py", line 185, in
main()
File "tools/train.py", line 145, in main
model.init_weights()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 117, in init_weights
m.init_weights()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 106, in init_weights
initialize(self, self.init_cfg)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 612, in initialize
_initialize(module, cp_cfg)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 517, in _initialize
func(module)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 494, in call
logger=logger)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 513, in load_checkpoint
checkpoint = _load_checkpoint(filename, map_location, logger)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 451, in _load_checkpoint
return CheckpointLoader.load_checkpoint(filename, map_location, logger)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 244, in load_checkpoint
return checkpoint_loader(filename, map_location)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 405, in load_from_openmmlab
checkpoint = load_from_http(model_url, map_location=map_location)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 286, in load_from_http
torch.distributed.barrier()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1710, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
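
The first traceback in error 2 is a matplotlib backend failure (TkAgg requested on a headless machine), not a memory error. Whether it is related to the later Broken pipe is not established in this thread. A minimal sketch of one way to avoid it, assuming the process has no display and only needs to render figures to files: select the non-interactive Agg backend through the MPLBACKEND environment variable before pyplot is first imported. Where this goes in the actual training entry script is an assumption.

```python
import os

# Select the non-interactive Agg backend. This must happen before the first
# `import matplotlib.pyplot` anywhere in the process, so in practice it belongs
# at the very top of the entry script (or exported in the launching shell).
os.environ["MPLBACKEND"] = "Agg"

try:
    import matplotlib
    active_backend = matplotlib.get_backend()
except ImportError:
    # matplotlib is not installed in this environment; nothing to check.
    active_backend = None

print("matplotlib backend:", active_backend)
```

Exporting MPLBACKEND=Agg in the shell that launches training should have the same effect and also reaches every worker process started by a distributed launcher.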

@MengzhangLI
Contributor

It is caused by limited GPU memory.
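
The figures in the error 1 message can be read off directly. The sketch below parses that message (with the truncated word "erved" in the log read as "reserved") and compares the requested allocation with what was free on the device and with what PyTorch had cached but not handed out. The helper names `to_mib` and `parse_oom` are made up for this illustration.

```python
import re

# Message text from error 1, with "erved" read as "reserved".
message = (
    "CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 10.92 GiB total capacity; 8.39 GiB already allocated; "
    "313.00 MiB free; 10.01 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(text):
    """Return the figures of a PyTorch CUDA OOM message, all in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {
        name: to_mib(*re.search(pattern, text).groups())
        for name, pattern in patterns.items()
    }


stats = parse_oom(message)
# Memory held by PyTorch's caching allocator but not backing any live tensor.
cached_unused = stats["reserved"] - stats["allocated"]

print(f"requested     : {stats['requested']:.0f} MiB")
print(f"free on device: {stats['free']:.0f} MiB")
print(f"cached, unused: {cached_unused:.0f} MiB")
print("request exceeds free device memory:", stats["requested"] > stats["free"])
```

The 512 MiB request is larger than the 313 MiB still free on the 10.92 GiB card. PyTorch was also holding roughly 1.6 GiB of cached but unallocated memory, so one possible reading is that no single cached block was large enough for the request; the log alone does not confirm this.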

Please check potential solutions here:
#1029
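
The contents of #1029 are not reproduced in this thread. As a general illustration only, one common way to reduce GPU memory use in an MMSegmentation 0.x-style config is to lower the per-GPU batch size. The values below are hypothetical and are not taken from the issue or from the reporter's config.

```python
# Fragment of an MMSegmentation 0.x-style config file (configs are plain Python).
# The numbers are illustrative, not values from the issue.
data = dict(
    samples_per_gpu=1,  # images per GPU per iteration; usually the first knob to lower
    workers_per_gpu=2,  # dataloader worker processes per GPU
)

# The effective batch size scales with the number of GPUs used for training.
num_gpus = 1  # hypothetical
effective_batch_size = data["samples_per_gpu"] * num_gpus
print("effective batch size:", effective_batch_size)
```

A smaller batch changes the effective batch size, so the learning rate may need retuning. Reducing the training crop size is another option, at the cost of less context per sample.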

MengzhangLI self-assigned this on Nov 26, 2021