ImportError: And RuntimeError: Broken pipe #1077

chenhaiwen · 2021-11-25T12:51:38Z

I am training OCRNet on my own dataset.
With a single GPU, training only gets through one epoch (I set workflow = [('train', 1), ('val', 1)]). After the first round of training and validation, a CUDA out-of-memory error is reported (error 1). When I try to train with multiple GPUs, a different error is reported (error 2). Thanks for your help.

error 1:

2021-11-25 20:07:21,165 - mmseg - INFO -
+------+-------+------+
| aAcc | mIoU | mAcc |
+------+-------+------+
| 77.4 | 44.57 | 65.8 |
+------+-------+------+
2021-11-25 20:07:21,169 - mmseg - INFO - Exp name: ocrnet_hr18_512x512_gaofen.py
2021-11-25 20:07:21,170 - mmseg - INFO - Epoch(val) [1][1497] aAcc: 0.7740, mIoU: 0.4457, mAcc: 0.6580, IoU.building land: 0.6174, IoU.farmland: 0.623IoU.forest: 0.6644, IoU.grassland: 0.1251, IoU.water: 0.2242, IoU.background: 0.4196, Acc.building land: 0.7475, Acc.farmland: 0.8020, Acc.forest: 0.852Acc.grassland: 0.1387, Acc.water: 0.5086, Acc.background: 0.8990
2021-11-25 20:08:30,407 - mmseg - INFO - Exp name: ocrnet_hr18_512x512_gaofen.py
2021-11-25 20:08:30,408 - mmseg - INFO - Epoch(val) [1][187] decode_0.loss_ce: 0.2673, decode_0.acc_seg: 57.0989, decode_1.loss_ce: 0.6680, decode_1.acc_seg: 57.4639, loss: 0.9352
Traceback (most recent call last):
File "tools/train.py", line 185, in <module>
main()
File "tools/train.py", line 181, in main
meta=meta)
File "/home/chenhaiwen/project/mmsegmentation/mmseg/apis/train.py", line 120, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
runner.outputs['loss'].backward()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 10.92 GiB total capacity; 8.39 GiB already allocated; 313.00 MiB free; 10.01 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fec8874c77d in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-pages/torch/lib/libc10.so)
frame #1: + 0x20626 (0x7fec889a4626 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.s
frame #2: + 0x214f4 (0x7fec889a54f4 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.s
frame #3: + 0x21b81 (0x7fec889a5b81 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.s
frame #4: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optionalc10::MemoryFormat) + 0x249 (0x7fec8b8b4c79 in /home/chenwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xd25dc9 (0x7fec898d7dc9 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cudo)
frame #6: + 0xd3fbf7 (0x7fec898f1bf7 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cudo)
frame #7: + 0xe450dd (0x7fecbba0b0dd in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu)
frame #8: + 0xe453f7 (0x7fecbba0b3f7 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu)
frame #9: at::empty(c10::ArrayRef, c10::TensorOptions const&, c10::optionalc10::MemoryFormat) + 0xfa (0x7fecbbb15e7a in /home/chenhaiwen/anacondenvs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0xcabd93 (0x7fec8985dd93 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #11: at::native::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef<lo, c10::ArrayRef, long, bool, bool) + 0xb2 (0x7fec8985e5d2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libth_cuda.so)
frame #12: + 0xd117db (0x7fec898c37db in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #13: + 0xd415f8 (0x7fec898f35f8 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #14: at::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10rrayRef, long, bool, bool) + 0x1ad (0x7fecbbb18ced in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpo)
frame #15: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10rrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x223 (0x7fec8985cca3 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packageorch/lib/libtorch_cuda.so)
frame #16: + 0xd118c5 (0x7fec898c38c5 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #17: + 0xd41654 (0x7fec898f3654 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cuso)
frame #18: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRlong>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7fecbbb276a2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch//libtorch_cpu.so)
frame #19: + 0x2c250c2 (0x7fecbd7eb0c2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cso)
frame #20: + 0x2c39684 (0x7fecbd7ff684 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cso)
frame #21: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRlong>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7fecbbb276a2 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch//libtorch_cpu.so)
frame #22: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocatorat::Tensor >&&) + 0x258 (0x7fecbd672098 ihome/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0x30d1017 (0x7fecbdc97017 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cso)
frame #24: torch::autograd::Engine::evaluate_function(std::shared_ptrtorch::autograd::GraphTask&, torch::autograd::Node, torch::autograd::InputBufferstd::shared_ptrtorch::autograd::ReadyQueue const&) + 0x1400 (0x7fecbdc92860 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/ch/lib/libtorch_cpu.so)
frame #25: torch::autograd::Engine::thread_main(std::shared_ptrtorch::autograd::GraphTask const&) + 0x451 (0x7fecbdc93401 in /home/chenhaiwen/anacondanvs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #26: torch::autograd::Engine::thread_init(int, std::shared_ptrtorch::autograd::ReadyQueue const&, bool) + 0x89 (0x7fecbdc8b579 in /home/chenhaiwanaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #27: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptrtorch::autograd::ReadyQueue const&, bool) + 0x4a (0x7fecc1fba99a in me/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #28: + 0xc9039 (0x7fecc4af2039 in /home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/../../../../ibstdc++.so.6)
frame #29: + 0x9669 (0x7fece72ff669 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #30: clone + 0x43 (0x7fece7227323 in /lib/x86_64-linux-gnu/libc.so.6)

error 2:
Traceback (most recent call last):
File "tools/train.py", line 21, in <module>
from mmseg.apis import set_random_seed, train_segmentor
File "/home/chenhaiwen/project/mmsegmentation/mmseg/apis/__init__.py", line 2, in <module>
from .inference import inference_segmentor, init_segmentor, show_result_pyplot
File "/home/chenhaiwen/project/mmsegmentation/mmseg/apis/inference.py", line 2, in <module>
import matplotlib.pyplot as plt
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/matplotlib/pyplot.py", line 2500, in <module>
switch_backend(rcParams["backend"])
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/matplotlib/pyplot.py", line 288, in switch_backend
newbackend, required_framework, current_framework))
ImportError: Cannot load backend 'TkAgg' which requires the 'tk' interactive framework, as 'headless' is currently running
Traceback (most recent call last):
File "tools/train.py", line 185, in <module>
main()
File "tools/train.py", line 103, in main
init_dist(args.launcher, **cfg.dist_params)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 20, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 34, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
KeyboardInterrupt
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
/home/chenhaiwen/project/mmsegmentation/mmseg/models/backbones/hrnet.py:318: UserWarning: DeprecationWarning: pretrained is deprecated, please use "init_cfg" instead
warnings.warn('DeprecationWarning: pretrained is deprecated, '
Traceback (most recent call last):
File "tools/train.py", line 185, in <module>
main()
File "tools/train.py", line 145, in main
model.init_weights()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 117, in init_weights
m.init_weights()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 106, in init_weights
initialize(self, self.init_cfg)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 612, in initialize
_initialize(module, cp_cfg)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 517, in _initialize
func(module)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 494, in __call__
logger=logger)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 513, in load_checkpoint
checkpoint = _load_checkpoint(filename, map_location, logger)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 451, in _load_checkpoint
return CheckpointLoader.load_checkpoint(filename, map_location, logger)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 244, in load_checkpoint
return checkpoint_loader(filename, map_location)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 405, in load_from_openmmlab
checkpoint = load_from_http(model_url, map_location=map_location)
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 286, in load_from_http
torch.distributed.barrier()
File "/home/chenhaiwen/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1710, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
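
A note on the first traceback in error 2: the ImportError is raised by matplotlib when it is asked to load the interactive TkAgg backend on a machine with no display. A minimal sketch of one common workaround, assuming nothing in the training run needs an interactive window, is to select the non-interactive Agg backend through the MPLBACKEND environment variable before matplotlib is first imported. Where to set it (the shell, the launch script, or the top of tools/train.py) is an assumption and has not been checked against this setup, and it does not address the out-of-memory error in error 1.

```python
import os

# Select a non-interactive backend before matplotlib is imported.
# matplotlib reads MPLBACKEND at import time; it takes precedence over matplotlibrc.
os.environ["MPLBACKEND"] = "Agg"

try:
    import matplotlib
    import matplotlib.pyplot as plt  # should no longer try to load TkAgg

    backend_in_use = matplotlib.get_backend()
except ImportError:
    # matplotlib is not installed in this environment; nothing to check.
    backend_in_use = None

print("matplotlib backend:", backend_in_use)
```

The same effect can usually be had from the shell by exporting MPLBACKEND=Agg before launching the distributed training command.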

MengzhangLI · 2021-11-26T07:42:04Z

It is caused by limited GPU memory.
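
The figures in the error 1 message are consistent with this. A small sketch that pulls the numbers out of the message shows that the 512 MiB request is larger than the 313 MiB still free on the 10.92 GiB card. The helper name `parse_oom` and the regular expressions are illustrative assumptions, not part of PyTorch.

```python
import re

# Message copied from error 1 above (text after the "free" figure omitted).
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 10.92 GiB total capacity; 8.39 GiB already allocated; "
    "313.00 MiB free"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a size string such as ('10.92', 'GiB') to MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message):
    """Hypothetical helper: extract sizes (in MiB) from a CUDA OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = to_mib(*match.groups())
    return sizes


sizes = parse_oom(OOM_MESSAGE)
shortfall = sizes["requested"] - sizes["free"]
print(f"requested {sizes['requested']:.0f} MiB, "
      f"free {sizes['free']:.0f} MiB, short by {shortfall:.0f} MiB")
```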

Please check potential solutions here:
#1029
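
Separately from that thread, here is a minimal sketch of the settings that usually lower GPU memory use in an MMSegmentation 0.x style config. The values are illustrative assumptions, not taken from #1029 or from the reporter's ocrnet_hr18_512x512_gaofen.py; `samples_per_gpu` and `workers_per_gpu` are the dataloader keys used by that config style.

```python
# Fragment of a Python config file; all values are illustrative assumptions.
crop_size = (512, 512)  # a smaller crop, e.g. (384, 384), reduces activation memory

data = dict(
    samples_per_gpu=1,  # fewer images per GPU per iteration -> less GPU memory
    workers_per_gpu=2,  # dataloader processes; affects host RAM, not GPU memory
)

# Train-only workflow: drops the ('val', 1) pass used in the report, in case
# that extra pass is what pushes memory over the limit after the first epoch.
workflow = [('train', 1)]
```

Whether any of these is enough depends on the model and the image sizes, so they are worth trying one at a time while watching GPU memory.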

MengzhangLI self-assigned this Nov 26, 2021

MengzhangLI closed this as completed Nov 26, 2021

wjkim81 pushed a commit to wjkim81/mmsegmentation that referenced this issue Dec 3, 2023

[Feature] Soft wing loss (TIP'2021) (open-mmlab#1077)

ff55efe

* support soft_wing_loss * add unittest for soft wing loss * update model md * update doc
