How can I train LoRRA on TextVQA dataset using multi-GPUs? #116

Closed
ChenyuGAO-CS opened this issue Jul 3, 2019 · 6 comments

@ChenyuGAO-CS

❓ Questions and Help

I tried setting only data_parallel=true, and got:
2019-07-03T16:03:40 INFO: Starting training...
2019-07-03T16:03:40 INFO: Fetching fastText model for OCR processing
2019-07-03T16:03:41 INFO: Loading fasttext model now from /media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/.vector_cache/wiki.en.bin
2019-07-03T16:03:52 INFO: Finished loading fasttext model
2019-07-03T16:06:49 ERROR: set_storage is not allowed on Tensor created from .data or .detach()
Traceback (most recent call last):
File "tools/run.py", line 89, in <module>
run()
File "tools/run.py", line 78, in run
trainer.train()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 257, in train
should_break = self._logistics(report)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 347, in _logistics
_, meter = self.evaluate(self.val_loader, single_batch=True)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 425, in evaluate
report = self._forward_pass(batch)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 273, in _forward_pass
model_output = self.model(prepared_batch)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/base_model.py", line 104, in __call__
model_output = super().__call__(sample_list, *args, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py", line 57, in forward
text_embedding_total = self.process_text_embedding(sample_list)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/pythia.py", line 196, in process_text_embedding
embedding = text_embedding_model(texts)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/modules/embeddings.py", line 45, in forward
return self.module(*args, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/modules/embeddings.py", line 160, in forward
self.recurrent_unit.flatten_parameters()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()
When I set only distributed=true, the model still trains on one GPU. I guess this is because local_rank is still None, but I don't know what value local_rank should have.

@ChenyuGAO-CS
Author

When I set distributed=true and run these commands:
export CUDA_VISIBLE_DEVICES=4,5,6,7
export NGPUS=4
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/run.py --tasks vqa --datasets textvqa --model lorra --config configs/vqa/textvqa/lorra.yml
the errors are:
`
2019-07-03T18:02:55 INFO: Starting training...
2019-07-03T18:02:55 INFO: Fetching fastText model for OCR processing
2019-07-03T18:02:55 ERROR: Traceback (most recent call last):
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/multi_task.py", line 73, in __getitem__
item = self.chosen_task[idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_task.py", line 154, in __getitem__
item = self.chosen_dataset[idx]
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 85, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_dataset.py", line 49, in __getitem__
sample = self.get_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 94, in get_item
return self.load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vizwiz/dataset.py", line 16, in load_item
sample = super().load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 126, in load_item
current_sample = self.add_ocr_details(sample_info, current_sample)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 141, in add_ocr_details
context = self.context_processor({"tokens": ocr_tokens})
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 153, in __call__
return self.processor(item)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 485, in __call__
self._try_download()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 400, in _try_download
synchronize()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/utils/distributed_utils.py", line 18, in synchronize
dist.barrier()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1358, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653183467/work/c10/cuda/impl/CUDAGuardImpl.h:35)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7feaa5b4edc5 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x11b677 (0x7feacb90a677 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x6e7b90 (0x7feacbed6b90 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x1e1 (0x7feacbf3de61 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x6c2af3 (0x7feacbeb1af3 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x12ce4a (0x7feacb91be4a in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: _PyCFunction_FastCallDict + 0x154 (0x558dee2d9744 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #7: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #9: + 0x14022b (0x558dee30822b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #10: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #12: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #13: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #15: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #16: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #18: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #19: _PyFunction_FastCallDict + 0x1bc (0x558dee35ac4c in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #20: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #21: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #22: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #23: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #24: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #25: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #27: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #28: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #29: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #30: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #31: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #32: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #33: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #35: + 0x191bfe (0x558dee359bfe in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #36: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #37: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #39: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #40: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #42: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #43: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #44: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #46: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #47: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #49: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #50: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #51: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #52: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #53: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #54: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #55: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #56: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #57: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #58: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #59: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #61: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #62: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #63: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)

Traceback (most recent call last):
File "tools/run.py", line 89, in <module>
run()
File "tools/run.py", line 78, in run
trainer.train()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 240, in train
for batch in self.train_loader:
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/multi_task.py", line 73, in __getitem__
item = self.chosen_task[idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_task.py", line 154, in __getitem__
item = self.chosen_dataset[idx]
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 85, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_dataset.py", line 49, in __getitem__
sample = self.get_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 94, in get_item
return self.load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vizwiz/dataset.py", line 16, in load_item
sample = super().load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 126, in load_item
current_sample = self.add_ocr_details(sample_info, current_sample)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 141, in add_ocr_details
context = self.context_processor({"tokens": ocr_tokens})
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 153, in __call__
return self.processor(item)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 485, in __call__
self._try_download()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 400, in _try_download
synchronize()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/utils/distributed_utils.py", line 18, in synchronize
dist.barrier()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1358, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653183467/work/c10/cuda/impl/CUDAGuardImpl.h:35)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7feaa5b4edc5 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x11b677 (0x7feacb90a677 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x6e7b90 (0x7feacbed6b90 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x1e1 (0x7feacbf3de61 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x6c2af3 (0x7feacbeb1af3 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x12ce4a (0x7feacb91be4a in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: _PyCFunction_FastCallDict + 0x154 (0x558dee2d9744 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #7: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #9: + 0x14022b (0x558dee30822b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #10: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #12: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #13: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #15: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #16: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #18: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #19: _PyFunction_FastCallDict + 0x1bc (0x558dee35ac4c in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #20: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #21: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #22: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #23: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #24: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #25: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #27: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #28: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #29: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #30: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #31: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #32: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #33: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #35: + 0x191bfe (0x558dee359bfe in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #36: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #37: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #39: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #40: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #42: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #43: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #44: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #46: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #47: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #49: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #50: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #51: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #52: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #53: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #54: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #55: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #56: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #57: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #58: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #59: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #61: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #62: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #63: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)

Traceback (most recent call last):
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/wangyan/anaconda3/envs/pythia03/bin/python', '-u', 'tools/run.py', '--local_rank=0', '--tasks', 'vqa', '--datasets', 'textvqa', '--model', 'lorra', '--config', 'configs/vqa/textvqa/lorra.yml']' returned non-zero exit status 1.
`
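On the question of what value local_rank should have: the last line of the log above shows the launcher appending `--local_rank=0` to the `tools/run.py` command, which suggests the value is supplied per process by `torch.distributed.launch` rather than set by hand. A minimal sketch of how a script can read it follows. The function name `parse_local_rank`, the `LOCAL_RANK` environment fallback, and the `-1` default are assumptions for illustration; how Pythia itself consumes the flag is not shown in this thread.

```python
import argparse
import os


def parse_local_rank(argv):
    """Read the per-process index that the launcher passes on the command line."""
    parser = argparse.ArgumentParser()
    # torch.distributed.launch appends --local_rank=<i> to each spawned process.
    # The environment fallback and the -1 default are assumptions of this sketch.
    parser.add_argument(
        "--local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", -1)),
    )
    # parse_known_args ignores the other flags (--tasks, --datasets, ...).
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    # Same argument shape as in the CalledProcessError message above.
    print(parse_local_rank(["--local_rank=0", "--tasks", "vqa", "--datasets", "textvqa"]))
```

With `--nproc_per_node=4`, each of the four processes would receive a different index from 0 to 3.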

@apsdehal
Contributor

apsdehal commented Jul 3, 2019

To use data_parallel, make sure you are on PyTorch version 1.0.1.post2, or use the nightly build. There was a bug in PyTorch that causes problems with flatten_parameters; see pytorch/pytorch#21108.

As for distributed, I will look into it. I am currently preparing a patch that fixes some issues with distributed.
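A quick way to confirm which PyTorch build is active before retrying data_parallel is sketched below. The helper names `matches_expected` and `installed_torch_version` are hypothetical, and ignoring a local build suffix after `+` is an assumption made for convenience, not something stated in this thread.

```python
EXPECTED_VERSION = "1.0.1.post2"  # release named in the comment above


def matches_expected(version, expected=EXPECTED_VERSION):
    """Return True when the version string equals the expected release.

    Any local build suffix after '+' is ignored (an assumption of this sketch).
    """
    return version.split("+", 1)[0] == expected


def installed_torch_version():
    """Return the installed PyTorch version string, or None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    return str(torch.__version__)


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("PyTorch is not installed in this environment")
    else:
        verdict = "matches" if matches_expected(version) else "differs from"
        print(version, verdict, EXPECTED_VERSION)
```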

@ChenyuGAO-CS
Author

Thanks, I will install PyTorch 1.0.1.post2 and try data_parallel again.

@ChenyuGAO-CS
Author

I installed PyTorch 1.0.1.post2, but when using data_parallel=true I got another error:

2019-07-05T16:20:55 INFO: Starting training...
2019-07-05T16:20:55 INFO: Fetching fastText model for OCR processing
2019-07-05T16:20:55 INFO: Loading fasttext model now from /media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/.vector_cache/wiki.en.bin
2019-07-05T16:21:09 INFO: Finished loading fasttext model
2019-07-05T16:34:39 INFO: textvqa:, 100/240000000, train/total_loss: 10.0250 (46.8477), train/logit_bce: 10.0250 (46.8477), train/vqa_accuracy: 0.0867 (0.0907), val/total_loss: 8.3159, val/logit_bce: 8.3159, val/vqa_accuracy: 0.1125, max mem: 6572.0, lr: 0.00208, time: 09m 17s 149ms, eta: 385919h 02m 33s 365ms
2019-07-05T16:35:07 ERROR: '/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py'
Traceback (most recent call last):
File "tools/run.py", line 89, in <module>
run()
File "tools/run.py", line 78, in run
trainer.train()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 251, in train
report = self._forward_pass(batch)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 273, in _forward_pass
model_output = self.model(prepared_batch)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/base_model.py", line 104, in __call__
model_output = super().__call__(sample_list, *args, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py", line 58, in forward
sample_list.text = self.word_embedding(sample_list.text) # question embedding, torch.Size([128, 14, 300])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/traceback.py", line 197, in format_stack
return format_list(extract_stack(f, limit=limit))
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/traceback.py", line 211, in extract_stack
stack = StackSummary.extract(walk_stack(f), limit=limit)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/traceback.py", line 360, in extract
linecache.checkcache(filename)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/linecache.py", line 79, in checkcache
del cache[filename]
KeyError: '/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py'

@apsdehal
Contributor

apsdehal commented Jul 6, 2019

The issue you are facing is related to https://bugs.python.org/issue25872. It happens if you edit source files while training is running.
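As a rough illustration of why editing a file mid-run matters: the last frames of the traceback above are in `linecache.checkcache`, which drops a cached source file once its size or modification time on disk no longer matches the cached entry. The sketch below reproduces only that eviction step, in a single thread and on a temporary file. The helper name `stale_entry_is_evicted` and the file name `model_stub.py` are made up for the example, and it does not reproduce the multi-threaded `KeyError` itself.

```python
import linecache
import os
import tempfile


def stale_entry_is_evicted():
    """Show that checkcache drops a cached file after it changes on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model_stub.py")  # hypothetical source file
        with open(path, "w") as f:
            f.write("x = 1\n")

        # Reading a line stores the file's size, mtime and lines in the cache.
        linecache.getline(path, 1)
        cached_before = path in linecache.cache

        # Simulate editing the file while the job is running. The size changes,
        # so the check does not depend on the mtime resolution.
        with open(path, "w") as f:
            f.write("x = 1  # edited during training\n")

        # checkcache compares the cached size and mtime with the file on disk
        # and removes the stale entry.
        linecache.checkcache(path)
        cached_after = path in linecache.cache
        return cached_before, cached_after


if __name__ == "__main__":
    print(stale_entry_is_evicted())
```

Under that reading, leaving the checked-out sources untouched while a job runs, or training from a separate copy of the repository, avoids triggering the eviction path at all.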

@ChenyuGAO-CS
Author

Thanks for your reply, it works.

apsdehal pushed a commit that referenced this issue May 8, 2020
Summary:
Fix LMDBFeatureReader

Test Plan : Test with mmimdb dataset.
Pull Request resolved: fairinternal/mmf-internal#116

Reviewed By: apsdehal

Differential Revision: D21445705

Pulled By: vedanuj

fbshipit-source-id: af96f3cc32f50d3c97ebc2df8ca815f97bea0c0d