How can I train LoRRA on TextVQA dataset using multi-GPUs? #116

ChenyuGAO-CS · 2019-07-03T08:51:07Z

❓ Questions and Help

I first tried setting only data_parallel=true, and got:
2019-07-03T16:03:40 INFO: Starting training...
2019-07-03T16:03:40 INFO: Fetching fastText model for OCR processing
2019-07-03T16:03:41 INFO: Loading fasttext model now from /media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/.vector_cache/wiki.en.bin
2019-07-03T16:03:52 INFO: Finished loading fasttext model
2019-07-03T16:06:49 ERROR: set_storage is not allowed on Tensor created from .data or .detach()
Traceback (most recent call last):
File "tools/run.py", line 89, in <module>
run()
File "tools/run.py", line 78, in run
trainer.train()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 257, in train
should_break = self._logistics(report)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 347, in _logistics
_, meter = self.evaluate(self.val_loader, single_batch=True)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 425, in evaluate
report = self._forward_pass(batch)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 273, in _forward_pass
model_output = self.model(prepared_batch)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/base_model.py", line 104, in __call__
model_output = super().__call__(sample_list, *args, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py", line 57, in forward
text_embedding_total = self.process_text_embedding(sample_list)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/pythia.py", line 196, in process_text_embedding
embedding = text_embedding_model(texts)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/modules/embeddings.py", line 45, in forward
return self.module(*args, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/modules/embeddings.py", line 160, in forward
self.recurrent_unit.flatten_parameters()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()
When I set only distributed=true, the model still trains on a single GPU. I guess this is because local_rank is still none, but I don't know what value local_rank should have.

ChenyuGAO-CS · 2019-07-03T11:31:23Z

When I set distributed=true and run the following commands:
export CUDA_VISIBLE_DEVICES=4,5,6,7
export NGPUS=4
python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/run.py --tasks vqa --datasets textvqa --model lorra --config configs/vqa/textvqa/lorra.yml
the errors are:
`
2019-07-03T18:02:55 INFO: Starting training...
2019-07-03T18:02:55 INFO: Fetching fastText model for OCR processing
2019-07-03T18:02:55 ERROR: Traceback (most recent call last):
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/multi_task.py", line 73, in __getitem__
item = self.chosen_task[idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_task.py", line 154, in __getitem__
item = self.chosen_dataset[idx]
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 85, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_dataset.py", line 49, in __getitem__
sample = self.get_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 94, in get_item
return self.load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vizwiz/dataset.py", line 16, in load_item
sample = super().load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 126, in load_item
current_sample = self.add_ocr_details(sample_info, current_sample)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 141, in add_ocr_details
context = self.context_processor({"tokens": ocr_tokens})
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 153, in __call__
return self.processor(item)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 485, in __call__
self._try_download()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 400, in _try_download
synchronize()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/utils/distributed_utils.py", line 18, in synchronize
dist.barrier()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1358, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653183467/work/c10/cuda/impl/CUDAGuardImpl.h:35)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7feaa5b4edc5 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x11b677 (0x7feacb90a677 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x6e7b90 (0x7feacbed6b90 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x1e1 (0x7feacbf3de61 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x6c2af3 (0x7feacbeb1af3 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x12ce4a (0x7feacb91be4a in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: _PyCFunction_FastCallDict + 0x154 (0x558dee2d9744 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #7: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #9: + 0x14022b (0x558dee30822b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #10: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #12: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #13: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #15: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #16: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #18: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #19: _PyFunction_FastCallDict + 0x1bc (0x558dee35ac4c in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #20: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #21: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #22: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #23: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #24: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #25: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #27: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #28: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #29: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #30: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #31: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #32: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #33: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #35: + 0x191bfe (0x558dee359bfe in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #36: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #37: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #39: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #40: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #42: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #43: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #44: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #46: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #47: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #49: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #50: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #51: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #52: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #53: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #54: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #55: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #56: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #57: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #58: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #59: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #61: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #62: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #63: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)

Traceback (most recent call last):
File "tools/run.py", line 89, in <module>
run()
File "tools/run.py", line 78, in run
trainer.train()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 240, in train
for batch in self.train_loader:
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/multi_task.py", line 73, in __getitem__
item = self.chosen_task[idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_task.py", line 154, in __getitem__
item = self.chosen_dataset[idx]
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 85, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/base_dataset.py", line 49, in __getitem__
sample = self.get_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 94, in get_item
return self.load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vizwiz/dataset.py", line 16, in load_item
sample = super().load_item(idx)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 126, in load_item
current_sample = self.add_ocr_details(sample_info, current_sample)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/vqa/vqa2/dataset.py", line 141, in add_ocr_details
context = self.context_processor({"tokens": ocr_tokens})
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 153, in __call__
return self.processor(item)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 485, in __call__
self._try_download()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/tasks/processors.py", line 400, in _try_download
synchronize()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/utils/distributed_utils.py", line 18, in synchronize
dist.barrier()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1358, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653183467/work/c10/cuda/impl/CUDAGuardImpl.h:35)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7feaa5b4edc5 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x11b677 (0x7feacb90a677 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x6e7b90 (0x7feacbed6b90 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x1e1 (0x7feacbf3de61 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x6c2af3 (0x7feacbeb1af3 in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x12ce4a (0x7feacb91be4a in /home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: _PyCFunction_FastCallDict + 0x154 (0x558dee2d9744 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #7: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #9: + 0x14022b (0x558dee30822b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #10: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #12: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #13: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #15: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #16: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #18: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #19: _PyFunction_FastCallDict + 0x1bc (0x558dee35ac4c in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #20: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #21: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #22: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #23: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #24: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #25: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #27: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #28: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #29: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #30: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #31: + 0x16ba91 (0x558dee333a91 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #32: _PyObject_FastCallDict + 0x8b (0x558dee2d992b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #33: + 0x19857e (0x558dee36057e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #35: + 0x191bfe (0x558dee359bfe in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #36: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #37: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #39: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #40: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #42: + 0x191a76 (0x558dee359a76 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #43: + 0x192771 (0x558dee35a771 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #44: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #46: + 0x19253b (0x558dee35a53b in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #47: + 0x198505 (0x558dee360505 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x30a (0x558dee38538a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #49: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #50: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #51: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #52: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #53: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #54: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #55: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #56: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #57: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #58: PyObject_Call + 0x3e (0x558dee2d954e in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #59: + 0x16b50a (0x558dee33350a in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x877 (0x558dee3858f7 in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #61: _PyFunction_FastCallDict + 0x11b (0x558dee35abab in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #62: _PyObject_FastCallDict + 0x26f (0x558dee2d9b0f in /home/wangyan/anaconda3/envs/pythia03/bin/python)
frame #63: _PyObject_Call_Prepend + 0x63 (0x558dee2de6a3 in /home/wangyan/anaconda3/envs/pythia03/bin/python)

Traceback (most recent call last):
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/wangyan/anaconda3/envs/pythia03/bin/python', '-u', 'tools/run.py', '--local_rank=0', '--tasks', 'vqa', '--datasets', 'textvqa', '--model', 'lorra', '--config', 'configs/vqa/textvqa/lorra.yml']' returned non-zero exit status 1.
`

apsdehal · 2019-07-03T17:47:29Z

To use data_parallel, make sure you are on PyTorch 1.0.1.post2, or else use the nightly build. There was a bug in PyTorch that breaks flatten_parameters; see pytorch/pytorch#21108.
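A quick way to confirm which build is installed before retrying is a small version check. This is only a sketch: the helper names `is_recommended` and `check_installed` are made up for illustration, and the only version string taken from this thread is 1.0.1.post2.

```python
# Hypothetical helper, not part of Pythia: compares a PyTorch version string
# against the release recommended in this thread.
RECOMMENDED_VERSION = "1.0.1.post2"


def is_recommended(version):
    """Return True if `version` is exactly the recommended release."""
    return version.strip() == RECOMMENDED_VERSION


def check_installed():
    """Return True/False for the installed torch, or None if torch is missing."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return None
    return is_recommended(torch.__version__)


if __name__ == "__main__":
    print("installed torch matches recommendation:", check_installed())
```

A nightly build will report a different version string, so a `False` result there does not by itself mean the environment is affected.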

About distributed, I will look into it. I am actually preparing a patch for fixing some issues with distributed.
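On the local_rank question: the launcher error above shows that `torch.distributed.launch` appends `--local_rank=0` to the `tools/run.py` command, and each spawned process receives its own value, from 0 up to `nproc_per_node - 1`. The sketch below shows one way a script could read that argument; `get_local_rank` is a hypothetical helper, and how Pythia itself consumes the value is not shown in this thread.

```python
import argparse


def get_local_rank(argv=None):
    """Read --local_rank from the command line; return None if it is absent.

    Hypothetical helper for illustration. `argv=None` means "use sys.argv[1:]".
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args ignores the script's other flags (--tasks, --model, ...)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    # Mirrors the argument list shown in the launcher error message
    example = ["--local_rank=0", "--tasks", "vqa", "--datasets", "textvqa"]
    print(get_local_rank(example))
```

When the script is started directly with `python tools/run.py ...`, no launcher adds the flag, so the helper returns None, which matches the single-GPU behaviour described in the question.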

ChenyuGAO-CS · 2019-07-04T03:37:15Z

Thanks, I will install PyTorch 1.0.1.post2 and try data_parallel again.

ChenyuGAO-CS · 2019-07-05T09:13:56Z

I installed PyTorch 1.0.1.post2, but when using data_parallel=true I got another error:

2019-07-05T16:20:55 INFO: Starting training...
2019-07-05T16:20:55 INFO: Fetching fastText model for OCR processing
2019-07-05T16:20:55 INFO: Loading fasttext model now from /media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/.vector_cache/wiki.en.bin
2019-07-05T16:21:09 INFO: Finished loading fasttext model
2019-07-05T16:34:39 INFO: textvqa:, 100/240000000, train/total_loss: 10.0250 (46.8477), train/logit_bce: 10.0250 (46.8477), train/vqa_accuracy: 0.0867 (0.0907), val/total_loss: 8.3159, val/logit_bce: 8.3159, val/vqa_accuracy: 0.1125, max mem: 6572.0, lr: 0.00208, time: 09m 17s 149ms, eta: 385919h 02m 33s 365ms
2019-07-05T16:35:07 ERROR: '/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py'
Traceback (most recent call last):
File "tools/run.py", line 89, in <module>
run()
File "tools/run.py", line 78, in run
trainer.train()
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 251, in train
report = self._forward_pass(batch)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 273, in _forward_pass
model_output = self.model(prepared_batch)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/base_model.py", line 104, in __call__
model_output = super().__call__(sample_list, *args, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py", line 58, in forward
sample_list.text = self.word_embedding(sample_list.text) # question embedding, torch.Size([128, 14, 300])
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/traceback.py", line 197, in format_stack
return format_list(extract_stack(f, limit=limit))
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/traceback.py", line 211, in extract_stack
stack = StackSummary.extract(walk_stack(f), limit=limit)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/traceback.py", line 360, in extract
linecache.checkcache(filename)
File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/linecache.py", line 79, in checkcache
del cache[filename]
KeyError: '/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py'

apsdehal · 2019-07-06T03:50:09Z

The issue you are facing is related to https://bugs.python.org/issue25872. It happens if you edit source files while a training run is in progress.
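To illustrate why editing a file matters here: the traceback ends in `linecache.checkcache`, which discards a cached source file once its size or modification time on disk no longer matches the cache. The sketch below reproduces only that single-threaded invalidation step on a temporary file. It does not reproduce the KeyError itself, which presumably needs several threads formatting a stack at the same time. `linecache.cache` is a CPython implementation detail, used here purely for inspection.

```python
import linecache
import os
import tempfile


def cache_entry_dropped_after_edit():
    """Return (cached_before_edit, cached_after_edit) for a temporary file."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.py")
        with open(path, "w") as handle:
            handle.write("x = 1\n")

        linecache.getlines(path)  # first read populates the cache
        cached_before = path in linecache.cache

        # Simulate editing the file mid-run: the size on disk changes
        with open(path, "w") as handle:
            handle.write("x = 1\ny = 2\n")

        linecache.checkcache(path)  # stale entry is discarded here
        cached_after = path in linecache.cache
    return cached_before, cached_after


if __name__ == "__main__":
    print(cache_entry_dropped_after_edit())
```

Leaving the source tree untouched while a job runs, or launching from a separate copy of the code, avoids triggering this invalidation.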

ChenyuGAO-CS · 2019-07-08T06:37:09Z

Thanks for your reply, it works.

ChenyuGAO-CS closed this as completed Jul 8, 2019
