How can I train LoRRA on TextVQA dataset using multi-GPUs? #116
When I set distributed=true and run the command, I get:

```
Traceback (most recent call last):
```
For using data_parallel, make sure you are on PyTorch version 1.0.1.post2, or use the nightly build. There was an issue in PyTorch that causes problems with flatten_parameters; see pytorch/pytorch#21108. About distributed, I will look into it. I am actually preparing a patch to fix some issues with distributed.
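A quick way to confirm the installed build meets that requirement is to compare `torch.__version__` against the fixed release. `version_tuple` below is a hypothetical helper written for this thread, not part of Pythia or PyTorch:

```python
def version_tuple(version):
    # Parse the leading numeric components of a version string,
    # ignoring suffixes such as "post2" in "1.0.1.post2".
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

# The flatten_parameters fix shipped in 1.0.1.post2, so anything
# at or past 1.0.1 should be safe to try.
print(version_tuple("1.0.1.post2") >= (1, 0, 1))  # True
print(version_tuple("1.0.0") >= (1, 0, 1))        # False
```

In practice you would call `version_tuple(torch.__version__)` at the top of your training script.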
Thanks, I will install PyTorch 1.0.1.post2 and try data_parallel again.
I installed PyTorch 1.0.1.post2, but when I use data_parallel=true I get another error:
The issue you are facing is related to https://bugs.python.org/issue25872. This happens if you edit files while you are training something.
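The linked CPython bug involves the traceback machinery reading source files that changed on disk mid-run. A minimal sketch of the underlying stale-cache behaviour, using only the stdlib `linecache` module (this illustrates the failure class, not the exact bug in bpo-25872):

```python
import linecache
import os
import tempfile

# Write a small "source file" and let linecache cache its contents.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write("original = 1\n")
first = linecache.getline(path, 1)   # populates the cache

# Simulate editing the file while a long-running job still holds
# the old cache -- what happens if you edit code mid-training.
with open(path, "w") as f:
    f.write("edited = 2\n")
stale = linecache.getline(path, 1)   # still returns the pre-edit line

linecache.checkcache(path)           # invalidate out-of-date entries
fresh = linecache.getline(path, 1)   # now reflects the edited file
os.remove(path)
```

The safe practice is simply to avoid editing tracked source files while a training run is in flight.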
Thanks for your reply, it works.
Summary: Fix LMDBFeatureReader
Test Plan: Test with mmimdb dataset.
Pull Request resolved: fairinternal/mmf-internal#116
Reviewed By: apsdehal
Differential Revision: D21445705
Pulled By: vedanuj
fbshipit-source-id: af96f3cc32f50d3c97ebc2df8ca815f97bea0c0d
❓ Questions and Help
I tried setting only data_parallel=true, and got:
```
2019-07-03T16:03:40 INFO: Starting training...
2019-07-03T16:03:40 INFO: Fetching fastText model for OCR processing
2019-07-03T16:03:41 INFO: Loading fasttext model now from /media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/.vector_cache/wiki.en.bin
2019-07-03T16:03:52 INFO: Finished loading fasttext model
2019-07-03T16:06:49 ERROR: set_storage is not allowed on Tensor created from .data or .detach()
Traceback (most recent call last):
  File "tools/run.py", line 89, in <module>
    run()
  File "tools/run.py", line 78, in run
    trainer.train()
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 257, in train
    should_break = self._logistics(report)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 347, in _logistics
    _, meter = self.evaluate(self.val_loader, single_batch=True)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 425, in evaluate
    report = self._forward_pass(batch)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/common/trainer.py", line 273, in _forward_pass
    model_output = self.model(prepared_batch)
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/base_model.py", line 104, in __call__
    model_output = super().__call__(sample_list, *args, **kwargs)
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/lorra.py", line 57, in forward
    text_embedding_total = self.process_text_embedding(sample_list)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/models/pythia.py", line 196, in process_text_embedding
    embedding = text_embedding_model(texts)
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/modules/embeddings.py", line 45, in forward
    return self.module(*args, **kwargs)
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/userdisk1/wy/Document/GraduationProject/TestVQACode/pythia/pythia_v03/pythia03_2/pythia/pythia/modules/embeddings.py", line 160, in forward
    self.recurrent_unit.flatten_parameters()
  File "/home/wangyan/anaconda3/envs/pythia03/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()
```
When I set only distributed=true, the model still trains on 1 GPU. I guess this is because local_rank is still None, but I don't know what value local_rank should be.
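For reference, local_rank is normally not set by hand: the `torch.distributed.launch` helper spawns one process per GPU and passes `--local_rank=<n>` to each of them. A minimal sketch of how a training script typically receives that value (`parse_local_rank` is a hypothetical helper for illustration, not Pythia's actual CLI):

```python
import argparse

def parse_local_rank(argv):
    # torch.distributed.launch injects --local_rank=<n> into the
    # argument list of each spawned process; default to 0 so the
    # script also works as a plain single-process run.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

print(parse_local_rank(["--local_rank=3"]))  # 3
print(parse_local_rank([]))                  # 0
```

With a script structured like this, a 2-GPU run would be started with something along the lines of `python -m torch.distributed.launch --nproc_per_node=2 tools/run.py ...`, and each process would then see its own rank.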