I am using the NVIDIA Docker container for pytorch-1912. I can clone the GitHub repository without any problem, but when I try to run CC-FPSE on my own data (on a 4-GPU instance):
python train.py --name condconv --netG condconv --netD fpse --lambda_feat 20 --dataset_mode custom --label_dir mydata/train_label --image_dir mydata/train_img --label_nc 6 --no_instance --batchSize 1 --niter 100 --niter_decay 100 --use_vae --ngpus_per_node 4
I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/uge_mnt/home/adeschem/CC-FPSE/train.py", line 37, in main_worker
    dist.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=world_size, rank=rank)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 109, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Network is unreachable
This seems to be related to the torch.distributed communication package, even though I am not passing the --mpdist option to enable distributed multiprocessing.
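The traceback shows the failure inside the TCP rendezvous, so one way to narrow it down might be to check whether the host in opt.dist_url is routable from inside the container. Below is a minimal sketch of such a check using only the standard library. It assumes opt.dist_url is a tcp://host:port string; the helper names parse_dist_url and probe_route and the example URL are mine for illustration and are not part of CC-FPSE or PyTorch.

```python
import socket
from urllib.parse import urlparse


def parse_dist_url(dist_url):
    """Split a tcp://host:port rendezvous URL into (host, port)."""
    parsed = urlparse(dist_url)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {dist_url!r}")
    return parsed.hostname, parsed.port


def probe_route(host, port, timeout=2.0):
    """Return None if the host looks routable, else the OS error message.

    A refused connection is treated as routable: the network path exists,
    there is simply no listener on that port yet.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except ConnectionRefusedError:
        return None
    except OSError as exc:
        # Covers "Network is unreachable", timeouts, and DNS failures.
        return str(exc)


if __name__ == "__main__":
    # Hypothetical value; substitute whatever opt.dist_url holds in your run.
    host, port = parse_dist_url("tcp://127.0.0.1:23456")
    error = probe_route(host, port)
    print(f"{host}:{port} ->", "routable" if error is None else error)
```

If the probe reports an error for the configured address but succeeds for a loopback address, that would suggest the rendezvous address is the problem rather than the GPUs or NCCL.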