global process count incorrect with elastic, fault tolerant training #6853
Comments
Shouldn't this be reading the world size from the environment rather than from the cached world_size argument?

```python
def init_ddp_connection(self, global_rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = str(self.cluster_environment.master_address())
    os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())
    os.environ["WORLD_SIZE"] = str(self.cluster_environment.world_size())
    if not torch.distributed.is_initialized():
        log.info(f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}")
        torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
```
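For context, the elastic launcher keeps per-worker environment variables current across rendezvous rounds, so the live values can always be read back from the environment. A minimal sketch (the exact variable set may differ between torchelastic versions; the helper below is illustrative only, not part of Lightning):

```python
import os

def read_elastic_env() -> dict:
    """Illustrative helper: read the process-group state that the elastic
    launcher typically exports for each worker."""
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global rank of this process
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # process count after the latest rendezvous
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # rank within the node
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", 29500)),
    }

if __name__ == "__main__":
    print(read_elastic_env())
```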
@srib what does your …
@ananthsub Thank you for the reply. All the following cases were run with the following command:

It seems to work fine when:

world_size: 8
os.environ["WORLD_SIZE"]: 8

world_size: 8
os.environ["WORLD_SIZE"]: 16

world_size: 8
os.environ["WORLD_SIZE"]: 8

world_size: 8
os.environ["WORLD_SIZE"]: 16

I even got training to run with the following change:

Before:

```python
torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
```

After:

```python
torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=int(os.environ["WORLD_SIZE"]))
```

Am I missing something here?
@ananthsub Can I submit a PR for this or are there any other implications that I am missing here?
@awaelchli what do you think of making the cluster environment the provider of the world size and global rank? Given this state is already available from the env variables, it could be used to set those fields. Then the cluster environment becomes the source of truth for these fields, which can be used in the training type plugin, which is used in the accelerator, which is used in the trainer.
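A minimal sketch of that idea, assuming a hypothetical environment class that simply proxies the values from the env variables (the real ClusterEnvironment interface may differ):

```python
import os

class ElasticEnvironmentSketch:
    """Hypothetical cluster environment that treats the launcher-set env
    variables as the single source of truth. Nothing is cached, so a world
    size that changes between elastic restarts is picked up automatically."""

    def master_address(self) -> str:
        return os.environ.get("MASTER_ADDR", "127.0.0.1")

    def master_port(self) -> int:
        return int(os.environ.get("MASTER_PORT", 29500))

    def world_size(self) -> int:
        return int(os.environ.get("WORLD_SIZE", 1))

    def global_rank(self) -> int:
        return int(os.environ.get("RANK", 0))

    def local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", 0))
```

The training type plugin would then query these methods instead of keeping its own copies of the world size and global rank.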
Sounds good, yes. Let's try it. This will affect all major plugins, so we need to be careful :)
@srib ok I think I understand now the bug you are experiencing. The world size is changing in the middle of training, and we are keeping the old value saved. So we should always read the world size from the cluster environment (which will read it from the environment).
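A toy illustration of the stale value (assumed numbers, single process, just simulating the env var change):

```python
import os

# Value captured at setup time, when the job had 8 processes.
os.environ["WORLD_SIZE"] = "8"
cached_world_size = int(os.environ["WORLD_SIZE"])

# Simulate a rescale: the environment now reports a larger world size.
os.environ["WORLD_SIZE"] = "16"

print(cached_world_size)              # 8  -> the old value that stays saved
print(int(os.environ["WORLD_SIZE"]))  # 16 -> what the cluster actually has now
```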
Found a related bug in master: Running …
@awaelchli is it because the lightning environment uses the same environment variables as torchelastic for local rank?
@awaelchli Thanks. If I understand …
@ananthsub yes, I'll try to send a PR today. I discovered that while trying your idea with global rank in the cluster environment.
@awaelchli Sounds good. I have a short-term workaround for now. I will wait until the fix makes it to master. If you need any help with testing, please let me know. Appreciate all the effort and help.
@awaelchli I think this is another issue: we shouldn't be messing with these environment variables, because we're not distinguishing whether they were set by torchelastic or by the Lightning spawn environment. Users running with torchelastic would mysteriously see the world size disappear.
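One hedged way to make that distinction explicit (the variable names checked here are an assumption about what the torchelastic launcher exports, not something Lightning is guaranteed to use):

```python
import os

def launched_by_torchelastic() -> bool:
    # Heuristic sketch: the elastic launcher typically exports GROUP_RANK
    # (and, in newer releases, TORCHELASTIC_RUN_ID) alongside RANK/WORLD_SIZE,
    # whereas a locally spawned run usually does not set these.
    return "GROUP_RANK" in os.environ or "TORCHELASTIC_RUN_ID" in os.environ

def clear_world_size_if_ours() -> None:
    # Only touch WORLD_SIZE when we set it ourselves; leave the launcher's value alone.
    if not launched_by_torchelastic():
        os.environ.pop("WORLD_SIZE", None)
```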
True. It would be good to know why this is there in the first place... :(
The issue is if someone runs trainer.fit() and then trainer.test() right afterward.
@awaelchli Sorry for the delay in responding. I had to make a few tweaks to my code as it was using the hydra configs from the latest master. I can confirm that it works fine. It does assign the global ranks correctly now. Please note that I only tested it with …
@srib that's amazing. Thanks for confirming, that's very good to know.
Thanks @awaelchli and @ananthsub for this PR. When is the next release so that I can be lazy and rely on the wheel? 😄
The next release is planned for tomorrow. This is the branch that we will release as 1.2.8: #6983
Thanks @awaelchli!
1.2.8 is out now. Hope this will work out well for you on the nccl backend; otherwise let me know.
🐛 Bug
Problem
The count of the total number of processes is set incorrectly.
Context
I am trying to run elastic training with torchelastic. I have tried with both gloo and nccl backends.

Error message

Error coming from gloo backend:

NCCL backend gives this error: pytorch/pytorch#20313
Please reproduce using the BoringModel
I am running the ImageNet example from pl using torchvision.models.resnet34. Happy to reproduce with BoringModel if needed.

Before launching, I have exported the variable GLOO_SOCKET_IFNAME and set it to the appropriate interface name.

On node 0:
PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
On node 1:
PL_TORCH_DISTRIBUTED_BACKEND=gloo python -m torchelastic.distributed.launch --nnodes=1:5 --rdzv_id='nodockertestelasticlaunch7' --rdzv_backend=etcd --rdzv_endpoint=10.18.0.15:2379 train_hydra.py +experiment=elastic_config.yaml
To Reproduce
Use following BoringModel and post here
Expected behavior
To be able to run distributed fault tolerant training :)
Environment
Note:
- Bugs with code are solved faster!
- Colab Notebook should be made public!
- IDE: Please, use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

Output of collect_env_details.py:

Additional context