Make Pytorch-Lightning DDP work without SLURM #1345

Closed
areshytko opened this issue Apr 2, 2020 · 9 comments · Fixed by #1387
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@areshytko (Contributor) commented Apr 2, 2020

🚀 Feature

Allow pytorch-lightning DDP mode to work everywhere ordinary pytorch DDP can work.
Basically, if every node in a cluster defines the following environment variables, it should work:

  • MASTER_PORT: A free port on the machine that will host the process with rank 0.
  • MASTER_ADDR: IP address of the machine that will host the process with rank 0.
  • WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.
  • RANK: Rank of each process, so each one knows whether it is the master or a worker.

See pytorch documentation
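
For reference, a minimal sketch of how plain PyTorch consumes these variables via the env:// init method (assuming all four variables above are already exported on every node):

    import os

    import torch.distributed as dist

    # With init_method="env://", MASTER_ADDR and MASTER_PORT are read from the
    # environment; RANK and WORLD_SIZE are passed explicitly here for clarity.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )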

Motivation

Pytorch-Lightning positions itself as a framework wrapper around PyTorch. One of its differentiating features is the ease of distributed training, so it is very counterintuitive that it doesn't work in cases where vanilla PyTorch does.

For example, in Kubeflow there is a special operator, PyTorchJob, that spawns worker nodes with the proper environment variables so that torch.distributed.init_process_group establishes communication between processes.

Pitch

While the user is able to override LightningModule.init_ddp_connection with the following:

    def init_ddp_connection(self, proc_rank: int, world_size: int) -> None:
        torch.distributed.init_process_group(
            'nccl', rank=proc_rank, world_size=world_size)

there's at least one more place that is tightly coupled to SLURM and impedes running inside an ordinary PyTorch distributed environment: the TrainerDDPMixin.ddp_train method:

    def ddp_train(self, gpu_idx, model):
        """
        Entry point into a DP thread
        :param gpu_idx:
        :param model:
        :param cluster_obj:
        :return:
        """
        # node rank using relative slurm id
        # otherwise default to node rank 0
        try:
            node_id = os.environ['SLURM_NODEID']
            self.node_rank = int(node_id)
        except Exception:
            self.node_rank = 0

One possible solution is to add a check for os.environ['RANK'] instead of simply assigning node rank 0 when the SLURM variable is missing; a rough sketch of such a fallback is below.
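
A minimal sketch of this fallback (a hypothetical helper, not the actual Lightning code; the exact variable name to prefer, e.g. NODE_RANK vs. RANK, is an open choice):

    import os

    def resolve_node_rank() -> int:
        """Prefer the SLURM node id, then a generic rank variable, else default to 0."""
        for var in ("SLURM_NODEID", "NODE_RANK", "RANK"):
            if var in os.environ:
                return int(os.environ[var])
        return 0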


@areshytko added the feature and help wanted labels on Apr 2, 2020
@areshytko changed the title from "Make Pytorch-Lightning to work in Kubeflow PyTorchJob" to "Make Pytorch-Lightning DDP work without SLURM" on Apr 6, 2020
@Borda (Member) commented Apr 6, 2020

@PyTorchLightning/core-contributors @williamFalcon ^^

@jeon30c commented May 11, 2020

@areshytko Can we use pytorch-lightning with apex or PyTorch DDP for multi-node training? Can you provide simple examples?

@areshytko (Contributor, Author) commented May 11, 2020

@jeon30c Yes, I'll try to propose additions with examples to the documentation this week.
In short:

  1. Each node in your cluster needs the following environment variables:
  • NODE_RANK, ranging from 0 to N-1, where N is your world size (the number of nodes)
  • MASTER_ADDR with the IP address of the node with node rank 0
  • [optionally] MASTER_PORT with an available port to communicate over
  2. Provide the following arguments to the Trainer instance:

    trainer = pl.Trainer(
        gpus=NUMBER_OF_GPUS_PER_NODE,
        num_nodes=WORLD_SIZE_OF_YOUR_CLUSTER,
        distributed_backend='ddp'  # or ddp2
    )

  3. Run the same script on all nodes in your cluster (a minimal script sketch follows below).
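
For completeness, a hypothetical minimal train.py tying these steps together (ToyModel and the random dataset are just placeholders; 2 GPUs per node and 2 nodes are assumptions):

    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class ToyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    if __name__ == "__main__":
        dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        trainer = pl.Trainer(
            gpus=2,               # assumption: 2 GPUs per node
            num_nodes=2,          # assumption: 2 nodes in the cluster
            distributed_backend='ddp',
        )
        trainer.fit(ToyModel(), DataLoader(dataset, batch_size=32))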

@faizanahemad commented

Is there any documentation on how to train multi-node without SLURM?

@untrix commented Jan 29, 2021

This should go into the official Lightning documentation. There is no mention of this there.

@untrix commented Feb 26, 2021

Hi @areshytko, thanks for adding the doc. I see that you've added WORLD_SIZE, which I wasn't using earlier. Anyway, it was all working with Lightning 1.8 but stopped working after I upgraded to Lightning 1.2.1. Now I get an NCCL error if I use my old commands (i.e. without the WORLD_SIZE env var). In this case I notice that all three nodes initialize global_rank 0 through 7 (members 1-8/24), which means that each one thinks it is the master. If I set the WORLD_SIZE env var, then the scripts just hang on all three nodes.

I do not want to downgrade back to v1.8 because it doesn't have the BaseFinetuning callback, which I am now using. I wish Lightning wasn't so flaky!

@csvance commented Mar 23, 2021

Multi-node DDP still works without SLURM in 1.1.6, but doesn't seem to work in 1.2.4. It appears there was a major refactor of DDPPlugin between those versions.

One other thing: the documentation doesn't mention that you need to set LOCAL_RANK per GPU as well. Say you are training on 2 nodes, each with 2 GPUs. At least in 1.1.6, Lightning won't spawn a process per GPU; you need to set the local rank and start each process yourself.

On the first node:

MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=0 LOCAL_RANK=0 python train.py
MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=0 LOCAL_RANK=1 python train.py

On the second node:

MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=1 LOCAL_RANK=0 python train.py
MASTER_ADDR=MasterNode MASTER_PORT=12345 WORLD_SIZE=4 NODE_RANK=1 LOCAL_RANK=1 python train.py
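
If you don't want to start those by hand, here is a rough per-node launcher sketch in Python (hypothetical; it assumes a train.py like the one above and that MASTER_ADDR, MASTER_PORT, WORLD_SIZE and NODE_RANK are already exported in the shell that runs the launcher):

    import os
    import subprocess
    import sys

    GPUS_PER_NODE = 2  # assumption: matches the 2-GPUs-per-node example above

    procs = []
    for local_rank in range(GPUS_PER_NODE):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # one process per GPU
        procs.append(subprocess.Popen([sys.executable, "train.py"], env=env))

    # Wait for all per-GPU processes and propagate a failure if any exits non-zero.
    sys.exit(max(p.wait() for p in procs))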

@MrChill commented Mar 24, 2021

Is this working for you @csvance? I have trouble running it. It does not detect multiple nodes and hangs forever...

@csvance commented Mar 24, 2021

Is this working for you @csvance? I have trouble running it. It does not detect multiple nodes and hangs forever...

Yes, it is working for me, although I am currently just doing distributed testing. So I am not doing any kind of gradient/parameter syncing, only syncing of results with torch.distributed.all_gather_object after running on the test set.
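
Roughly, the pattern is the following (a sketch: all_gather_object requires PyTorch 1.8+, and local_results stands in for whatever each rank produced on its shard of the test set):

    import torch.distributed as dist

    def gather_test_results(local_results):
        """Collect every rank's test results onto every process."""
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, local_results)
        # gathered[i] now holds rank i's results on every rank.
        return gathered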

In my trainer I set gpus=2, num_nodes=2. I am using PyTorch 1.8, Lightning 1.1.6, CUDA 10.2

Do you see all of your ranks start up? In mine I see this across the four processes:

    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
    initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
    initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
    initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4

Be sure you are starting node rank 0 on your master node.
