CUDA_VISIBLE_DEVICES isn't correctly inherited on a SLURM system #1331
Comments
I wanted to quickly follow up on this bug report to see if there has been any discussion on a fix. Per my report, if I'm misunderstanding any of DeepSpeed's functionality, please let me know. Thanks a bunch!
I am having the same issue. Has this been solved?

Update: I realized that by setting `--num_gpus` (or `--include`, etc.) we are basically overwriting the `CUDA_VISIBLE_DEVICES` environment variable. On a SLURM system the node only sees what it gets, so set the number of GPUs via SLURM and do not set `num_gpus` for DeepSpeed. By default, DeepSpeed uses all the GPUs it sees when `num_gpus` is not set.
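A minimal sketch of what that looks like as a SLURM batch script (the resource flags, script, and config names are placeholders for your own setup):

```bash
#!/bin/bash
#SBATCH --job-name=ds_train   # illustrative job name
#SBATCH --gres=gpu:2          # request the GPUs from SLURM, not from DeepSpeed
#SBATCH --ntasks=1

# No --num_gpus / --include here: DeepSpeed then uses every GPU the job can see,
# i.e. whatever the scheduler exposed to it.
deepspeed mycode.py --deepspeed ds_config.json
```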
Same issue here. Any solution?
You can add `--include localhost:0,1` to the DeepSpeed command. In my case my GPUs are 3,4,5,6 on the SLURM node, so I need to use `--include localhost:0,1,2,3`, haha.
I had the same issue but on an SGE cluster; sharing what worked for me in case it helps anyone. My understanding is that when you submit a job with, say, 4 GPUs (4,5,6,7), the scheduler sets `CUDA_VISIBLE_DEVICES` to those values. If you then pass those device numbers to `--include`, the launch fails. What works is to reset `CUDA_VISIBLE_DEVICES` and allow DeepSpeed to figure the devices out on its own. This worked for me at least, and in your case the same reset should apply before launching. To be honest, I don't have a solid explanation for this behaviour, but a naive one would be that DeepSpeed re-indexes the visible devices starting from 0, so the scheduler's device numbers no longer line up. Please let me know if you have a better explanation, and I hope this helps.
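Roughly, what I run looks like the sketch below rather than my exact commands (the script and config names are placeholders, and you may still want `--num_gpus` to match your allocation):

```bash
# The scheduler has already exported the physical device ids it allocated:
echo "Scheduler assigned: $CUDA_VISIBLE_DEVICES"   # e.g. 4,5,6,7

# Reset the variable so the DeepSpeed launcher builds its own device numbering,
# then launch as usual.
unset CUDA_VISIBLE_DEVICES
deepspeed --num_gpus 4 mycode.py --deepspeed ds_config.json
```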
Is there any news on this? I have the same issue, and the solution provided by @ahmedhshahin, while it may work, goes against the SLURM best practice of not modifying the environment variables set by SLURM itself.
I agree, it is not good practice, and a better solution is needed to correctly handle this case.
Describe the bug
This issue occurs on a SLURM cluster where worker nodes equipped with multiple GPUs are shared amongst users. GPUs are given slot number assignments (for example, `0-7` on a node with 8 GPUs), and users may be assigned any number of a node's GPUs by the SLURM scheduler. For example, a SLURM assignment could set `CUDA_VISIBLE_DEVICES` to `4,5`.

Per this bug report, I tried to use the `--include` flag of the `deepspeed` command to pass in my specific GPU numeric assignments (i.e., the values of `CUDA_VISIBLE_DEVICES`) from SLURM. When I tried the following command:

`deepspeed --include localhost:4,5 mycode.py --deepspeed ds_config.json`

I received the following errors:
Note: the value of `echo $CUDA_VISIBLE_DEVICES` was `4,5` in this example.

When I instead tried the following command:

`deepspeed --include localhost:0,1 mycode.py --deepspeed ds_config.json`

I successfully ran the `mycode.py` script with the following output preceding it:

But as I received GPUs that were already in use, my processes ran out of memory, i.e.:

`RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 550.30 MiB already allocated; 2.75 MiB free; 574.00 MiB reserved in total by PyTorch)`
After reading the bug report I referenced above, I noticed that the `world_info` dictionary that's submitted as a base-64 string to the launch script uses `torch.cuda.device_count()` to create the list of GPUs that should be used. Would it instead be possible to inherit the pre-existing `CUDA_VISIBLE_DEVICES` assignments?

As a preliminary work-around, I've been creating a custom `world_info` dictionary and calling the launch script directly, roughly like the sketch below.
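This is a simplified sketch rather than my exact invocation: the port is arbitrary, the hard-coded GPU list just mirrors the `4,5` example above, and the exact launcher flags may differ across DeepSpeed versions.

```bash
# Encode a world_info mapping of host -> physical GPU ids as base64 JSON,
# mirroring what the deepspeed runner normally passes to the launcher.
WORLD_INFO=$(python -c 'import base64, json; print(base64.urlsafe_b64encode(json.dumps({"localhost": [4, 5]}).encode()).decode())')

# Call the launch script directly with the hand-built world_info.
python -m deepspeed.launcher.launch \
    --world_info="$WORLD_INFO" \
    --master_addr=127.0.0.1 \
    --master_port=29500 \
    mycode.py --deepspeed ds_config.json
```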
Hostfile attempt
I also tried with a specified hostfile. This input:
resulted in the following output:
Called via: `deepspeed --hostfile myhostfile mycode.py --deepspeed ds_config.json`
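For reference, my understanding of the hostfile format (an assumption on my part from the DeepSpeed docs, not a copy of my exact file) is one `<hostname> slots=<gpu_count>` line per node:

```bash
# Assumed hostfile layout: one line per node; "slots" is the GPU count to use there.
cat > myhostfile <<'EOF'
localhost slots=2
EOF
```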
I presumed I was simply misunderstanding the syntax, so as an experiment I changed the hostfile to:
I had thought that to mean I was specifying up to 5 GPUs (whereas I'm only assigned / using 2 GPUs in this example). Calling `deepspeed` as above resulted in the node prompting me for its password, which of course was also not desired behavior.

Expected behavior
I expected `deepspeed` to inherit the specific GPU numeric assignments from `CUDA_VISIBLE_DEVICES`. It seems as though DeepSpeed always re-indexes the GPU assignments to start from `0`. I believe this code snippet shows how the values given to the `world_info` dictionary are created (specifically line 306: `list(range(args.num_gpus))`).

ds_report output
Software Version Notes
I'm running Python 3.9.6 within a Conda environment where I've installed `deepspeed` via `pip` (and all other libraries via `conda install`).

Additional Notes and Future Usage
Currently I'm only running these workflows on a single node, which has anywhere from 8 to 16 GPUs. However, I am interested in applying these workflows across multiple nodes that have high-speed interconnects.
Please let me know if you require any further information.