
[BUG] Engine returned by deepspeed.initialize() on the wrong device #1761

Closed
skpig opened this issue Feb 11, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@skpig
Contributor

skpig commented Feb 11, 2022

Describe the bug
According to #662, the --include argument sets CUDA_VISIBLE_DEVICES properly, but the engine returned by `deepspeed.initialize()` is on the wrong device.
To Reproduce
Launch the code below with `deepspeed --include="localhost:3" train.py --deepspeed --deepspeed_config config.json`

import argparse
import os

import deepspeed
import torch


def debug():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    args.local_rank = int(os.environ['LOCAL_RANK'])  # LOCAL_RANK is set by the deepspeed launcher

    model = torch.nn.Linear(10, 10)
    engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                           model_parameters=model.parameters())
    print("Engine is on device:", next(engine.parameters()).device)

Expected behavior
I intend to run the model on device 3, but the engine is placed on device 0.

@skpig
Contributor Author

skpig commented Feb 11, 2022

I've checked the source code of the DeepSpeedEngine class. The device is set from args.local_rank or os.environ['LOCAL_RANK']:
https://github.com/microsoft/DeepSpeed/blob/41ab660b5df3567966935fe8ac3672497bc8690a/deepspeed/runtime/engine.py#L502-L526
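In pseudocode, the selection logic described above reduces to roughly the following. This is a minimal sketch of the behavior, not the actual DeepSpeed source; `resolve_device` is a hypothetical helper name:

```python
import os

def resolve_device(args_local_rank=None):
    # Sketch of the device-selection behavior: prefer the local_rank passed
    # via args, otherwise fall back to the LOCAL_RANK environment variable,
    # defaulting to 0 -- which is why the engine lands on cuda:0 when
    # neither is set to the intended GPU.
    local_rank = args_local_rank
    if local_rank is None:
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

print(resolve_device(3))  # -> cuda:3
```

This explains the symptom above: with `--include="localhost:3"`, the lone visible process still has local rank 0, so the engine resolves to cuda:0 within the visible-device mapping.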

So I ran the following snippet instead:

import argparse
import os

import deepspeed
import torch


def debug():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    os.environ['LOCAL_RANK'] = '3'  # force the engine onto device 3
    args.local_rank = int(os.environ['LOCAL_RANK'])

    model = torch.nn.Linear(1000, 1000)
    engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                           model_parameters=model.parameters())
    print("Engine is on device:", next(engine.parameters()).device)

Now print() shows the engine is on device 3 as expected, but nvidia-smi still shows stray memory allocated on device 0:
NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4
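For what it's worth, the stray allocation on device 0 is consistent with a CUDA context being created on the default device before anything moves to device 3. A minimal sketch of the usual workaround, assuming the goal is physical GPU 3 (the ordering constraint is the point, not the exact values):

```python
import os

# Key constraint: CUDA_VISIBLE_DEVICES must be set before the CUDA runtime
# is initialized (i.e. before `import torch` runs any CUDA call), otherwise
# a context -- and some memory -- may already exist on physical GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# Only now initialize CUDA; the single visible GPU (physical 3) is
# remapped to index 0:
#   import torch
#   torch.cuda.set_device(0)
```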

@loadams
Collaborator

loadams commented Sep 6, 2023

Hi @skpig - apologies for not replying sooner. Is this still a bug you believe you're hitting? And if so, could you confirm it happens with the latest DeepSpeed too?

@loadams loadams self-assigned this Sep 6, 2023
@loadams
Collaborator

loadams commented Sep 26, 2023

Hi @skpig - closing this issue for now given the age of the DeepSpeed version involved. If you are still hitting this, please reply and we will re-open it and gladly work on debugging it. Apologies for the long delay in replying the first time.

@loadams loadams closed this as completed Sep 26, 2023
@motata

motata commented Mar 5, 2024

I'm facing the same problem. Even with --include localhost there's still memory being allocated on GPU 0.
