
[BUG] Engine returned by deepspeed.initialize() on the wrong device #1761

Closed
skpig opened this issue Feb 11, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@skpig
Contributor

skpig commented Feb 11, 2022

Describe the bug
According to #662, the --include argument sets CUDA_VISIBLE_DEVICES properly, but the engine returned by `deepspeed.initialize()` is on the wrong device.
To Reproduce
Launch the code below with `deepspeed --include="localhost:3" train.py --deepspeed --deepspeed_config config.json`

import argparse
import os

import deepspeed
import torch


def debug():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    args.local_rank = int(os.environ['LOCAL_RANK'])  # LOCAL_RANK is set by the deepspeed launcher

    model = torch.nn.Linear(10, 10)
    engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                           model_parameters=model.parameters())
    print("Engine is on device:", next(engine.parameters()).device)

Expected behavior
I intend to run the model on device 3, but the engine is placed on device 0.

@skpig
Contributor Author

skpig commented Feb 11, 2022

I've checked the source code of the DeepSpeedEngine class. The device is set from args.local_rank or os.environ['LOCAL_RANK']:
https://github.com/microsoft/DeepSpeed/blob/41ab660b5df3567966935fe8ac3672497bc8690a/deepspeed/runtime/engine.py#L502-L526
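In pseudocode, the selection logic described above reduces to roughly the following. This is a minimal sketch of the behavior, not the actual DeepSpeed source; `resolve_device` is a hypothetical helper name:

```python
import os

def resolve_device(args_local_rank=None):
    # Sketch of the device-selection behavior: prefer the local_rank passed
    # via args, otherwise fall back to the LOCAL_RANK environment variable,
    # defaulting to 0 -- which is why the engine lands on cuda:0 when
    # neither is set to the intended GPU.
    local_rank = args_local_rank
    if local_rank is None:
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

print(resolve_device(3))  # -> cuda:3
```

This explains the symptom above: with `--include="localhost:3"`, the lone visible process still has local rank 0, so the engine resolves to cuda:0 within the visible-device mapping.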

So I ran the following snippet instead:

import argparse
import os

import deepspeed
import torch


def debug():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    os.environ['LOCAL_RANK'] = '3'  # force the engine onto device 3
    args.local_rank = int(os.environ['LOCAL_RANK'])

    model = torch.nn.Linear(1000, 1000)
    engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                           model_parameters=model.parameters())
    print("Engine is on device:", next(engine.parameters()).device)

Now print() shows the engine is on device 3 as expected, but nvidia-smi still shows stray memory allocated on device 0:
NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4
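For what it's worth, the stray allocation on device 0 is consistent with a CUDA context being created on the default device before anything moves to device 3. A minimal sketch of the usual workaround, assuming the goal is physical GPU 3 (the ordering constraint is the point, not the exact values):

```python
import os

# Key constraint: CUDA_VISIBLE_DEVICES must be set before the CUDA runtime
# is initialized (i.e. before `import torch` runs any CUDA call), otherwise
# a context -- and some memory -- may already exist on physical GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# Only now initialize CUDA; the single visible GPU (physical 3) is
# remapped to index 0:
#   import torch
#   torch.cuda.set_device(0)
```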

@loadams
Collaborator

loadams commented Sep 6, 2023

Hi @skpig - apologies for not replying sooner. Is this still a bug you believe you're hitting? And if so, could you confirm it happens with the latest DeepSpeed too?

@loadams loadams self-assigned this Sep 6, 2023
@loadams
Collaborator

loadams commented Sep 26, 2023

Hi @skpig - closing this issue for now given the age of the DeepSpeed version involved. If you are still hitting this, please reply and we will re-open it and gladly work on debugging it. Apologies for the long delay in replying the first time.

@loadams loadams closed this as completed Sep 26, 2023
@motata

motata commented Mar 5, 2024

I'm facing the same problem. Even with --include localhost there's still memory being allocated on GPU 0.
