improve device picking #113

Merged
willdumm merged 2 commits into main from wd-improve-pick-device
Feb 12, 2025

Conversation

willdumm
Contributor

@willdumm willdumm commented Feb 12, 2025

Pick the device as before, with one change to how a CUDA GPU is chosen: if the GPUs do not differ in utilization, check whether they differ significantly in allocated memory, and then check which GPU has the fewest running processes.
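
For readers who want the tie-breaking order spelled out, here is a minimal sketch of it. It is not the code in this PR: `GpuStats`, `pick_gpu`, and the `memory_margin_mb` threshold are hypothetical names and values chosen for illustration, and the stats are passed in as plain data instead of being queried from nvidia-smi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuStats:
    """Hypothetical snapshot of one GPU, e.g. as parsed from nvidia-smi."""
    index: int
    utilization: int      # percent
    memory_used_mb: int   # allocated memory
    n_processes: int      # running compute processes


def pick_gpu(stats, memory_margin_mb=500):
    """Return the index of the preferred GPU.

    Order of checks (an assumption based on the PR description):
    1. lowest utilization,
    2. among ties, drop GPUs whose allocated memory exceeds the
       lowest by more than memory_margin_mb (an assumed threshold),
    3. among the rest, fewest running processes (index breaks ties).
    """
    if not stats:
        raise ValueError("no GPUs to choose from")
    lowest_util = min(g.utilization for g in stats)
    tied = [g for g in stats if g.utilization == lowest_util]
    lowest_mem = min(g.memory_used_mb for g in tied)
    tied = [g for g in tied if g.memory_used_mb - lowest_mem <= memory_margin_mb]
    return min(tied, key=lambda g: (g.n_processes, g.index)).index


if __name__ == "__main__":
    snapshot = [
        GpuStats(0, 0, 4000, 2),  # idle, but much more memory allocated
        GpuStats(1, 0, 300, 1),   # idle, little memory, one process
        GpuStats(2, 0, 200, 3),   # idle, little memory, three processes
        GpuStats(3, 50, 0, 0),    # busy
    ]
    print(pick_gpu(snapshot))
```

With this snapshot, GPUs 0 to 2 tie on utilization, GPU 0 is dropped for its memory use, and GPU 1 wins over GPU 2 because it has fewer processes.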

Also, so that the result of pick_device registers with nvidia-smi right away, pick_device immediately sends a tensor to the chosen device whenever it returns a CUDA GPU.
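
The idea of claiming the device can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `claim_device` is a hypothetical helper, and it falls back to "cpu" when PyTorch or CUDA is unavailable so that the snippet runs anywhere.

```python
def claim_device(device_name):
    """Touch the chosen device so the process becomes visible on it.

    Hypothetical helper: returns the device name that was actually used.
    """
    try:
        import torch
    except ImportError:
        # PyTorch not installed; nothing to claim.
        return "cpu"
    if device_name.startswith("cuda") and torch.cuda.is_available():
        # Allocating any tensor on the GPU forces CUDA initialization
        # on that device, which is what should make this process show
        # up in nvidia-smi.
        torch.zeros(1, device=device_name)
        return device_name
    return "cpu"


if __name__ == "__main__":
    print(claim_device("cuda:0"))
```

A one-element tensor is enough here because the point is the side effect of initializing CUDA on the device, not the data itself.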

This removes the need to pick devices twice in train functions, where the second pick_device call came after a random waiting period. I think we still need the random waiting period, but it can be shorter and can come immediately before the first call to pick_device. Currently, the waiting period is rarely long enough for usage from other simultaneous processes to show up in nvidia-smi. (These changes will be implemented in a separate dns-experiments PR.)
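
A rough sketch of the proposed ordering (short random wait, then a single device pick) might look like this. The names `wait_then_pick` and `max_wait_seconds`, and the 5-second default, are assumptions for illustration; the actual change is left to the separate dns-experiments PR.

```python
import random
import time


def wait_then_pick(pick_device, max_wait_seconds=5.0, sleep=time.sleep, rng=random):
    """Sleep for a random delay, then pick the device exactly once.

    The random delay staggers processes launched at the same moment,
    so that an earlier process has usually claimed its GPU before a
    later one looks at the device statistics.
    """
    delay = rng.uniform(0.0, max_wait_seconds)
    sleep(delay)
    return pick_device()


if __name__ == "__main__":
    # Stand-in for the real pick_device; returns a fixed device name.
    print(wait_then_pick(lambda: "cuda:0", max_wait_seconds=0.1))
```

Passing `sleep` and `rng` as parameters keeps the helper easy to test without actually waiting.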

@willdumm willdumm merged commit 5d47176 into main Feb 12, 2025
2 checks passed
@willdumm willdumm deleted the wd-improve-pick-device branch February 12, 2025 16:47
@willdumm willdumm restored the wd-improve-pick-device branch February 12, 2025 16:53
willdumm added a commit that referenced this pull request Feb 12, 2025