improve device picking #113

willdumm · 2025-02-12T00:39:52Z

Pick the device as before, with one change to how a CUDA GPU is chosen: if the GPUs show no difference in utilization, check whether there is a significant difference in allocated memory, and then check which GPU has the fewest running processes.
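The tie-breaking order described above can be sketched as follows. This is an illustration, not the code in this PR: `GpuStats`, `pick_gpu`, and the `util_tol` and `mem_tol` thresholds are hypothetical names and values, and the stats are assumed to have been gathered already (for example by parsing nvidia-smi output).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuStats:
    """Hypothetical per-GPU snapshot; field names are assumptions."""
    index: int
    utilization: float       # percent busy
    memory_allocated: float  # MiB in use
    n_processes: int         # running compute processes


def pick_gpu(stats: List[GpuStats], util_tol: float = 5.0,
             mem_tol: float = 512.0) -> int:
    """Return the index of the preferred GPU.

    Order of criteria: utilization, then allocated memory, then
    number of running processes. The tolerances are assumed values
    that decide what counts as "no difference".
    """
    if not stats:
        raise ValueError("no GPUs to choose from")

    # Keep GPUs whose utilization is within util_tol of the lowest.
    min_util = min(s.utilization for s in stats)
    candidates = [s for s in stats if s.utilization - min_util <= util_tol]

    # If utilization did not single one out, drop GPUs that hold
    # significantly more allocated memory than the least-loaded one.
    if len(candidates) > 1:
        min_mem = min(s.memory_allocated for s in candidates)
        candidates = [s for s in candidates
                      if s.memory_allocated - min_mem <= mem_tol]

    # Among what is left, prefer the fewest processes, then lowest index.
    best = min(candidates, key=lambda s: (s.n_processes, s.index))
    return best.index
```

With two idle GPUs holding similar memory, the one with fewer processes wins; a GPU with much more allocated memory is excluded before processes are counted.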

Also, when pick_device returns a CUDA GPU, it now immediately sends a tensor to that device, so that the choice shows up in nvidia-smi right away.
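A minimal sketch of that step, assuming PyTorch and a device given as a string such as `"cuda:0"`. The helper name `claim_device` is hypothetical, and the actual change may do this differently inside pick_device.

```python
def claim_device(device_str):
    """Place a tiny tensor on a CUDA device so the process appears
    in nvidia-smi. Returns the tensor, or None for non-CUDA devices.

    Hypothetical helper; not the function added in this PR.
    """
    if not device_str.startswith("cuda"):
        # Nothing to register for CPU or other backends.
        return None
    # Imported lazily so the non-CUDA path works without PyTorch.
    import torch
    # Allocating on the device creates a CUDA context for this process.
    return torch.zeros(1, device=device_str)
```

Whether the caller needs to hold on to the returned tensor is left open here.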

This removes the need to pick devices twice in train functions, where the second pick_device call came after a random waiting period. I think we still need the random waiting period, but it can be shorter, and it can come immediately before the first call to pick_device. Currently, the waiting period is rarely long enough for usage from other simultaneous processes to show up in nvidia-smi. (These changes will be implemented in a separate dns-experiments PR.)
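The proposed ordering for train functions (one short random wait, then a single device pick) could look roughly like this. `wait_then_pick` and `max_wait_seconds` are hypothetical, and the 2-second default is an assumed value, not one taken from this PR or from dns-experiments.

```python
import random
import time


def wait_then_pick(pick_device, max_wait_seconds=2.0):
    """Sleep for a random interval, then pick a device once.

    `pick_device` is any zero-argument callable that returns a device.
    The random delay staggers processes launched at the same time so
    that each is more likely to see the others' GPU usage.
    """
    delay = random.uniform(0.0, max_wait_seconds)
    time.sleep(delay)
    return pick_device()
```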

This reverts commit 5d47176.

improve device picking

567450f

matsen approved these changes Feb 12, 2025

format

bb3798f

willdumm merged commit 5d47176 into main Feb 12, 2025
2 checks passed

willdumm deleted the wd-improve-pick-device branch February 12, 2025 16:47

willdumm restored the wd-improve-pick-device branch February 12, 2025 16:53

willdumm added a commit that referenced this pull request Feb 12, 2025

Revert "improve device picking (#113)"

e85f8b5

This reverts commit 5d47176.
