feat: add nvidia MIG #258
base: develop
@@ -1,5 +1,6 @@
d /run/cache 0755 root root -
d /run/lock 0755 root root -
d /run/prairiedog 0755 root root -
The systemd package's tmpfiles snippet isn't the right place for this.
I'm also not convinced migmanager should be using prairiedog's temp directory, or that we want to route it all through prairiedog.
We could just add another program to reboot-if-required.service (like this, or via drop-in):
[Service]
Type=oneshot
ExecStart=/usr/bin/prairiedog reboot-if-required
ExecStart=-/usr/bin/nvidia-migmanager reboot-if-required
// If there are multiple devices with the same ID, dedup them to minimize iterations
let unique_ids = present_devices
.iter()
.map(|x| format!("0x{}", x.device().to_uppercase()).clone())
.collect::<HashSet<_>>();
If the system has multiple GPUs, can each GPU be in a different state with respect to MIG?
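To make the concern concrete, here is a minimal sketch of the dedup step in isolation. `PciDevice` and `unique_device_ids` are hypothetical stand-ins, not the types or functions in this PR; the only thing carried over from the diff is the mapping into a `HashSet`. The redundant `.clone()` is dropped here because `format!` already returns an owned `String`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the PCI device type iterated over in the diff.
struct PciDevice {
    /// PCI device ID as a lowercase hex string, e.g. "20b0".
    device: String,
}

impl PciDevice {
    fn device(&self) -> &str {
        &self.device
    }
}

/// Collapse devices that share a PCI device ID into one formatted entry each,
/// mirroring the mapping used in the diff.
fn unique_device_ids(present_devices: &[PciDevice]) -> HashSet<String> {
    present_devices
        .iter()
        .map(|x| format!("0x{}", x.device().to_uppercase()))
        .collect()
}
```

Two GPUs of the same model produce a single entry in the set, so anything derived from `unique_ids` alone cannot tell them apart. If each GPU can be in a different MIG state, that state has to be tracked per device rather than per device ID.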
// If there are multiple devices with the same ID, dedup them to minimize iterations
let unique_ids = present_devices
The code assumes that all GPUs are the same. It would be better to abstract over the possibility of heterogeneous GPUs, and to be precise about which GPU is being operated on.
In other words: return a Result<Vec<NvidiaGpu>>, ensure each NvidiaGpu entry has enough metadata to identify it in calls to nvidia-smi to query the device state for that GPU, and so on.
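A rough sketch of what that shape could look like, offered as one possibility rather than the expected implementation. The field set on `NvidiaGpu`, the `MigMode` enum, the `parse_gpus` helper, and the three-column CSV input (index, UUID, MIG mode) are all assumptions made for illustration; the sketch parses text only and does not invoke nvidia-smi.

```rust
/// MIG state of a single GPU (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
enum MigMode {
    Enabled,
    Disabled,
    /// Any value other than "Enabled" or "Disabled".
    NotAvailable,
}

/// One entry per physical GPU, with enough metadata to address it later.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NvidiaGpu {
    index: u32,
    uuid: String,
    mig_mode: MigMode,
}

/// Parse lines of the assumed form "index, uuid, mig mode" into per-GPU entries.
/// The first malformed line turns the whole result into an error.
fn parse_gpus(csv: &str) -> Result<Vec<NvidiaGpu>, String> {
    csv.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| -> Result<NvidiaGpu, String> {
            let fields: Vec<&str> = line.split(',').map(str::trim).collect();
            let [index, uuid, mode] = fields.as_slice() else {
                return Err(format!("expected 3 fields, got: {line}"));
            };
            let index = index
                .parse::<u32>()
                .map_err(|e| format!("bad index in {line:?}: {e}"))?;
            let mig_mode = match *mode {
                "Enabled" => MigMode::Enabled,
                "Disabled" => MigMode::Disabled,
                _ => MigMode::NotAvailable,
            };
            Ok(NvidiaGpu {
                index,
                uuid: (*uuid).to_string(),
                mig_mode,
            })
        })
        .collect()
}
```

With a list like this, later steps can iterate over GPUs individually and pass each one's index or UUID to whatever command queries or changes its state, instead of assuming every device on the host matches.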
Issue number:
Related:
Description of changes:
Adds the nvidia-migmanager service and binary, which configure the instance with NVIDIA MIG.
Testing done:
kubectl describe node shows 56 GPUs after the instance reboots.
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.