Add Open GPU drivers from NVIDIA #4172

Closed
4 tasks done
yeazelm opened this issue Sep 3, 2024 · 2 comments

Labels
area/accelerated-computing Issues related to GPUs/ASICs type/enhancement New feature or request

Comments

Contributor

yeazelm commented Sep 3, 2024

What I'd like:
I would like to have the NVIDIA Open GPU kernel drivers available in Bottlerocket so that I can use EFA. This is related to #1031.

Ideally the driver would be chosen for the user based on whether the hardware supports the open driver, so that the correct driver is used automatically. The PCI device ID of the NVIDIA card can be used to determine whether it supports the open kernel modules. If it does, the open driver would be chosen instead of the current proprietary drivers in the NVIDIA variants.
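
As a rough illustration of the selection described above, here is a minimal sketch in Rust. It is not the actual ghostdog code. It assumes the IDs are read from the sysfs `vendor` and `device` attributes of a PCI device (which contain hex strings such as `0x10de`), and that the set of open-capable device IDs is passed in as an allowlist. The names `DriverFlavor`, `parse_sysfs_id`, `choose_flavor`, and `open_capable` are hypothetical, and no real device IDs are listed here.

```rust
/// PCI vendor ID assigned to NVIDIA.
const NVIDIA_VENDOR_ID: u16 = 0x10de;

/// Which kernel module flavor to use for a device (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DriverFlavor {
    Open,
    Proprietary,
}

/// Parse a sysfs ID attribute such as "0x10de\n" into a number.
/// Returns None if the contents are not valid hexadecimal.
pub fn parse_sysfs_id(raw: &str) -> Option<u16> {
    let trimmed = raw.trim();
    let hex = trimmed.strip_prefix("0x").unwrap_or(trimmed);
    u16::from_str_radix(hex, 16).ok()
}

/// Decide which flavor a PCI device should use.
///
/// Returns None for non-NVIDIA devices. `open_capable` is an assumed
/// allowlist of device IDs that support the open kernel modules; where
/// that list comes from is outside the scope of this sketch.
pub fn choose_flavor(vendor: u16, device: u16, open_capable: &[u16]) -> Option<DriverFlavor> {
    if vendor != NVIDIA_VENDOR_ID {
        return None;
    }
    if open_capable.contains(&device) {
        Some(DriverFlavor::Open)
    } else {
        // Fall back to the proprietary driver for anything not on the list.
        Some(DriverFlavor::Proprietary)
    }
}
```

Defaulting to the proprietary driver for unlisted devices is a design choice made for this sketch: it keeps existing hardware on the driver it already uses unless the device is known to support the open modules.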

The first step would be to compile the driver from the .run archive in the kmod-5.15-nvidia and kmod-6.1-nvidia packages in the core kit. Both sets of drivers can be included alongside each other, with only the desired modules provided to modprobe on boot.

Currently, driverdog doesn't support the notion of two conflicting drivers (the proprietary and open source drivers use the same names and conflict with each other), so there needs to be a way to ensure driverdog only loads the desired driver. Ideally driverdog remains focused on linking and loading modules, not on deciding which modules might be needed or choosing between configurations. ghostdog is already aware of PCI devices because udev rules call it, so it would be a good place to put PCI device-specific code. driverdog also needs to handle configurations that only call for loading a module, with no linking step. This will involve improving the structure of the configuration files, since they currently assume every module needs linking when not all do (the open GPU driver, for example, won't).
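
To make the configuration problem above concrete, the following Rust sketch models a configuration in which linking is optional per module and two driver sets may share a module name. This is not driverdog's real configuration format or code. Every type and function name (`DriverSet`, `Preparation`, `ModuleEntry`, `Action`, `plan`) is hypothetical, and the sketch assumes the choice of driver set is made elsewhere and handed in.

```rust
/// Which driver set an entry belongs to (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DriverSet {
    Open,
    Proprietary,
}

/// How a module is prepared before it is loaded.
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum Preparation {
    /// Object files must be linked into a module first.
    Link { objects: Vec<String> },
    /// The module is already built; there is nothing to link.
    LoadOnly,
}

/// One module in an assumed configuration. Two entries may share a name
/// as long as they belong to different driver sets.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct ModuleEntry {
    pub name: String,
    pub set: DriverSet,
    pub preparation: Preparation,
}

/// A step the loader would carry out.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Link { module: String, objects: Vec<String> },
    Load { module: String },
}

/// Build the list of steps for the selected driver set only.
/// Entries from the other set are ignored, so same-named modules
/// never conflict. The selection itself is made by the caller.
pub fn plan(entries: &[ModuleEntry], selected: DriverSet) -> Vec<Action> {
    let mut actions = Vec::new();
    for entry in entries.iter().filter(|e| e.set == selected) {
        // Only emit a link step when the entry asks for one.
        if let Preparation::Link { objects } = &entry.preparation {
            actions.push(Action::Link {
                module: entry.name.clone(),
                objects: objects.clone(),
            });
        }
        actions.push(Action::Load { module: entry.name.clone() });
    }
    actions
}
```

Keeping the selection as an input to `plan` matches the goal stated above: the loader stays focused on linking and loading, while the decision about which set applies lives with whatever component inspects the PCI devices.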

The PCI devices are enumerated pretty early in boot and ghostdog will be able to match those devices easily, but providing the right information later when driverdog needs it is the difficult part. Using /run or /etc could be racy because the PCI devices might enumerate before those filesystems are fully ready.

Things needed for this to work:

Any alternatives you've considered:
Providing a different NVIDIA variant that only has the open drivers would work as well, but then users would have to choose the correct variant. That would be the only change, but it would make it more confusing to figure out which variant is the right one.

yeazelm added the type/enhancement and status/needs-triage labels Sep 3, 2024
yeazelm self-assigned this Sep 3, 2024
yeazelm added the area/accelerated-computing label and removed the status/needs-triage label Sep 3, 2024
Contributor Author

yeazelm commented Sep 24, 2024

All the changes have been merged. This will be in the next minor release of the bottlerocket-core-kit.

Contributor Author

yeazelm commented Oct 1, 2024

Bottlerocket 1.24.0 was released with this functionality.

yeazelm closed this as completed Oct 1, 2024