CDI: Add `mount-nvidia-binaries` and `mount-nvidia-docker-1-directories` options #290979
Conversation
They are not producing a broken image. When you use CUDA in container images, it's common practice to do this. With nvidia-docker, you would:

Later on (I don't know the exact timeline or when that happened), they changed the story a little bit:

It's not ideal, and it's probably not the most stable thing, but it's what we have. Many, many, many container images use this as of today. Our interest as NixOS developers is to find the balance between a perfect implementation and letting people run their stuff without getting in their way as much as possible.
If they actually enforce that these paths exist, they are broken :) If they were not broken, their …

They are not broken, again. They just took the decision that the container images would set LD_LIBRARY_PATH to the place where the runtime wrapper will mount the host libraries (or augment ld.so.conf.d with it); see the official example of creating a CUDA image from NVIDIA. The reality is that these images exist in the wild, a lot of them. We cannot fix them; they are already built and published. We can adapt to this reality and make our users' lives a bit easier, that's all.
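For illustration only, here is a minimal sketch (not from this thread) of how an image following that convention might be built with nixpkgs' dockerTools: the image ships no driver libraries and instead points `LD_LIBRARY_PATH` at the directories the runtime wrapper is expected to populate from the host. The image name and command are placeholders.

```nix
{ pkgs ? import <nixpkgs> { } }:

# Hypothetical image following the old nvidia-docker convention: no driver
# libraries inside the image; LD_LIBRARY_PATH points at the directories the
# runtime wrapper is expected to mount from the host.
pkgs.dockerTools.buildImage {
  name = "cuda-app-example"; # placeholder name
  config = {
    Env = [ "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64" ];
    Cmd = [ "/bin/my-cuda-app" ]; # placeholder entrypoint
  };
}
```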
The …

EDIT: I don't disagree that we should provide a smooth experience; I just think we're mistaken about why and how the failure occurs (if it does), and thus about the method of addressing it.

For the context of this PR, it doesn't matter whether it's LD_LIBRARY_PATH or ld.so.conf.d. Side note: I think applications depending on CUDA tend to do a dlopen, which is affected by … In any case, I did show both:

The …

Does your example (ollama) actually fail without these extra mounts?

No, if the code in the container does a …

No, …

Yes, you are right, my bad. In any case, we still need to mount this directory.
Why, have we seen any failures?

As per the PR description:

Very good, and sorry I missed this. Could you publish the logs so we can see what the error was?

Note the last line: "no GPU detected".

So yeah, this instance is pretty hardcoded in ollama, which we could fix to make it a better citizen, to be honest: https://github.com/ollama/ollama/blob/2a4b128ae3e3a18b10e8701aca2434d401eaa7ba/gpu/gpu.go#L37-L51 But honestly, I think this is a very common practice, I fear.

Aha. I'd need to dig deeper, but looking at https://github.com/ollama/ollama/blob/2a4b128ae3e3a18b10e8701aca2434d401eaa7ba/gpu/gpu.go#L38-L50, it's more than likely they do not use …

@SomeoneSerge we basically saw the same thing at the same time and wrote the same thing hehe. I would settle it like this: it makes some broken published software work, and the old wrapper was doing it as well; also, we don't lose much in the process. Addendum: I also fear this is not the only instance, but I don't have more evidence at hand to back that statement.

I have to say: … and thanks for pushing to find the real root cause :)

Ah, you were faster :)

What I'm thinking is, and this concerns the …

Fairly hard to say. This increases the pollution/complexity of the environment, i.e. it makes errors and successes less deterministic.
I think we can implement the typed option for CDI specs. I would suggest doing so as a follow-up after this has been merged, providing a way for users to specify their own CDI configurations statically with proper type checking.
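Purely as a sketch of what such a follow-up could look like, a statically defined CDI spec could be type-checked with a NixOS submodule. The option path and fields below are hypothetical and not part of this PR.

```nix
{ lib, ... }:

{
  # Hypothetical option for statically defined, type-checked CDI specs.
  # Option path and fields are illustrative only.
  options.virtualisation.containers.cdi.specs = lib.mkOption {
    type = lib.types.attrsOf (lib.types.submodule {
      options = {
        cdiVersion = lib.mkOption {
          type = lib.types.str;
          default = "0.5.0";
          description = "CDI specification version.";
        };
        kind = lib.mkOption {
          type = lib.types.str;
          example = "nvidia.com/gpu";
          description = "Device kind this spec provides.";
        };
        devices = lib.mkOption {
          type = lib.types.listOf lib.types.attrs;
          default = [ ];
          description = "CDI device entries.";
        };
      };
    });
    default = { };
    description = "Statically defined CDI specifications.";
  };
}
```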
Right, thanks for the reminder. Adding it later today.
Just for the sake of notifying: this was pushed that very same day already. cc @SomeoneSerge
The updated PR removes the previously introduced "generic" options (…).

I updated #290609 to reflect this new direction, and I updated the message in the "feature freeze" issue to refer to other issues with the current CDI implementation. Notably, we still need to follow up with a fix for the incomplete deployment issue (dependencies of the host drivers and tools not being mounted).

Before this PR is merged I'd like the commit history to be updated. Ideally, the PR would be split into two commits: one for the deprecation of the previous interface (the file moves, attribute renamings, &c.), and one for the new feature (the FHS bin/libs paths).

@ereslibre thanks again for driving this
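For readers following along, here is a hedged sketch of what the interface change amounts to on the configuration side. The old attribute path is the one named in the test notes below; the new enable option path is an assumption based on the `hardware.nvidia-container-toolkit` module path used in this PR.

```nix
{
  # Previous interface, deprecated by this PR:
  # virtualisation.containers.cdi.dynamic.nvidia.enable = true;

  # New interface; the exact enable option path is an assumption based on the
  # module path used in this PR:
  hardware.nvidia-container-toolkit.enable = true;
}
```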
@SomeoneSerge: thank you. Split into two commits as you suggested. I will post in https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/ when it's merged.

Absolutely, we have to follow up on this. Thanks for your feedback during this process.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/47
…s.cdi.dynamic.nvidia.enable`

Add the NixOS option `hardware.nvidia-container-toolkit-cdi-generator.enable`. This enables the ability to expose GPUs in containers for container runtimes that support the Container Device Interface (CDI).

Remove the `cdi.static` and `cdi.dynamic.nvidia.enable` attributes.

…ount-nvidia-docker-1-directories` options

- `mount-nvidia-binaries`: this option allows users to avoid mounting NVIDIA binaries on the container.
- `mount-nvidia-docker-1-directories`: this option allows users to avoid mounting `/usr/local/nvidia/lib{,64}` on containers.
If you've tested the runtime, I'm happy to merge now (I can test later)
I just re-ran it, but let me do a more thorough check before finally merging; everything should be fine, but I will comment here. I will also double-check the Docker issue, and if it's what I presume, I will open another PR to fix that so it also makes it into 24.05.
Confirmed that it's working fine here in all the cases I tested. |
Add three options to `hardware.nvidia-container-toolkit`:

- `mounts`: list of mounts that allows mounting arbitrary paths on CDI-enabled containers.
- `mount-nvidia-binaries`: this option allows users to avoid mounting NVIDIA binaries on the container.
- `mount-nvidia-docker-1-directories`: this option allows users to avoid mounting `/usr/local/nvidia/lib{,64}` on containers.

Related: #290609
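A hedged usage sketch of the three options described above: the option names follow this PR's description, the `mounts` entry field names (`hostPath`, `containerPath`) are an assumption about the submodule, and all paths are placeholders.

```nix
{
  hardware.nvidia-container-toolkit = {
    enable = true;

    # Do not mount the NVIDIA binaries (e.g. nvidia-smi) into containers.
    mount-nvidia-binaries = false;

    # Do not mount the legacy nvidia-docker 1.0 directories
    # (/usr/local/nvidia/lib{,64}) into containers.
    mount-nvidia-docker-1-directories = false;

    # Expose additional host paths inside CDI-enabled containers.
    # Field names (hostPath/containerPath) are assumed; paths are placeholders.
    mounts = [
      {
        hostPath = "/path/on/host";
        containerPath = "/path/in/container";
      }
    ];
  };
}
```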
Description of changes
Things done
I have tested that `podman run --device=nvidia.com/gpu=all -d -p 11434:11434 --name ollama ollama/ollama` works as expected with this change, and that it is able to use all the GPUs in my system without any change. My NixOS configuration sets `virtualisation.containers.cdi.dynamic.nvidia.enable = true;`.
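For reference, a minimal sketch of a NixOS configuration that should reproduce this test, using the (now deprecated) option named above; using podman as the container runtime is an assumption based on the command shown.

```nix
{
  # Container runtime used for the test above (assumption: podman).
  virtualisation.podman.enable = true;

  # Option named in the test notes; deprecated in favour of
  # `hardware.nvidia-container-toolkit.enable` by this PR.
  virtualisation.containers.cdi.dynamic.nvidia.enable = true;
}
```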