
pillar: Add support to CDI for native containers #4265

Merged — 3 commits merged into lf-edge:master from introduce-pillar-cdi on Sep 19, 2024

Conversation

@rene (Contributor, Author) commented on Sep 17, 2024

The Container Device Interface (CDI) is already available on containerd, and there are CDI files provided for the NVIDIA platform that describe GPU devices for native container access. This commit introduces support for CDI in pillar, so Video I/O adapters can pass a CDI device name to enable direct access in native containers.
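For readers unfamiliar with the format, a CDI spec is a small YAML file that maps a vendor-qualified device name to the OCI spec edits needed to expose the device. A minimal illustrative sketch follows (all paths and names are placeholders, not the actual file shipped for NVIDIA platforms):

```yaml
# Illustrative CDI spec; all paths and names below are placeholders.
cdiVersion: "0.6.0"
kind: "nvidia.com/orin-gpu"           # vendor-qualified device kind
devices:
  - name: "all"                       # requested as nvidia.com/orin-gpu=all
    containerEdits:
      deviceNodes:
        - path: /dev/nvhost-ctrl-gpu  # placeholder GPU device node
      mounts:
        - hostPath: /usr/lib/aarch64-linux-gnu/libcuda.so
          containerPath: /usr/lib/aarch64-linux-gnu/libcuda.so
          options: ["ro", "bind"]
```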

The corresponding documentation is also provided.

go mod tidy and go mod vendor

Signed-off-by: Renê de Souza Pinto <[email protected]>
@deitch (Contributor) commented on Sep 17, 2024

Two questions.

First, I had thought that the CDI was going to be in /opt/vendor/<vendor>/, so that the CDI file is in a known place, vendor-specific, and thus is already mounted into pillar? CDI files do not have to be in /etc/cdi/, it is configurable.

Second, I could not figure out where you consume the CDI files. Or are you assuming that by default, it will look in /etc/cdi/?

@rene (Contributor, Author) commented on Sep 17, 2024

> Two questions.
>
> First, I had thought that the CDI was going to be in /opt/vendor/<vendor>/, so that the CDI file is in a known place, vendor-specific, and thus is already mounted into pillar? CDI files do not have to be in /etc/cdi/, it is configurable.

  1. We need to set the CDI directory in containerd's config file (https://github.com/lf-edge/eve/blob/master/pkg/dom0-ztools/rootfs/etc/containerd/config.toml), but directories like /opt/vendor/nvidia/cdi will not be available on all platforms, which throws an error when the directory doesn't exist (see the config sketch after this list).
  2. By using /etc/cdi I can push any CDI file directly to this directory without needing to change /etc/containerd/config.toml, which is part of pkg/dom0-ztools.
  3. In my tests, configuring a directory other than /etc/cdi in the config.toml didn't work; I didn't have time to debug more deeply. Once I figure it out, maybe the custom CDI directory can be implemented...
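For reference, containerd's CRI plugin (containerd ≥ 1.7) exposes CDI settings like the ones below, with /etc/cdi and /var/run/cdi as the default spec directories; whether EVE's config.toml uses this exact section is an assumption here, not something this PR confirms:

```toml
# containerd CRI plugin CDI settings (defaults shown);
# EVE's actual config.toml section may differ.
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```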

> Second, I could not figure out where you consume the CDI files. Or are you assuming that by default, it will look in /etc/cdi/?

Here: https://github.com/lf-edge/eve/pull/4265/files#diff-b4bffe1f639681bb5f53d8804f2010200cc30b19a149577433ae128d00e03175R181

I just need to pass the device string; containerd's CDI plugin does the whole job of looking for the CDI files (in the directory configured in config.toml), consuming them, and populating the OCI spec...
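As a sketch of what "just passing the device string" looks like from a containerd Go client (assuming containerd ≥ 1.7, whose oci package ships a WithCDIDevices helper; this is illustrative, not the exact pillar code in the diff above):

```go
package main

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/oci"
)

// newGPUContainer is an illustrative helper; the container ID, snapshot
// ID, and CDI device name are placeholders.
func newGPUContainer(ctx context.Context, client *containerd.Client, img containerd.Image) (containerd.Container, error) {
	return client.NewContainer(ctx, "edge-app",
		containerd.WithNewSnapshot("edge-app-snap", img),
		containerd.WithNewSpec(
			oci.WithImageConfig(img),
			// WithCDIDevices resolves the fully qualified device name
			// against the CDI spec directories (/etc/cdi and
			// /var/run/cdi by default) and applies the containerEdits
			// from the matching spec file to the OCI runtime spec.
			oci.WithCDIDevices("nvidia.com/orin-gpu=all"),
		),
	)
}
```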

@rene force-pushed the introduce-pillar-cdi branch from bc9c930 to 391d4b9 on September 17, 2024 16:39
@deitch (Contributor) commented on Sep 17, 2024

> containerd's config file (https://github.com/lf-edge/eve/blob/master/pkg/dom0-ztools/rootfs/etc/containerd/config.toml)

Isn't that system containerd, and we want this for user containerd? Although a session with Paul earlier today showed that user apps are running on system containerd? I lost track. 🤷‍♂️

> By using /etc/cdi I can push any CDI file directly to this directory without needing to change /etc/containerd/config.toml, which is part of pkg/dom0-ztools

So what would be the process for a specific vendor? The root filesystem is immutable, so /etc/cdi (which is mounted into the pillar container) should be immutable. How do I get my CDI files in there? Or is the assumption that they are part of the rootfs build? How would that work?

We do have the pillar /opt/vendor/*/init.sh process, where we could just link /etc/cdi -> /opt/vendor/<foo>/cdi/? I am not sure that is better, though. I always have a concern with modifying things on the root filesystem and mounting it in.

> In my tests, configuring a directory other than /etc/cdi in the config.toml didn't work

Different containerd? 🤷‍♂️

@rene (Contributor, Author) commented on Sep 18, 2024

> > containerd's config file (https://github.com/lf-edge/eve/blob/master/pkg/dom0-ztools/rootfs/etc/containerd/config.toml)
>
> Isn't that system containerd, and we want this for user containerd? Although a session with @christoph-zededa earlier today showed that user apps are running on system containerd? I lost track. 🤷‍♂️

No, the user containerd is used only for CAS; user apps run on the system containerd...

> > By using /etc/cdi I can push any CDI file directly to this directory without needing to change /etc/containerd/config.toml, which is part of pkg/dom0-ztools
>
> So what would be the process for a specific vendor? The root filesystem is immutable, so /etc/cdi (which is mounted into the pillar container) should be immutable. How do I get my CDI files in there? Or is the assumption that they are part of the rootfs build? How would that work?
>
> We do have the pillar /opt/vendor/*/init.sh process, where we could just link /etc/cdi -> /opt/vendor/<foo>/cdi/? I am not sure that is better, though. I always have a concern with modifying things on the root filesystem and mounting it in.

Me too; I have a security concern about allowing CDI files at runtime, so the assumption is that any CDI file must be part of the rootfs build, which IMO makes sense, since they are usually very hardware-specific and will be provided by specific packages, such as pkg/nvidia...

> > In my tests, configuring a directory other than /etc/cdi in the config.toml didn't work
>
> Different containerd? 🤷‍♂️

Unfortunately not; the containerd config is correct, and it's the system containerd...

The Container Device Interface (CDI) is already available on containerd, and there are CDI files provided for the NVIDIA platform that describe GPU devices for native container access. This commit introduces support for CDI in pillar, so Video I/O adapters can pass a CDI device name to enable direct access in native containers.

Signed-off-by: Renê de Souza Pinto <[email protected]>
@rene force-pushed the introduce-pillar-cdi branch from 391d4b9 to e99de15 on September 18, 2024 10:49
@deitch (Contributor) commented on Sep 18, 2024

> No, the user containerd is used only for CAS; user apps run on the system containerd

Yeah, that came up yesterday. We probably should change that, but well beyond the scope of this PR.

> Me too; I have a security concern about allowing CDI files at runtime, so the assumption is that any CDI file must be part of the rootfs build, which IMO makes sense, since they are usually very hardware-specific and will be provided by specific packages, such as pkg/nvidia...

What happens if you have 100 different devices, all of the same family, with slightly different CDI? Are you going to have 100 different rootfs builds? Or just one, with multiple CDIs, and the ability to detect each? This gets us very much down the road of different builds because of a single few KB config file in /etc/.

> Unfortunately not; the containerd config is correct, and it's the system containerd

So, we mount /etc/cdi into pillar, just so that we can retrieve the devices and modify the container spec, but in the end that gets passed to system containerd anyway, which runs (by definition) outside of pillar?

I didn't quite get what we are doing with that chunk of code inside pillar. We inject the devices into the container spec based on the name. Essentially, we are duplicating what containerd normally does?

@rene (Contributor, Author) commented on Sep 18, 2024

> What happens if you have 100 different devices, all of the same family, with slightly different CDI? Are you going to have 100 different rootfs builds? Or just one, with multiple CDIs, and the ability to detect each? This gets us very much down the road of different builds because of a single few KB config file in /etc/.

It's totally fine to have multiple CDI files under /etc/cdi; we don't need a single build per device. Actually, that's how it's working for NVIDIA: we have CDI files for both Xavier and Orin boards in the same build. Inside each file we use different names for the device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin...
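In other words, the two specs can coexist under /etc/cdi because each file declares a distinct kind, and the fully qualified device name carries that kind (illustrative headers only; device names are placeholders):

```yaml
# /etc/cdi/xavier.yaml (illustrative): requested as nvidia.com/xavier-gpu=<device>
cdiVersion: "0.6.0"
kind: "nvidia.com/xavier-gpu"

# /etc/cdi/orin.yaml (illustrative): requested as nvidia.com/orin-gpu=<device>
cdiVersion: "0.6.0"
kind: "nvidia.com/orin-gpu"
```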

> > Unfortunately not; the containerd config is correct, and it's the system containerd
>
> So, we mount /etc/cdi into pillar, just so that we can retrieve the devices and modify the container spec, but in the end that gets passed to system containerd anyway, which runs (by definition) outside of pillar?

Yes, this can be improved when we move execution of Edge Apps to user containerd...

> I didn't quite get what we are doing with that chunk of code inside pillar. We inject the devices into the container spec based on the name. Essentially, we are duplicating what containerd normally does?

No, this parses the I/O adapter list from the device model. The way we give direct access to the GPU is exactly the same as for passthrough PCI devices; the difference is that instead of giving a PCIe bus address in the device model, we pass the CDI string for the particular device. CDI is only used for GPU access for now; for serial devices and any other device (like a webcam under /dev/video0) we read them from the device model and add them to the OCI spec. For standard containers (with a shim VM) we just give full access to all file devices. The documentation being added has some examples and a better description: https://github.com/lf-edge/eve/pull/4265/files#diff-8230cff5878b3df207474c79828836840673a5ac49fdc808ba034809902cac96
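As a rough sketch of that branching (hypothetical type and field names, not pillar's actual API; the real code is in the linked diff): if the adapter carries a "cdi" attribute, hand it to the CDI machinery, otherwise pass the raw device node through.

```go
package main

import "github.com/containerd/containerd/oci"

// ioAdapter is a hypothetical, simplified stand-in for the device-model
// I/O adapter that pillar parses; field names are placeholders.
type ioAdapter struct {
	cbattr  map[string]string // e.g. {"cdi": "nvidia.com/orin-gpu=all"}
	devPath string            // e.g. "/dev/ttyUSB0" or "/dev/video0"
}

// specOptsFor returns the OCI spec options for a single adapter.
func specOptsFor(ad ioAdapter, nativeContainer bool) []oci.SpecOpts {
	if cdiName, ok := ad.cbattr["cdi"]; ok && nativeContainer {
		// Native containers only: the CDI string stands in for a PCIe
		// bus address and is expanded into OCI spec edits by CDI.
		return []oci.SpecOpts{oci.WithCDIDevices(cdiName)}
	}
	// Serial ports, webcams, etc.: add the device node directly.
	return []oci.SpecOpts{oci.WithDevices(ad.devPath, ad.devPath, "rwm")}
}
```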

@deitch (Contributor) commented on Sep 18, 2024

> It's totally fine to have multiple CDI files under /etc/cdi; we don't need a single build per device. Actually, that's how it's working for NVIDIA: we have CDI files for both Xavier and Orin boards in the same build. Inside each file we use different names for the device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin...

OK, that works, thanks.

> No, this parses the I/O adapter list from the device model. The way we give direct access to the GPU is exactly the same as for passthrough PCI devices; the difference is that instead of giving a PCIe bus address in the device model, we pass the CDI string for the particular device. CDI is only used for GPU access for now; for serial devices and any other device (like a webcam under /dev/video0) we read them from the device model and add them to the OCI spec. For standard containers (with a shim VM) we just give full access to all file devices. The documentation being added has some examples and a better description

So this is about translating between what came in the EVE API request for devices to pass to the app, and the CDI file format, so that it will know what to do with it? That would make sense to me, but the doc to which you linked implies that the spec comes with the CDI attribute? So why translate?

@rene (Contributor, Author) commented on Sep 18, 2024

> So this is about translating between what came in the EVE API request for devices to pass to the app, and the CDI file format, so that it will know what to do with it? That would make sense to me, but the doc to which you linked implies that the spec comes with the CDI attribute? So why translate?

I don't know if I understood your question, but the main idea is that the CDI string works as a "hardware ID", the same way we specify PCIe bus addresses, network interface names (eth0, wlan0), etc., in the hardware device model. For this particular case we specify the CDI string, which points to the device described in the CDI file... This makes the "native container GPU passthrough" work transparently with the controller, so the user can pass through a GPU to a native container the same way they pass through a PCI video card on x86, for example...
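To make the "hardware ID" analogy concrete, a device-model I/O adapter entry could look roughly like the following. Only the "cdi" key under "cbattr" is the mechanism discussed in this PR; every label and value shown is a placeholder, not a real device model:

```json
{
  "phylabel": "GPU0",
  "logicallabel": "GPU0",
  "assigngrp": "gpu0",
  "cbattr": {
    "cdi": "nvidia.com/orin-gpu=all"
  }
}
```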

Inline review comment on the added documentation, on the line:

> If two containers need to access the same GPU (when supported), two I/O adapters must be defined. For instance, two [...]

@OhmSpectator (Member) commented:

So, the number of these files depends on how many App Instances will use them, am I right? It means: dynamic? In this case, is it OK to store the file in the RO partition?

@deitch (Contributor) replied:

I don't think so, but I'll let Rene confirm. You have one or more CDI yaml files that describe the devices on the machine and how to access them (including device nodes, necessary mounts, and all sorts of instruction stuff). Whether it is 1 or 50 apps, only the defined CDI files for the device are used. They are read-only.

@rene (Contributor, Author) replied:

Exactly, @deitch... @OhmSpectator, we must have one GPU I/O adapter per Edge App because an I/O adapter can be passed through to only one VM. That makes total sense for PCI devices, but we do have some exceptions, like CAN interfaces (which can be accessed by multiple VMs at the same time) and now this use case, where a single GPU can be shared across multiple containers... but this is only about the device model; it has nothing to do with the CDI files...

@deitch (Contributor) commented on Sep 18, 2024

> Actually, that's how it's working for NVIDIA: we have CDI files for both Xavier and Orin boards in the same build. Inside each file we use different names for the device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin..

I was just thinking about this. How do you distinguish between them? Aren't the device names similar? Won't you have conflicts?

@deitch (Contributor) commented on Sep 18, 2024

> I don't know if I understood your question, but the main idea is that the CDI string works as a "hardware ID", the same way we specify PCIe bus addresses, network interface names (eth0, wlan0), etc., in the hardware device model. For this particular case we specify the CDI string, which points to the device described in the CDI file... This makes the "native container GPU passthrough" work transparently with the controller, so the user can pass through a GPU to a native container the same way they pass through a PCI video card on x86, for example

What I meant was, if we already define the CDI string inside the app instance, e.g. nvidia.com/gpus=0, then that is how containerd will take it and build the right config.json OCI spec for it, so why does pillar need to have access to it? What is that chunk of code doing?

@rene (Contributor, Author) commented on Sep 18, 2024

> > Actually, that's how it's working for NVIDIA: we have CDI files for both Xavier and Orin boards in the same build. Inside each file we use different names for the device description, so we have "nvidia.com/xavier-gpu" for Xavier and "nvidia.com/orin-gpu" for Orin..
>
> I was just thinking about this. How do you distinguish between them? Aren't the device names similar? Won't you have conflicts?

We can name these devices whatever we want. Originally, nvidia-ctk will always use "nvidia.com/gpu" during CDI generation; I just changed them to nvidia.com/xavier-gpu and nvidia.com/orin-gpu.
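For context, the stock flow looks roughly like this: nvidia-ctk cdi generate emits the nvidia.com/gpu kind, and the per-board rename is then just an edit of the generated file (the sed line is one possible way to do it, not necessarily how the pkg/nvidia build does it):

```bash
# Generate a CDI spec for the local GPU; nvidia-ctk uses the kind nvidia.com/gpu.
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Rename the kind per board so Xavier and Orin specs can coexist.
sed -i 's|nvidia.com/gpu|nvidia.com/orin-gpu|g' /etc/cdi/nvidia.yaml
```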

@rene (Contributor, Author) commented on Sep 18, 2024

> > I don't know if I understood your question, but the main idea is that the CDI string works as a "hardware ID", the same way we specify PCIe bus addresses, network interface names (eth0, wlan0), etc., in the hardware device model. For this particular case we specify the CDI string, which points to the device described in the CDI file... This makes the "native container GPU passthrough" work transparently with the controller, so the user can pass through a GPU to a native container the same way they pass through a PCI video card on x86, for example
>
> What I meant was, if we already define the CDI string inside the app instance, e.g. nvidia.com/gpus=0, then that is how containerd will take it and build the right config.json OCI spec for it, so why does pillar need to have access to it? What is that chunk of code doing?

We don't define the CDI string inside the Edge App; the Edge App should be like any regular Edge App, the only requirement being NO_HYPER as the virtualization mode. Then we pass a GPU I/O adapter through to this Edge App, like any regular PCI passthrough... The trick happens when we parse the I/O adapters from the device model and find the "cdi" attribute under "cbattr": for native containers (and only for native containers) we use this string as a CDI device and process it accordingly...

This approach makes the CDI solution 100% compatible with the current passthrough mechanism, and it requires no changes in either the API or the controller side...

@deitch (Contributor) commented on Sep 18, 2024

Ok, now that makes sense. So there is still a "translation" going on between "how the GPU appears in the EVE API" and "how the GPU is listed in the CDI files". The work in pillar is there to do that translation. Correct? Can we capture that in the docs?

@rene (Contributor, Author) commented on Sep 18, 2024

> Ok, now that makes sense. So there is still a "translation" going on between "how the GPU appears in the EVE API" and "how the GPU is listed in the CDI files". The work in pillar is there to do that translation. Correct? Can we capture that in the docs?

Correct. Ok, I will update the documentation...

Add proper documentation explaining the CDI support for bare-metal containers.

Signed-off-by: Renê de Souza Pinto <[email protected]>
@rene force-pushed the introduce-pillar-cdi branch from e99de15 to d275fb1 on September 18, 2024 15:32
@rene (Contributor, Author) commented on Sep 18, 2024

Updates in this PR:

  • Documentation updated.

@OhmSpectator merged commit 215fcb3 into lf-edge:master on Sep 19, 2024
38 checks passed
@rene deleted the introduce-pillar-cdi branch on October 11, 2024 10:00