nvproxy: add ioctl `NV_CONF_COMPUTE_CTRL_CMD_GPU_GET_KEY_ROTATION_STATE` #10824

derpsteb · 2024-08-27T09:43:43Z

Hey,

this adds a missing ioctl required to run workloads on H100s with CC mode on.
I couldn't find the respective ioctl in any supported driver version prior to 550.90.07, hence I added it only to that version's ABI.

Without this patch the following example crashes:

$ docker run --runtime=runsc --gpus=all pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime python -c "import torch; torch.cuda.init()"

The error is:

Traceback (most recent call last):
  File "/test.py", line 3, in <module>
    torch.cuda.init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 260, in init
    _lazy_init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

At the same time, gVisor's debug logs show `nvproxy: unknown control command 0xcb33010c`.
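For readers unfamiliar with how such a command number is put together, here is a rough sketch that splits the logged value into its fields. It assumes a plain layout of a 16-bit class, an 8-bit category, and an 8-bit index; the real driver headers may reserve some of the category bits for flags, so treat the split as illustrative. The type and function names are made up for this example and are not nvproxy's.

```go
package main

import "fmt"

// ctrlCmd holds the fields of a control command number.
// Assumed layout (not taken from nvproxy): class in bits 31:16,
// category in bits 15:8, index in bits 7:0.
type ctrlCmd struct {
	Class    uint16
	Category uint8
	Index    uint8
}

// decodeCtrlCmd splits a 32-bit command number under the assumed layout.
func decodeCtrlCmd(cmd uint32) ctrlCmd {
	return ctrlCmd{
		Class:    uint16(cmd >> 16),
		Category: uint8(cmd >> 8),
		Index:    uint8(cmd),
	}
}

func main() {
	// The value reported in the debug log above.
	c := decodeCtrlCmd(0xcb33010c)
	fmt.Printf("class=%#x category=%#x index=%#x\n", c.Class, c.Category, c.Index)
}
```

Under that assumption the logged command belongs to class `0xcb33`, which appears to be the confidential computing class, matching the `NV_CONF_COMPUTE_` prefix of the ioctl name.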

Signed-off-by: Otto Bittner <[email protected]>

ayushr2 · 2024-08-27T15:42:00Z

NV_CONF_COMPUTE_CTRL_CMD_GPU_GET_KEY_ROTATION_STATE was added in 550.90.07, removed in 555.42.02, and then added back in 560.28.03.

So adding it in 550.90.07 would be the right thing to do for now.
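The version history above can be written down as a small predicate. This is only a sketch of the reasoning, not how nvproxy registers ioctls per driver ABI: `driverVersion`, `less`, and `supportsKeyRotationState` are hypothetical names, and the code assumes that driver versions order linearly by their three numeric components.

```go
package main

import "fmt"

// driverVersion is a hypothetical three-part driver version, e.g. 550.90.07.
type driverVersion struct{ major, minor, patch int }

// less reports whether v sorts before o, comparing component by component.
func (v driverVersion) less(o driverVersion) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.patch < o.patch
}

// supportsKeyRotationState encodes the history from the comment above:
// added in 550.90.07, removed in 555.42.02, added back in 560.28.03.
func supportsKeyRotationState(v driverVersion) bool {
	added := driverVersion{550, 90, 7}
	removed := driverVersion{555, 42, 2}
	readded := driverVersion{560, 28, 3}
	if v.less(added) {
		return false // predates the ioctl
	}
	if v.less(removed) {
		return true // first window in which the ioctl exists
	}
	return !v.less(readded) // gone until it is added back
}

func main() {
	for _, v := range []driverVersion{{550, 90, 7}, {555, 42, 2}, {560, 28, 3}} {
		fmt.Printf("%d.%d.%02d: %v\n", v.major, v.minor, v.patch, supportsKeyRotationState(v))
	}
}
```

Written this way, the gap between 555.42.02 and 560.28.03 is explicit, which is why a later ABI cannot simply inherit the command from 550.90.07 without checking.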

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10824 from derpsteb:ob/key-rotation 960c2d0
PiperOrigin-RevId: 668003601

nvproxy: add key-rotation ioctl

960c2d0

Signed-off-by: Otto Bittner <[email protected]>

derpsteb force-pushed the ob/key-rotation branch from 4533462 to 960c2d0 on August 27, 2024 09:44

ayushr2 approved these changes Aug 27, 2024

ayushr2 added the ready to pull label Aug 27, 2024

copybara-service bot mentioned this pull request Aug 27, 2024

nvproxy: add ioctl NV_CONF_COMPUTE_CTRL_CMD_GPU_GET_KEY_ROTATION_STATE #10828

Closed

copybara-service bot merged commit f02d783 into google:master Aug 27, 2024
3 checks passed
