nvproxy: add ioctl NV_CONF_COMPUTE_CTRL_CMD_GPU_GET_KEY_ROTATION_STATE #10824

Merged
merged 1 commit into from
Aug 27, 2024

Conversation

derpsteb
Contributor

Hey,

This adds a missing ioctl that is required to run workloads on H100s with CC (confidential computing) mode on.
I couldn't find this ioctl in any supported driver version prior to 550.90.07, so I added it only to that version's ABI.
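
To illustrate what "added it only to that version's ABI" means in practice, here is a minimal sketch of a per-driver-version allowlist of control commands. This is not nvproxy's actual code: the `abi` type, the `abis` map, and `check` are hypothetical names, and `550.54.15` merely stands in for some earlier driver version. Only the command value `0xcb33010c` and the version `550.90.07` come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ctrlCmd is a control command ID as it appears in the debug log.
type ctrlCmd uint32

// Value taken from the log line quoted in this PR description.
const getKeyRotationState ctrlCmd = 0xcb33010c

// abi is a hypothetical per-driver-version allowlist of control commands.
type abi struct {
	allowed map[ctrlCmd]struct{}
}

var errUnknownCmd = errors.New("unknown control command")

// check returns an error for any command that is not in the allowlist.
func (a abi) check(cmd ctrlCmd) error {
	if _, ok := a.allowed[cmd]; !ok {
		return fmt.Errorf("%w %#x", errUnknownCmd, uint32(cmd))
	}
	return nil
}

// abis maps driver version strings to their allowlists (illustrative only).
var abis = map[string]abi{
	"550.54.15": {allowed: map[ctrlCmd]struct{}{}},
	"550.90.07": {allowed: map[ctrlCmd]struct{}{getKeyRotationState: {}}},
}

func main() {
	for _, v := range []string{"550.54.15", "550.90.07"} {
		fmt.Printf("%s: %v\n", v, abis[v].check(getKeyRotationState))
	}
}
```

In this sketch, a command that is missing from the running driver version's allowlist is rejected rather than forwarded, which is the kind of failure the patch removes for 550.90.07.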

Without this patch, the following example fails:

$ docker run --runtime=runsc --gpus=all pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime python -c "import torch; torch.cuda.init()"

The error is:

Traceback (most recent call last):
  File "/test.py", line 3, in <module>
    torch.cuda.init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 260, in init
    _lazy_init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

At the same time, gVisor's debug logs show: nvproxy: unknown control command 0xcb33010c.
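
For anyone matching the logged value against the driver headers, the sketch below splits the command ID into its fields. It assumes the layout where bits 31:16 are the class, bits 15:8 the category, and bits 7:0 the index; that layout is my assumption and should be checked against the headers of the driver version in use. `splitCtrlCmd` is a hypothetical helper, not part of nvproxy.

```go
package main

import "fmt"

// splitCtrlCmd breaks a control command ID into three fields.
// Assumed layout: bits 31:16 class, bits 15:8 category, bits 7:0 index.
func splitCtrlCmd(cmd uint32) (class uint16, category, index uint8) {
	return uint16(cmd >> 16), uint8(cmd >> 8), uint8(cmd)
}

func main() {
	class, category, index := splitCtrlCmd(0xcb33010c)
	fmt.Printf("class=%#x category=%#x index=%#x\n", class, category, index)
	// prints: class=0xcb33 category=0x1 index=0xc
}
```

Under that assumption, the class field `0xcb33` is the part to look up when locating the header that defines `NV_CONF_COMPUTE_CTRL_CMD_GPU_GET_KEY_ROTATION_STATE`.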

@ayushr2
Collaborator

ayushr2 commented Aug 27, 2024

NV_CONF_COMPUTE_CTRL_CMD_GPU_GET_KEY_ROTATION_STATE was added in 550.90.07, removed in 555.42.02, and then added back in 560.28.03.

So adding it in 550.90.07 is the right thing to do for now.
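
The availability described above can be written down as a small predicate. This is only a sketch: the `version` type and `hasKeyRotationState` are hypothetical, and it assumes driver versions can be compared purely by their numeric components, which may not hold across NVIDIA's release branches. The three version numbers are the ones quoted in this comment.

```go
package main

import "fmt"

// version is a driver version such as 550.90.07.
type version struct{ major, minor, patch int }

// less reports whether v sorts before o, comparing component by component.
func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.patch < o.patch
}

var (
	added   = version{550, 90, 7} // command first appears
	removed = version{555, 42, 2} // command is dropped
	readded = version{560, 28, 3} // command comes back
)

// hasKeyRotationState reports whether the command exists in version v,
// under the numeric-ordering assumption stated in the lead-in.
func hasKeyRotationState(v version) bool {
	if v.less(added) {
		return false
	}
	if v.less(removed) {
		return true
	}
	return !v.less(readded)
}

func main() {
	for _, v := range []version{added, removed, readded} {
		fmt.Printf("%d.%d.%02d: %v\n", v.major, v.minor, v.patch, hasKeyRotationState(v))
	}
}
```

The gap between 555.42.02 and 560.28.03 is why the command cannot simply be inherited by every version after 550.90.07.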

copybara-service bot pushed a commit that referenced this pull request Aug 27, 2024
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10824 from derpsteb:ob/key-rotation 960c2d0
PiperOrigin-RevId: 668003601
@copybara-service copybara-service bot merged commit f02d783 into google:master Aug 27, 2024
3 checks passed