On windows, error is raised if nvidia query cannot find a card #142

mattip · 2022-11-22T21:58:02Z

Describe the bug

python -m gpustat raises an error if the user is not an admin, or if the NVIDIA drivers are installed but no NVIDIA card is available.

Screenshots or Program Output

As a regular user:

>python -m gpustat --debug
Error on querying NVIDIA devices. Use --debug flag for details
Traceback (most recent call last):
  File "d:\temp\ray_venv\lib\site-packages\gpustat\cli.py", line 20, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug)
  File "d:\temp\ray_venv\lib\site-packages\gpustat\core.py", line 362, in new_query
    N.nvmlInit()
  File "d:\temp\ray_venv\lib\site-packages\pynvml.py", line 1450, in nvmlInit
    nvmlInitWithFlags(0)
  File "d:\temp\ray_venv\lib\site-packages\pynvml.py", line 1440, in nvmlInitWithFlags
    _nvmlCheckReturn(ret)
  File "d:\temp\ray_venv\lib\site-packages\pynvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NoPermission: Insufficient Permissions

As an admin:

Error on querying NVIDIA devices. Use --debug flag for details
Traceback (most recent call last):
  File "d:\temp\ray_venv\lib\site-packages\gpustat\cli.py", line 18, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug)
  File "d:\temp\ray_venv\lib\site-packages\gpustat\core.py", line 370, in new_query
    N.nvmlInit()
  File "d:\temp\ray_venv\lib\site-packages\pynvml.py", line 1450, in nvmlInit
    nvmlInitWithFlags(0)
  File "d:\temp\ray_venv\lib\site-packages\pynvml.py", line 1440, in nvmlInitWithFlags
    _nvmlCheckReturn(ret)
  File "d:\temp\ray_venv\lib\site-packages\pynvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_DriverNotLoaded: Driver Not Loaded

As a regular user:

>nvidia-smi
NVIDIA-SMI has failed because you are not:
        a) running as an administrator or
        b) there is not at least one TCC device in the system

As an admin:

>nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. This can also be happening if non-NVIDIA GPU is running as primary display, and NVIDIA GPU is in WDDM mode.

Environment information:

OS: Windows 10
NVIDIA Driver version: 11.7 (Edit: changed from 11.3 to 11.7)
The name(s) of GPU card: None
gpustat version: 1.1.0
pynvml version: nvidia-ml-py 11.495.46


wookayin · 2022-11-24T00:52:18Z

So in this case the NVIDIA driver cannot be found for some reason (although it exists?). gpustat prints "Error on querying NVIDIA devices. Use --debug flag for details", and nvidia-smi reports an error as well. I think both programs are working as expected (raising and printing errors). How do you expect gpustat to behave in this scenario?

mattip · 2022-11-24T05:07:01Z

This depends on how you view the purpose of the program. Since in practice the result is that the user cannot use a graphics card, I would personally expect gpustat to behave as if it could not find a graphics card rather than raise an error. That way a consumer of gpustat like ray would not need an error-catching mechanism to differentiate between "no available graphics card for the user" and "some internal problem with gpustat that is unexpected".

wookayin · 2022-11-24T06:00:07Z

I see; yes, we can definitely improve the error message depending on the type of error being thrown from nvmlInit().
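As a rough illustration of a message that depends on the error type, the sketch below looks up a hint by exception class name. The two class names come from the tracebacks in this issue; the helper name `describe_nvml_error`, the `HINTS` table, and the hint wording are hypothetical and are not gpustat's actual behavior.

```python
# Hints keyed by the exception class names seen in the tracebacks above.
# The hint wording is illustrative only, not gpustat's real output.
HINTS = {
    "NVMLError_NoPermission": "Try running as an administrator.",
    "NVMLError_DriverNotLoaded": (
        "Check that an NVIDIA card is present and its driver is running."
    ),
}


def describe_nvml_error(exc: BaseException) -> str:
    """Build an error message that depends on the concrete exception type."""
    parts = ["Error on querying NVIDIA devices."]
    detail = str(exc)
    if detail:
        parts.append(detail + ".")
    hint = HINTS.get(type(exc).__name__)
    if hint:
        parts.append(hint)
    return " ".join(parts)


# Hypothetical stand-in so the sketch runs without pynvml installed.
class NVMLError_NoPermission(Exception):
    pass


message = describe_nvml_error(NVMLError_NoPermission("Insufficient Permissions"))
print(message)
```

Keying on the class name keeps the sketch free of a pynvml import; real code would more likely use isinstance checks against the pynvml exception classes.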

wookayin · 2022-12-01T23:23:26Z

With 175c34b (in 1.1.dev), I think gpustat will now print the message of the exception that was thrown. For instance:

Error on querying NVIDIA devices. Use --debug flag to see more details.
Insufficient Permissions

Error on querying NVIDIA devices. Use --debug flag to see more details.
Driver Not Loaded

@mattip Would that suffice for you?

XuehaiPan · 2022-12-02T06:22:20Z

@wookayin, I think @mattip's use case is more about Python integration than the command-line tool. @mattip may want an extra flag for GPUStatCollection.new_query to return an empty collection if any NVML-related error is raised.

mattip · 2022-12-02T11:37:11Z

> extra flag for GPUStatCollection.new_query to return an empty collection

+1 that would be great if possible. 175c34b is already a nice improvement.

wookayin · 2022-12-02T18:37:45Z

I don't think gpustat.new_query() should return an empty collection when an error occurs on the NVML side. This is a clear case of an "exception" or "error". Although this is not explicitly documented and we don't have exception translation, NVMLError (or a subclass) will be thrown when an NVML-related error occurs while querying NVIDIA devices.
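A minimal sketch of the consumer-side pattern this implies: catching the NVMLError base class also catches its subclasses, while any other exception still propagates. The classes below are stand-ins that mirror the names in the tracebacks so the sketch runs without a driver, and `try_query` is a hypothetical helper, not part of gpustat.

```python
# Stand-ins mirroring the hierarchy named in this thread. With pynvml
# installed, the real classes would presumably be imported instead.
class NVMLError(Exception):
    pass


class NVMLError_NoPermission(NVMLError):
    pass


class NVMLError_DriverNotLoaded(NVMLError):
    pass


def try_query(new_query):
    """Return (result, None) on success or (None, error) on an NVML failure.

    Exceptions that are not NVMLError propagate to the caller unchanged.
    """
    try:
        return new_query(), None
    except NVMLError as exc:  # the base class also matches every subclass
        return None, exc


def no_permission():
    # Simulates nvmlInit() failing for a non-admin user.
    raise NVMLError_NoPermission("Insufficient Permissions")


result, error = try_query(no_permission)
print(type(error).__name__, "-", error)
```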

XuehaiPan · 2022-12-03T05:27:24Z

> I don't think gpustat.new_query() should return an empty collection when error happens from the NVML side.

Then how about new test functions such as is_available() and device_count() (or gpu_count())? They should return False or 0 if an NVML-related error occurs rather than raise. Test functions that fail silently reduce the number of try-except blocks and improve code readability. I also added similar functions to my nvitop: nvitop.Device.is_available(). It would be great to have similar things in gpustat.

Most mainstream ML frameworks provide those test functions for users so that they can test the GPU availability in the first place. For example:

❯ export CUDA_VISIBLE_DEVICES=''

❯ ipython3
Python 3.10.8 (main, Oct 11 2022, 11:35:05) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.cuda.is_available()
Out[2]: False

In [3]: torch.cuda.device_count()
Out[3]: 0

In [4]: import tensorflow as tf
2022-12-03 05:10:48.804143: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-03 05:10:49.453942: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/PanXuehai/.mujoco/mujoco210/bin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/PanXuehai/.local/lib
2022-12-03 05:10:49.454000: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/PanXuehai/.mujoco/mujoco210/bin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/PanXuehai/.local/lib
2022-12-03 05:10:49.454008: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

In [5]: tf.config.list_physical_devices('GPU')
2022-12-03 05:10:57.811047: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-12-03 05:10:57.811074: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: Precision-3630-Tower
2022-12-03 05:10:57.811081: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: Precision-3630-Tower
2022-12-03 05:10:57.811112: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 520.56.6
2022-12-03 05:10:57.811133: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 520.56.6
2022-12-03 05:10:57.811139: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 520.56.6
Out[5]: []

In [6]: from nvitop import Device

In [7]: Device.all()
Out[7]: [PhysicalDevice(index=0, name="NVIDIA GeForce RTX 3060", total_memory=12288MiB)]

In [8]: Device.count()
Out[8]: 1

In [9]: Device.is_available()
Out[9]: True

In [10]: Device.cuda.all()
Out[10]: []

In [11]: Device.cuda.count()
Out[11]: 0

In [12]: Device.cuda.is_available()
Out[12]: False
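
The non-raising test functions proposed in this comment could be sketched roughly as follows. The signatures are assumptions made for illustration (the query callable and the error types are passed in so the sketch runs without NVML) and are not necessarily those of the functions later merged in #145.

```python
from typing import Callable, Sized, Tuple, Type

ErrorTypes = Tuple[Type[BaseException], ...]


def gpu_count(new_query: Callable[[], Sized], nvml_errors: ErrorTypes) -> int:
    """Return the number of GPUs, or 0 if the query raises an NVML error."""
    try:
        return len(new_query())
    except nvml_errors:
        return 0


def is_available(new_query: Callable[[], Sized], nvml_errors: ErrorTypes) -> bool:
    """Return True only if at least one GPU can be queried."""
    return gpu_count(new_query, nvml_errors) > 0


# Hypothetical stand-in so the sketch runs on a machine without NVML.
class FakeNVMLError(Exception):
    pass


def driver_not_loaded():
    # Simulates the "Driver Not Loaded" failure from this issue.
    raise FakeNVMLError("Driver Not Loaded")


print(gpu_count(driver_not_loaded, (FakeNVMLError,)))     # 0
print(is_available(driver_not_loaded, (FakeNVMLError,)))  # False
```

Only the listed error types are swallowed, so an unrelated bug in the query still surfaces as an exception instead of being reported as "no GPU".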

mattip added the bug label Nov 22, 2022

mattip mentioned this issue Nov 22, 2022

Ray cannot access GPUs under a non-root user (failed access of ray.init() to root-owned /proc/driver/nvidia/gpus) ray-project/ray#28064

Closed

wookayin added enhancement and removed bug labels Nov 24, 2022

wookayin added this to the 1.1 milestone Nov 24, 2022

XuehaiPan mentioned this issue Dec 6, 2022

Add noexcept funcitons gpu_count and is_available #145

Merged

wookayin closed this as completed in #145 Mar 2, 2023

mattip mentioned this issue May 21, 2023

unify gpu checking around gpustat ray-project/ray#35581

Closed
