On windows, error is raised if nvidia query cannot find a card #142
So in this case, the NVIDIA drivers cannot be found (although they exist?) for some reason, and gpustat reports an error.
This depends on how you view the purpose of the program. Since in practice the result is that the user cannot use a graphics card, I would personally expect gpustat to behave as if it could not find a graphics card, rather than raising an error. That way a consumer of gpustat, such as ray, would not need an error-catching mechanism to differentiate between "no graphics card available to the user" and "some unexpected internal problem in gpustat".
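For illustration, the consumer-side pattern described here could look like the following minimal sketch. Note that `query_gpus` is a hypothetical stand-in for a gpustat query call (not gpustat's actual API), simulating the failure mode where the NVIDIA driver / NVML library cannot be loaded:

```python
# Sketch: a consumer (e.g. ray) treating any driver/NVML failure as
# "zero GPUs available" instead of propagating an exception.
# query_gpus() is a hypothetical stand-in, not gpustat's real API.

def query_gpus():
    # Simulate the failure mode: no usable NVIDIA driver on this machine.
    raise RuntimeError("NVML Shared Library Not Found")

def available_gpu_count():
    """Treat any query failure as "no GPUs available" rather than an error."""
    try:
        return len(query_gpus())
    except Exception:
        return 0

print(available_gpu_count())  # prints 0 on a machine without a working driver
```

The design trade-off is exactly the one debated above: collapsing all failures to zero hides genuinely unexpected internal errors, which is why distinguishing error types in gpustat itself is preferable.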
I see; yes, we can definitely improve the error message depending on the type of error being thrown.
With 175c34b (in 1.1.dev), I think gpustat will print the exception messages being thrown. For instance,
@mattip Would that suffice for you?
+1 that would be great if possible. 175c34b is already a nice improvement.
I don't think
Then, how about adding new test functions? Most mainstream ML frameworks provide such test functions so that users can check GPU availability in the first place. For example:
❯ export CUDA_VISIBLE_DEVICES=''
❯ ipython3
Python 3.10.8 (main, Oct 11 2022, 11:35:05) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: False
In [3]: torch.cuda.device_count()
Out[3]: 0
In [4]: import tensorflow as tf
2022-12-03 05:10:48.804143: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-03 05:10:49.453942: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/PanXuehai/.mujoco/mujoco210/bin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/PanXuehai/.local/lib
2022-12-03 05:10:49.454000: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/PanXuehai/.mujoco/mujoco210/bin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/PanXuehai/.local/lib
2022-12-03 05:10:49.454008: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
In [5]: tf.config.list_physical_devices('GPU')
2022-12-03 05:10:57.811047: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-12-03 05:10:57.811074: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: Precision-3630-Tower
2022-12-03 05:10:57.811081: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: Precision-3630-Tower
2022-12-03 05:10:57.811112: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 520.56.6
2022-12-03 05:10:57.811133: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 520.56.6
2022-12-03 05:10:57.811139: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 520.56.6
Out[5]: []
In [6]: from nvitop import Device
In [7]: Device.all()
Out[7]: [PhysicalDevice(index=0, name="NVIDIA GeForce RTX 3060", total_memory=12288MiB)]
In [8]: Device.count()
Out[8]: 1
In [9]: Device.is_available()
Out[9]: True
In [10]: Device.cuda.all()
Out[10]: []
In [11]: Device.cuda.count()
Out[11]: 0
In [12]: Device.cuda.is_available()
Out[12]: False
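The distinction in the session above (one physical device, but zero CUDA-visible devices) comes from `CUDA_VISIBLE_DEVICES` filtering. The following is a simplified sketch of that filtering, assuming integer indices only; the real CUDA runtime also accepts GPU UUIDs and MIG identifiers:

```python
import os

def cuda_visible_indices(physical_count):
    """Simplified sketch of CUDA_VISIBLE_DEVICES filtering (integer indices only).

    Models only the behaviour shown in the session above; the real CUDA
    runtime also accepts UUIDs and MIG identifiers.
    """
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is None:
        # Unset: all physical devices are visible.
        return list(range(physical_count))
    visible = []
    for token in env.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            idx = int(token)
        except ValueError:
            break  # CUDA stops processing at the first invalid entry
        if 0 <= idx < physical_count:
            visible.append(idx)
        else:
            break  # ...and at the first out-of-range index
    return visible

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # same setting as the session above
print(cuda_visible_indices(1))           # [] -> cuda.count() == 0
```

This is why `Device.count()` reports 1 while `Device.cuda.count()` reports 0: the physical query bypasses the environment variable, while the CUDA-visible query respects it.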
Describe the bug
python -m gpustat errors if the user is not an admin, or if the NVIDIA drivers are installed but no NVIDIA card is available.
Screenshots or Program Output
Please provide the output of gpustat --debug and nvidia-smi, or attach screenshots if applicable.
As a regular user:
As an admin:
Environment information:
Additional context
Add any other context about the problem here.