
No process and user information in output (for graphics application) #18

Closed
kapsh opened this issue Aug 13, 2017 · 10 comments


kapsh commented Aug 13, 2017

I get no process and user information when running gpustat with the -cpu flags.

gpustat 0.4.0.dev
% python -m gpustat -cpu                       
feline  Sun Aug 13 15:45:59 2017
[0] GeForce GTX 750 Ti | 48'C,   7 % |   213 /  1995 MB |

nvidia-smi v384.59 does show the processes, however. Here is the XML dump from it:
nvidia-smi_out_gtx750ti.xml.txt


wookayin commented Aug 13, 2017

Hi, can you post the output of the following command when it happens as well?

nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader,nounits

or the raw nvidia-smi output (the process part).

I presume your attachment is the XML output (nvidia-smi -q -x), which shows valid process information. Additionally, the output of ps will also be useful (for PID 3264 from your XML, for instance).

It has been reported that nvidia-smi sometimes doesn't retrieve the processes correctly.


kapsh commented Aug 13, 2017

Hello,
nvidia-smi shows nothing with the parameters you mentioned.
Output without parameters:

Sun Aug 13 17:06:02 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 16%   42C    P8     1W /  38W |    246MiB /  1995MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3110    G   X                                              130MiB |
|    0      3264    G   ...el-token=73D57F3FB25E828F1EA307823D70EF6D   112MiB |
+-----------------------------------------------------------------------------+

And here is ps for 3264:

UID        PID  PPID  C STIME TTY          TIME CMD
kapsh     3264  3185  0 Aug12 ?        00:06:15 /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=4198532887510418061,4204420223969125725,131072 --ignore-gpu-blacklist --enable-crash-reporter=831AB4A2-FDF7-4D2B-9BE0-F22377348434, --disable-breakpad --supports-dual-gpus=false --gpu-driver-bug-workarounds=7,20,24,44,53,57,66,77,88,90,96 --disable-gl-extensions=GL_KHR_blend_equation_advanced GL_KHR_blend_equation_advanced_coherent --gpu-vendor-id=0x10de --gpu-device-id=0x1380 --gpu-driver-vendor=Nvidia --gpu-driver-version=384.59 --gpu-driver-date --enable-crash-reporter=831AB4A2-FDF7-4D2B-9BE0-F22377348434, --service-request-channel-token=73D57F3FB25E828F1EA307823D70EF6D

wookayin commented:

I see. The command-line query mode we are currently using does indeed break sometimes. We are going to switch to another API for retrieving GPU and process information (#17), so I hope this issue will be fixed along with that.


kapsh commented Aug 13, 2017

Yeah, it looks like an nvidia-smi issue. I'll wait until the move to NVML is completed.

@wookayin wookayin added the bug label Aug 16, 2017

wookayin commented Sep 5, 2017

@kapsh Could you please test once again on your machine, now that #20 is merged to master?


kapsh commented Sep 6, 2017

Checked with current master: still no user or process information.

I ran your utility through a debugger and figured out that nv_processes = N.nvmlDeviceGetComputeRunningProcesses(handle) in gpustat.py:255 becomes an empty list.

def nvmlDeviceGetComputeRunningProcesses(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
    ret = fn(handle, byref(c_count), None)

    if (ret == NVML_SUCCESS):
        # special case, no running processes
        return []
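    # ... (excerpt; when processes do exist, the call instead returns
    # NVML_ERROR_INSUFFICIENT_SIZE and the rest of the function allocates
    # a buffer of c_count entries and calls fn again to fill it)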

Here, the underlying call in nvmlDeviceGetComputeRunningProcesses returns NVML_SUCCESS on my PC, so the special case is taken and an empty list comes back.
So I guess it is not a gpustat bug but something with the nvidia driver instead. This issue can be closed, I think.
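
For reference, the empty result can be reproduced directly against pynvml with a minimal sketch like the one below; the GPU index 0 is an assumption:

from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetComputeRunningProcesses)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)   # assumption: the GTX 750 Ti is GPU 0
    # With only graphics clients running (X, Chrome, glxgears), this prints []
    # because this query reports only CUDA/compute contexts.
    print(nvmlDeviceGetComputeRunningProcesses(handle))
finally:
    nvmlShutdown()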

@kapsh kapsh closed this as completed Sep 6, 2017

wookayin commented Sep 6, 2017

Thank you @kapsh for checking it out. It looks very weird; so my understanding is that nvml-py still cannot fetch a correct list of running processes, while nvidia-smi -q -x can. Am I correct?

I would rather keep this open. Although nvidia drivers, APIs, or something else may be responsible for this, I do believe there must be a good workaround to get the correct process information (as long as there is a way we can manage to get it). Maybe for old cards like the GTX 750 these APIs just don't work.

@wookayin wookayin reopened this Sep 6, 2017

kapsh commented Sep 6, 2017

I think I got it. It is a documented feature:

This function returns information only about compute running processes (e.g. CUDA application which have active context). Any graphics applications (e.g. using OpenGL, DirectX) won't be listed by this function.

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g46ceaea624d5c96e098e03c453419d68

I guess this covers only CUDA applications, and I was running glxgears for testing.
So I found another function in their API and did some weird patching like this:

diff --git a/gpustat.py b/gpustat.py
index 8b45fde..a5210e2 100755
--- a/gpustat.py
+++ b/gpustat.py
@@ -253,7 +253,8 @@ class GPUStatCollection(object):
             processes = []
             try:
-                nv_processes = N.nvmlDeviceGetComputeRunningProcesses(handle)
+                comp_processes = N.nvmlDeviceGetComputeRunningProcesses(handle)
+                graphics_processes = N.nvmlDeviceGetGraphicsRunningProcesses(handle)
                 # dict type is mutable
-                for nv_process in nv_processes:
+                for nv_process in comp_processes + graphics_processes:
                     #TODO: could be more information such as system memory usage,
                     # CPU percentage, create time etc.

And now I have this:

feline  Thu Sep  7 00:46:28 2017
[0] GeForce GTX 750 Ti | 46'C,   4 % |   257 /  1995 MB | kapsh/3278(170M) kapsh/16279(83M)
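
As a standalone illustration of what the patch does, a minimal pynvml sketch querying both lists might look like the following; the GPU index and the output formatting are assumptions, not gpustat's actual code:

from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetComputeRunningProcesses,
                    nvmlDeviceGetGraphicsRunningProcesses)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)                    # assumption: GPU index 0
    compute = nvmlDeviceGetComputeRunningProcesses(handle)    # CUDA contexts
    graphics = nvmlDeviceGetGraphicsRunningProcesses(handle)  # OpenGL/X clients
    for p in compute + graphics:
        # usedGpuMemory is reported in bytes; pynvml maps "N/A" to None
        mem_mib = (p.usedGpuMemory or 0) // (1024 * 1024)
        print("pid=%d used_gpu_memory=%dMiB" % (p.pid, mem_mib))
finally:
    nvmlShutdown()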

wookayin added a commit that referenced this issue Sep 11, 2017
Previously, only computing applications (e.g. CUDA) were shown
in the running process list. Conforming to the way nvidia-smi works
by default, graphics applications (e.g. OpenGL) are now also listed.

Contributed by @kapsh.
wookayin commented:

@kapsh Great catch! I have adopted your patch in 48a150f, please check it out! Thanks.


kapsh commented Sep 11, 2017

@wookayin Current master works as expected, thank you.

@kapsh kapsh closed this as completed Sep 11, 2017
@wookayin wookayin changed the title No process and user information in output No process and user information in output (for graphics application) Sep 17, 2017
@wookayin wookayin added this to the 0.4 milestone Nov 2, 2017
wookayin added a commit that referenced this issue May 4, 2020
A single process might appear in both the graphics and the compute running
process lists (#18). In such cases, the same process (same PID)
would appear twice. We fix this bug: list a process only once.
Stonesjtu added a commit that referenced this issue May 19, 2020
* Do not list the same GPU process more than once

A single process might appear in both the graphics and the compute running
process lists (#18). In such cases, the same process (same PID)
would appear twice. We fix this bug: list a process only once.

* Add comments for seen_pids

Co-authored-by: Kaiyu Shi <[email protected]>
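
The de-duplication described in the commit message amounts to remembering which PIDs have already been emitted. A minimal sketch, assuming a helper named merge_process_lists (the name and arguments are hypothetical, not gpustat's actual code):

def merge_process_lists(compute_processes, graphics_processes):
    """Combine both NVML process lists, listing each PID only once."""
    seen_pids = set()
    merged = []
    for nv_process in compute_processes + graphics_processes:
        if nv_process.pid in seen_pids:
            continue  # same PID already returned by the other query
        seen_pids.add(nv_process.pid)
        merged.append(nv_process)
    return merged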