
Fix usage of --dashboard-address in dask-cuda-worker #487

Merged
3 commits merged into rapidsai:branch-0.18 on Jan 6, 2021

Conversation

pentschev
Member

Fixes #486

@pentschev pentschev requested a review from a team as a code owner January 5, 2021 16:08
@pentschev pentschev added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change labels Jan 5, 2021
@pentschev
Member Author

cc @lmeyerov

@codecov-io

codecov-io commented Jan 5, 2021

Codecov Report

Merging #487 (5c314ab) into branch-0.18 (32d9d33) will increase coverage by 0.27%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.18     #487      +/-   ##
===============================================
+ Coverage        90.42%   90.69%   +0.27%     
===============================================
  Files               15       15              
  Lines             1128     1118      -10     
===============================================
- Hits              1020     1014       -6     
+ Misses             108      104       -4     
Impacted Files             Coverage     Δ
dask_cuda/cuda_worker.py   79.22% <ø>   (+2.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 32d9d33...5c314ab.

@quasiben (Member) left a comment

Thanks @pentschev

@lmeyerov

lmeyerov commented Jan 5, 2021

Super cool, thank you!

@pentschev I had a bit of difficulty understanding what this means for diagnostics/recovery/etc. I think what you're saying is:

  • a cuda worker health check will test the cpu health of the process, but maybe not its cpu subprocesses/subthreads
  • a cuda worker health check will test gpu context health... but only for the first gpu managed by the worker? or doesn't at all?
  • so in 1:n (or 1:1 too?) settings, a true cuda worker health check needs to be done beyond /health, or by setting up a worker per gpu?

So,

  • near-term: We're prepping a RAPIDS 0.17-based release, so it sounds like for the interim we should create a manual per-worker-per-GPU test that tries to do something like a cudf task on each GPU. Not sure how; perhaps some sort of per-GPU variant of client.run?
  • RAPIDS 0.18+: when this patch lands, how would you recommend doing GPU health checks then?
  • dask: happy to file the issue, though based on ^^^, not sure what you're thinking

@pentschev
Member Author

pentschev commented Jan 5, 2021

What I'm saying is that by default a dask-worker will launch a single worker process with T threads (where T is the number of system threads available). You can instead launch P worker processes with dask-worker --nprocs P, where each process can have T threads by passing --nthreads T.

Dask-CUDA today relies solely on multiple processes (one per GPU) and a single host thread per process, which would be equivalent to launching dask-worker --nprocs G --nthreads 1 (where G is the number of GPUs in the system).
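
For illustration, a minimal sketch of that equivalence using the Python APIs (LocalCluster and LocalCUDACluster are the in-process counterparts of dask-worker and dask-cuda-worker; the worker and thread counts below are placeholders):

    from dask.distributed import Client, LocalCluster
    from dask_cuda import LocalCUDACluster

    # CPU-style layout: P worker processes with T threads each
    # (equivalent to `dask-worker --nprocs 4 --nthreads 2`)
    cpu_cluster = LocalCluster(n_workers=4, threads_per_worker=2)

    # Dask-CUDA layout: one single-threaded worker process per GPU
    # (equivalent to `dask-cuda-worker`, i.e. --nprocs G --nthreads 1)
    gpu_cluster = LocalCUDACluster()

    client = Client(gpu_cluster)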

Dask Distributed will attempt to use the same address passed via --dashboard-address for all worker processes. That effectively assigns the same port to every worker process, so all of them except the first one (which locked in that port) fail and fall back to generating a random port for their dashboards. I'm not sure what the correct way of tackling this in Distributed would be; perhaps the first port could be passed and then incremented by 1 for each additional worker process, but that will require some discussion in Distributed, followed by the work of actually implementing the chosen solution there. After that, Dask-CUDA will automatically rely on the new behavior Distributed implements.
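
As a rough sketch of what that looks like from the client side, one can ask the scheduler which dashboard port each worker actually ended up with (assuming the port is reported under "services" in the worker metadata; the scheduler address below is a placeholder):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address

    for addr, info in client.scheduler_info()["workers"].items():
        # With the behavior described above, only one worker keeps the
        # requested dashboard port; the others show randomly assigned ports.
        print(addr, info.get("services", {}).get("dashboard"))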

TBH, I don't know exactly what all the possible states you can read with /health are, but if it's only to check whether you have all the workers you're expecting before launching work, I've seen people (including myself) rely on a completely different solution: simply use wait_for_workers. Maybe that is enough for your use case; if you really need the /health information, then the only solution you'll have for now is to launch one dask-cuda-worker for each GPU you have, as I suggested in #486 (comment).
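
A minimal sketch of the wait_for_workers approach, assuming you know how many GPU workers to expect (both values below are placeholders):

    from dask.distributed import Client

    N_GPU_WORKERS = 8                        # placeholder: expected worker count
    client = Client("tcp://scheduler:8786")  # placeholder scheduler address

    # Blocks until all expected workers have connected to the scheduler.
    client.wait_for_workers(n_workers=N_GPU_WORKERS)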

@lmeyerov

lmeyerov commented Jan 5, 2021

Hmm, I'm going to investigate a bit on health:

I'm not sure what a decent one would be for a dask-cuda-worker representing physical resources. We often do app-level checks because GPU libs corrupt easily (cudf/bsql -> RMM), e.g. cudf.DataFrame({'x': [1,2,3]})['x'].sum() == 6. However, that's inappropriate for dask-cuda if the use case is instead something like BERT; maybe just check that CUDA contexts can be created. If the HTTP resource settings work, another alternative may be pushing this to the user level: /health for the CPU worker manager, and the user can add specific per-resource checks if they really care. In a similar spot in our app, we have a flag, resource/health?check_context=true, which does heavier checks, since we find our heavier CUDA checks add noticeable jitter, so we do those less frequently.

@pentschev
Member Author

It seems to me that Distributed is only doing the "/health: check server is alive" case, which apparently has the same meaning as the worker being connected to the scheduler (and thus alive). Dask-CUDA doesn't rewrite anything: each GPU worker is an instance of Nanny, which will return the same health condition as a regular Distributed Nanny/Worker would, so you can assume they're identical.
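
For reference, a hedged sketch of what such a check could look like from outside, assuming the worker's dashboard HTTP server is reachable at a known host and port (placeholders below) and that it answers /health with "ok":

    import requests

    WORKER_DASHBOARD = "http://worker-host:8787"  # placeholder host/port

    resp = requests.get(f"{WORKER_DASHBOARD}/health", timeout=5)
    healthy = resp.ok and resp.text.strip().lower() == "ok"
    print(healthy)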

As for the CUDA context, this is already done today in:

    if create_cuda_context:
        try:
            numba.cuda.current_context()
        except Exception:
            logger.error("Unable to start CUDA Context", exc_info=True)
So if you verify a Dask-CUDA worker is connected to the scheduler and/or its health status is OK, that implies a CUDA context already exists, thus there's no need to make any changes in that regard. Regarding GPU lib corruption, this is definitely not something Dask-CUDA would check; it would be a job for the user's application to do so, although I must say I don't see any GPU lib corruption in my daily work.
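
If one nevertheless wants an explicit confirmation from the client side, here is a hedged sketch mirroring the snippet above (the scheduler address is a placeholder; note that numba.cuda.current_context() will itself create a context if none exists yet):

    from dask.distributed import Client

    def can_get_cuda_context():
        # Runs inside each worker process and reports whether a CUDA context
        # can be obtained on that process's GPU.
        import numba.cuda
        try:
            return numba.cuda.current_context() is not None
        except Exception:
            return False

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address
    print(client.run(can_get_cuda_context))  # {worker_address: True/False, ...}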

@lmeyerov

lmeyerov commented Jan 5, 2021

Yeah the relevant failures we regularly see here are:

-- On creation due to misconfigured system (drivers, ..): current initial cuda context create test handles these

-- Users triggering cudf/bsql/cugraph/etc. bugs that corrupt shared global app process state (ex: RMM). These are by nature hard to detect as users upload weird format data / settings that trigger all sorts of surprises. Notebook users just restart their kernels and tune this out, but trickier for sw.

The more I think about it, the more I'm thinking we'll need to do a client.run(custom_rapids_checker) => test cudf + bsql. The unclear part will be having it handle a dask-cuda-worker with multiple GPUs. Would client.run() run on each GPU of a multi-GPU worker, or is there another way to do it?

@pentschev
Member Author

> -- On creation due to misconfigured system (drivers, ..): current initial cuda context create test handles these

The dask-cuda-worker will fail to initialize on a GPU where the CUDA context fails to be created, so that may give you a hint already if you can parse its output.

> -- Users triggering cudf/bsql/cugraph/etc. bugs that corrupt shared global app process state (ex: RMM). These are by nature hard to detect as users upload weird format data / settings that trigger all sorts of surprises. Notebook users just restart their kernels and tune this out, but trickier for sw.

These sound strange to me, although plausible. cc'ing @kkraus14 in case he has ideas, and generally for awareness too.

> The more I think about it, the more I'm thinking we'll need to do a client.run(custom_rapids_checker) => test cudf + bsql. The unclear part will be having it handle a dask-cuda-worker with multiple GPUs. Would client.run() run on each GPU of a multi-GPU worker, or is there another way to do it?

You can think of dask-cuda-worker as a parent process that will spawn one new process per GPU; the parent process serves only as a spawning mechanism. RAPIDS uses a one-worker-per-GPU model, meaning we'll always address only GPU 0 in each process: every Dask-CUDA worker that gets spawned receives a different CUDA_VISIBLE_DEVICES variable, where the GPU of interest appears first in that list. Therefore, client.run() will execute on each Dask worker process, and if you address GPU 0 during the client.run call you'll always be targeting the correct GPU for that process.
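
Building on that, a hedged sketch of the custom checker idea from earlier in the thread (the function name and scheduler address are placeholders; the cudf sum check is the example mentioned above):

    from dask.distributed import Client

    def rapids_health_check():
        # Runs once per Dask-CUDA worker process; each process sees its own GPU
        # as device 0 (via CUDA_VISIBLE_DEVICES), so a plain cudf call already
        # targets the right GPU for that process.
        import cudf
        return int(cudf.DataFrame({"x": [1, 2, 3]})["x"].sum()) == 6

    client = Client("tcp://scheduler:8786")    # placeholder scheduler address
    results = client.run(rapids_health_check)  # {worker_address: bool, ...}
    assert all(results.values()), f"unhealthy workers: {results}"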

@pentschev pentschev added 6 - Okay to Auto-Merge and removed 3 - Ready for Review Ready for review by team labels Jan 6, 2021
@rapids-bot rapids-bot bot merged commit b936748 into rapidsai:branch-0.18 Jan 6, 2021
@pentschev pentschev deleted the fix-dashboard-address branch April 23, 2021 21:47
Labels
bug Something isn't working non-breaking Non-breaking change

Successfully merging this pull request may close these issues.

[BUG] respect dask-cuda-worker --dashboard-address in order to support worker healthchecks
4 participants