
Fix usage of --dashboard-address in dask-cuda-worker #487

Merged
3 commits merged into rapidsai:branch-0.18 on Jan 6, 2021

Conversation

pentschev
Member

Fixes #486

@pentschev pentschev requested a review from a team as a code owner January 5, 2021 16:08
@pentschev pentschev added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change labels Jan 5, 2021
@pentschev
Member Author

cc @lmeyerov

@codecov-io

codecov-io commented Jan 5, 2021

Codecov Report

Merging #487 (5c314ab) into branch-0.18 (32d9d33) will increase coverage by 0.27%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.18     #487      +/-   ##
===============================================
+ Coverage        90.42%   90.69%   +0.27%     
===============================================
  Files               15       15              
  Lines             1128     1118      -10     
===============================================
- Hits              1020     1014       -6     
+ Misses             108      104       -4     
Impacted Files             Coverage     Δ
dask_cuda/cuda_worker.py   79.22% <ø>   (+2.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 32d9d33...5c314ab.

@quasiben (Member) left a comment

Thanks @pentschev

@lmeyerov

lmeyerov commented Jan 5, 2021

Super cool, thank you!

@pentschev I had a bit of difficulty understanding what this means for diagnostics/recovery/etc. I think what you're saying is:

  • a cuda worker health check will test the cpu health of the process, but maybe not its cpu subprocesses/subthreads
  • a cuda worker health check will test gpu context health... but only for the first gpu managed by the worker? or doesn't at all?
  • so in 1:n (or 1:1 too?) settings, a true cuda worker health check needs to be done beyond /health, or by setting up a worker per gpu?

So,

  • near-term: We're prepping a RAPIDS 0.17-based release, so it sounds like for the interim we should create a manual per-worker-per-GPU test that tries to do something like a cudf task on each GPU. Not sure how; perhaps some sort of per-GPU variant of client.run?
  • RAPIDS 0.18+: when this patch lands, how would you recommend doing GPU health checks then?
  • dask: happy to file the issue, though based on ^^^, not sure what you're thinking

@pentschev
Member Author

pentschev commented Jan 5, 2021

What I'm saying is that by default a dask-worker will launch a single worker process with T threads (where T is the number of system threads available). You can instead launch P worker processes with dask-worker --nprocs P, where each process can have T threads by passing --nthreads T.

Dask-CUDA today relies solely on multiple processes (one per GPU) and a single host thread per process, which would be equivalent to launching dask-worker --nprocs G --nthreads 1 (where G is the number of GPUs in the system).
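
For illustration, a minimal sketch of that equivalence using the Python APIs (LocalCluster and LocalCUDACluster are the in-process counterparts of dask-worker and dask-cuda-worker; the worker and thread counts below are placeholders):

    from dask.distributed import Client, LocalCluster
    from dask_cuda import LocalCUDACluster

    # CPU-style layout: P worker processes with T threads each
    # (equivalent to `dask-worker --nprocs 4 --nthreads 2`)
    cpu_cluster = LocalCluster(n_workers=4, threads_per_worker=2)

    # Dask-CUDA layout: one single-threaded worker process per GPU
    # (equivalent to `dask-cuda-worker`, i.e. --nprocs G --nthreads 1)
    gpu_cluster = LocalCUDACluster()

    client = Client(gpu_cluster)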

Dask Distributed will attempt to use the same address passed via --dashboard-address for all worker processes. That effectively assigns the same port to every worker process, so all of them except the first one (which locked in that port) fail and fall back to generating a random port for their dashboards. I'm not sure what the correct way of tackling this in Distributed would be; perhaps the first port could be passed and then incremented by 1 for each additional worker process, but that will require some discussion in Distributed, followed by the work of actually implementing the chosen solution there. After that, Dask-CUDA will automatically rely on the new behavior Distributed implements.
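
As a rough sketch of what that looks like from the client side, one can ask the scheduler which dashboard port each worker actually ended up with (assuming the port is reported under "services" in the worker metadata; the scheduler address below is a placeholder):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address

    for addr, info in client.scheduler_info()["workers"].items():
        # With the behavior described above, only one worker keeps the
        # requested dashboard port; the others show randomly assigned ports.
        print(addr, info.get("services", {}).get("dashboard"))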

TBH, I don't know exactly what all the possible states you can read with /health are, but if it's only to check whether you have all the workers you're expecting before launching work, I've seen people (including myself) rely on a completely different solution: simply use wait_for_workers. Maybe that is enough for your use case; if you really need the /health information, then the only solution you'll have for now is to launch one dask-cuda-worker for each GPU you have, as I suggested in #486 (comment).
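
A minimal sketch of the wait_for_workers approach, assuming you know how many GPU workers to expect (both values below are placeholders):

    from dask.distributed import Client

    N_GPU_WORKERS = 8                        # placeholder: expected worker count
    client = Client("tcp://scheduler:8786")  # placeholder scheduler address

    # Blocks until all expected workers have connected to the scheduler.
    client.wait_for_workers(n_workers=N_GPU_WORKERS)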

@lmeyerov

lmeyerov commented Jan 5, 2021

Hmm, I'm going to investigate a bit on health:

I'm not sure what a decent one would be for a dask-cuda-worker representing physical resources. We often do app-level checks because GPU libs corrupt easily (cudf/bsql -> RMM), e.g. cudf.DataFrame({'x': [1,2,3]})['x'].sum() == 6. However, that's inappropriate for dask-cuda if the use case is instead something like BERT; maybe just check that CUDA contexts can be created. If the HTTP resource settings work, another alternative may be pushing this to the user level: /health for the CPU worker manager, and the user can add specific per-resource checks if they really care. In a similar spot in our app, we have a flag, resource/health?check_context=true, which does heavier checks, since we find our heavier CUDA checks add noticeable jitter, so we do those less frequently.

@pentschev
Member Author

It seems to me that Distributed is only doing the "/health: check server is alive" case, which apparently has the same meaning as the worker being connected to the scheduler (and thus alive). Dask-CUDA doesn't rewrite anything: each GPU worker is an instance of Nanny, which will return the same health condition as a regular Distributed Nanny/Worker would, so you can assume they're identical.
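
For reference, a hedged sketch of what such a check could look like from outside, assuming the worker's dashboard HTTP server is reachable at a known host and port (placeholders below) and that it answers /health with "ok":

    import requests

    WORKER_DASHBOARD = "http://worker-host:8787"  # placeholder host/port

    resp = requests.get(f"{WORKER_DASHBOARD}/health", timeout=5)
    healthy = resp.ok and resp.text.strip().lower() == "ok"
    print(healthy)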

As for the CUDA context, this is already done today in:

    if create_cuda_context:
        try:
            numba.cuda.current_context()
        except Exception:
            logger.error("Unable to start CUDA Context", exc_info=True)
So if you verify a Dask-CUDA worker is connected to the scheduler and/or its health status is OK, that implies a CUDA context already exists, thus there's no need to make any changes in that regard. Regarding GPU lib corruption, this is definitely not something Dask-CUDA would check; it would be a job for the user's application to do so, although I must say I don't see any GPU lib corruption in my daily work.
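
If one nevertheless wants an explicit confirmation from the client side, here is a hedged sketch mirroring the snippet above (the scheduler address is a placeholder; note that numba.cuda.current_context() will itself create a context if none exists yet):

    from dask.distributed import Client

    def can_get_cuda_context():
        # Runs inside each worker process and reports whether a CUDA context
        # can be obtained on that process's GPU.
        import numba.cuda
        try:
            return numba.cuda.current_context() is not None
        except Exception:
            return False

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address
    print(client.run(can_get_cuda_context))  # {worker_address: True/False, ...}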

@lmeyerov

lmeyerov commented Jan 5, 2021

Yeah the relevant failures we regularly see here are:

-- On creation due to misconfigured system (drivers, ..): current initial cuda context create test handles these

-- Users triggering cudf/bsql/cugraph/etc. bugs that corrupt shared global app process state (ex: RMM). These are by nature hard to detect as users upload weird format data / settings that trigger all sorts of surprises. Notebook users just restart their kernels and tune this out, but trickier for sw.

The more I think about it, the more I'm thinking we'll need to do a client.run(custom_rapids_checker) => test cudf + bsql. The unclear part will be having it handle a dask-cuda-worker with multiple GPUs. Would client.run() run on each GPU of a multi-GPU worker, or is there another way to do it?

@pentschev
Member Author

> -- On creation due to misconfigured system (drivers, ..): current initial cuda context create test handles these

The dask-cuda-worker will fail to initialize on a GPU where the CUDA context fails to be created, so that may give you a hint already if you can parse its output.

> -- Users triggering cudf/bsql/cugraph/etc. bugs that corrupt shared global app process state (ex: RMM). These are by nature hard to detect as users upload weird format data / settings that trigger all sorts of surprises. Notebook users just restart their kernels and tune this out, but trickier for sw.

These sound strange to me, although plausible. cc'ing @kkraus14 in case he has ideas, and generally for awareness too.

> The more I think about it, the more I'm thinking we'll need to do a client.run(custom_rapids_checker) => test cudf + bsql. The unclear part will be having it handle a dask-cuda-worker with multiple GPUs. Would client.run() run on each GPU of a multi-GPU worker, or is there another way to do it?

You can think of dask-cuda-worker as a parent process that will spawn one new process per GPU; the parent process serves only as a spawning mechanism. RAPIDS uses a one-worker-per-GPU model, meaning we'll always address only GPU 0 in each process: every Dask-CUDA worker that gets spawned receives a different CUDA_VISIBLE_DEVICES variable, where the GPU of interest appears first in that list. Therefore, client.run() will execute on each Dask worker process, and if you address GPU 0 during the client.run call you'll always be targeting the correct GPU for that process.
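
Building on that, a hedged sketch of the custom checker idea from earlier in the thread (the function name and scheduler address are placeholders; the cudf sum check is the example mentioned above):

    from dask.distributed import Client

    def rapids_health_check():
        # Runs once per Dask-CUDA worker process; each process sees its own GPU
        # as device 0 (via CUDA_VISIBLE_DEVICES), so a plain cudf call already
        # targets the right GPU for that process.
        import cudf
        return int(cudf.DataFrame({"x": [1, 2, 3]})["x"].sum()) == 6

    client = Client("tcp://scheduler:8786")    # placeholder scheduler address
    results = client.run(rapids_health_check)  # {worker_address: bool, ...}
    assert all(results.values()), f"unhealthy workers: {results}"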

@pentschev pentschev added 6 - Okay to Auto-Merge and removed 3 - Ready for Review Ready for review by team labels Jan 6, 2021
@rapids-bot rapids-bot bot merged commit b936748 into rapidsai:branch-0.18 Jan 6, 2021
@pentschev pentschev deleted the fix-dashboard-address branch April 23, 2021 21:47
Labels
bug Something isn't working non-breaking Non-breaking change

Successfully merging this pull request may close these issues.

[BUG] respect dask-cuda-worker --dashboard-address in order to support worker healthchecks
4 participants