Fix usage of --dashboard-address in dask-cuda-worker #487
Conversation
cc @lmeyerov
Codecov Report
@@              Coverage Diff               @@
##           branch-0.18     #487      +/-   ##
===============================================
+ Coverage        90.42%   90.69%    +0.27%
===============================================
  Files               15       15
  Lines             1128     1118       -10
===============================================
- Hits              1020     1014        -6
+ Misses             108      104        -4
Continue to review full report at Codecov.
Thanks @pentschev
Super cool, thank you! @pentschev I had a bit of difficulty understanding what this means for diagnostics/recovery/etc. I think what you're saying is:
So,
What I'm saying is that, by default, Dask-CUDA today relies solely on multiple processes (one per GPU) and a single host thread per process, which would be equivalent to launching dask-worker with the corresponding --nprocs/--nthreads options. Dask Distributed will attempt to use the same address passed via --dashboard-address. TBH, I don't know exactly what all the possible states you can read are.
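As a rough illustration of that default layout (my own sketch, not code from this PR), the plain-Dask equivalent of the Dask-CUDA process model, assuming a machine with two GPUs, would be something like:

```python
from distributed import Client, LocalCluster

# Plain-Dask analogue of Dask-CUDA's defaults: one worker process per GPU
# (assumed to be 2 here) and a single host thread per process. The
# dashboard_address value plays the same role as the CLI flag.
cluster = LocalCluster(
    n_workers=2,                # hypothetical GPU count
    threads_per_worker=1,       # single host thread per process
    processes=True,
    dashboard_address=":8787",  # address the diagnostics dashboard binds to
)
client = Client(cluster)
print(client.dashboard_link)
```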
Hmm, I'm going to investigate a bit on health.
I'm not sure what a decent one would be for a GPU worker.
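For what it's worth, a minimal sketch of what such a per-worker probe could look like (purely hypothetical, assuming numba is installed on the workers and a scheduler at the address below):

```python
from distributed import Client

def gpu_health():
    """Hypothetical health probe: touching the current CUDA context raises
    if the GPU or driver is unusable on this worker."""
    try:
        import numba.cuda
        numba.cuda.current_context()
        return "ok"
    except Exception as exc:
        return f"unhealthy: {exc}"

client = Client("tcp://scheduler:8786")  # assumed scheduler address
print(client.run(gpu_health))            # maps worker address -> status string
```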
It seems to me that Distributed is only doing that kind of check. As for the CUDA context, this is already done today in dask-cuda/dask_cuda/initialize.py, lines 106 to 110 (commit 32d9d33).
If that runs OK, it implies that a CUDA context already exists, so there's no need to make any changes in that regard. Regarding GPU libs corruption, this is definitely not something Dask-CUDA would check; it would be a job for the user's application to do so, although I must say I don't see any GPU libs corruption in my daily work.
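For readers without the snippet inline, what that part of initialize.py does is roughly the following (a paraphrase under my own assumptions, not the verbatim lines):

```python
import logging

import numba.cuda

logger = logging.getLogger(__name__)

try:
    # Force creation of a CUDA context on the currently selected device;
    # this fails fast on misconfigured drivers or otherwise unusable GPUs.
    numba.cuda.current_context()
except Exception:
    logger.error("Unable to create CUDA context", exc_info=True)
```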
Yeah, the relevant failures we regularly see here are:
- On creation, due to a misconfigured system (drivers, ...): the current initial CUDA context creation test handles these.
- Users triggering cudf/bsql/cugraph/etc. bugs that corrupt shared global app process state (ex: RMM). These are by nature hard to detect, as users upload weird-format data / settings that trigger all sorts of surprises. Notebook users just restart their kernels and tune this out, but it's trickier for software (a hypothetical probe for this case is sketched below).
The more I think about it, the more I'm thinking we'll need to do a
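If it came to that, one hypothetical way to catch the second class of failure would be a periodic probe task that exercises the shared GPU state (the use of RMM here and the helper name are my assumptions, not something this PR adds):

```python
def rmm_probe():
    """Hypothetical probe: a tiny RMM allocation fails if the process-wide
    GPU memory state has been corrupted."""
    try:
        import rmm
        rmm.DeviceBuffer(size=1)
        return "ok"
    except Exception as exc:
        return f"corrupted: {exc}"

# Run the probe on every worker; the application could then restart workers
# that do not report "ok" (e.g. via client.restart()).
results = client.run(rmm_probe)  # `client` as in the earlier sketch
```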
These sound strange to me, although plausible. ccing @kkraus14 in case he has ideas, and generally for awareness too.
You can think of
Fixes #486
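For reference, once workers are started with the corrected --dashboard-address handling, one way to confirm which dashboard port each worker actually bound to is via the scheduler info (a usage sketch, assuming a scheduler at the address below):

```python
from distributed import Client

client = Client("tcp://scheduler:8786")  # assumed scheduler address
for addr, worker in client.scheduler_info()["workers"].items():
    # Each worker advertises its services, including the dashboard port.
    print(addr, worker.get("services", {}).get("dashboard"))
```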