Raise if expected workers are not alive in xgboost.dask.train #9421
Conversation
LGTM
Thank you for raising the issue; some questions are in the comments below.
python-package/xgboost/dask.py (Outdated)
@@ -907,6 +905,13 @@ def _filter_empty(
        raise ValueError("None of the workers can provide a valid result.")


def _check_workers_are_alive(workers: List[str], client: "distributed.Client") -> None:
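For context, here is a minimal sketch of what such a check could look like when comparing against the client's cached cluster view. This is illustrative only, not the PR's actual implementation; the error message and the use of client.scheduler_info() are assumptions.

```python
from typing import List

import distributed


def _check_workers_are_alive(workers: List[str], client: "distributed.Client") -> None:
    # Compare the worker addresses expected to hold DaskDMatrix partitions
    # against the client's cached view of the cluster, and fail fast instead
    # of waiting on a worker that may never come back.
    current_workers = set(client.scheduler_info()["workers"].keys())
    missing_workers = set(workers) - current_workers
    if missing_workers:
        raise RuntimeError(f"Expected workers are not alive: {sorted(missing_workers)}")
```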
Not entirely sure what replacing the wait with a check does on some clusters. The wait was there for some cluster types like GKE, where a worker can take some time to get in sync with the scheduler, and client.scheduler_info()["workers"] was not particularly reliable.
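For reference, the wait-versus-check trade-off being discussed can be illustrated with the public dask.distributed API. Whether the removed wait corresponded to Client.wait_for_workers is an assumption; the scheduler address and worker count are placeholders.

```python
import distributed

client = distributed.Client("tcp://scheduler:8786")  # placeholder address

# Wait-style: block until the expected number of workers has registered,
# tolerating workers that are slow to sync with the scheduler (e.g. on GKE).
client.wait_for_workers(n_workers=4, timeout=300)

# Check-style: inspect the scheduler's current view once and fail fast.
if len(client.scheduler_info()["workers"]) < 4:
    raise RuntimeError("Not all expected workers have registered yet.")
```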
@trivialfis: You raise a good point here. When using an async client, client.scheduler_info() is refreshed by a periodic callback, so it can be outdated at the time of the check. I've updated the PR to request the data directly from the scheduler, ensuring that the worker list is up to date. Generally, if a worker is alive and has executed work (e.g., loaded partitions of the DaskDMatrix), the scheduler will be aware of it.
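One way to query the scheduler directly for the current worker list, rather than relying on the cached scheduler_info(), is Client.run_on_scheduler. The helper below is hypothetical and not the PR's code; it assumes the expected worker addresses match the keys the scheduler uses.

```python
from typing import List

import distributed


def _get_live_workers(client: "distributed.Client") -> List[str]:
    # Hypothetical helper: ask the scheduler itself for its current workers
    # instead of trusting the client's periodically refreshed cache.
    return client.run_on_scheduler(lambda dask_scheduler: list(dask_scheduler.workers))
```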
Could you please take a look at the failing Python tests?
Sure, I'll try to reproduce it locally.
@trivialfis: The tests should work now.
Related issues:
- xgboost.dask.train gets stuck indefinitely if worker holding data from DaskDMatrix left (#9419)
- xgboost.dask.train gets stuck indefinitely if worker holding data from DaskDMatrix restarts (#9420)

This PR checks whether the workers expected to hold the data from DaskDMatrix are still alive, so that training jobs can then be scheduled on those workers. This avoids several deadlock scenarios. Note that this PR does not make xgboost.dask.train resilient against an expected worker leaving after the check.
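To make the change concrete, here is a hypothetical end-to-end illustration of the behaviour described above; the cluster address, data, and parameters are placeholders, and the exact exception type raised is not specified here.

```python
import dask.array as da
import xgboost as xgb
from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

X = da.random.random((1000, 10), chunks=(100, 10))
y = da.random.random(1000, chunks=100)

# Workers load their partitions of the training data here.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# If a worker holding partitions of dtrain leaves or restarts before this
# call, train() now raises instead of hanging indefinitely.
output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain, num_boost_round=10)
```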