
Raise if expected workers are not alive in xgboost.dask.train #9421

Merged (11 commits) on Aug 3, 2023

Conversation

@hendrikmakait (Contributor) commented on Jul 26, 2023

This PR checks that the workers expected to hold the data from a DaskDMatrix are still alive before training jobs are scheduled on them. This avoids several deadlock scenarios.

Note that this PR does not make xgboost.dask.train resilient against an expected worker leaving after the check.
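The core of such a pre-flight check can be sketched as a small pure helper. This is a minimal illustration with hypothetical names, not the PR's actual code; the alive set is passed in explicitly rather than fetched from a Dask scheduler so the logic stands alone:

```python
from typing import Iterable, Set


def check_workers_are_alive(expected: Iterable[str], alive: Set[str]) -> None:
    """Raise if any worker address we expect to hold data is not alive.

    Failing fast here avoids scheduling training tasks onto workers
    that have already left, which could otherwise deadlock training.
    """
    missing = set(expected) - alive
    if missing:
        raise RuntimeError(f"Missing required workers: {sorted(missing)}")
```

In the real PR the set of alive workers comes from the Dask scheduler; injecting it keeps the check itself trivially testable.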

@hendrikmakait hendrikmakait marked this pull request as ready for review July 26, 2023 12:06
@fjetter left a comment:

LGTM

@trivialfis (Member) left a comment:

Thank you for raising the issue; I have some questions in the comments.

@@ -907,6 +905,13 @@ def _filter_empty(
raise ValueError("None of the workers can provide a valid result.")


def _check_workers_are_alive(workers: List[str], client: "distributed.Client") -> None:
@trivialfis (Member) commented:

I'm not entirely sure what replacing the wait with a check does on some clusters. The wait was there for cluster types like GKE, where a worker can take some time to get in sync with the scheduler, and client.scheduler_info()["workers"] was not particularly reliable.

@hendrikmakait (Contributor, Author) commented:

@trivialfis: You raise a good point here. When using an async client, client.scheduler_info() gets refreshed by a periodic callback, which can lead to a situation where client.scheduler_info() is outdated. I've updated the PR to request the data directly from the scheduler, ensuring that workers is up-to-date. Generally, if a worker is alive and has executed work (e.g., loaded partitions of the DaskDMatrix), the scheduler will be aware of the worker.
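One way to get a fresh worker list directly from the scheduler is distributed's `Client.run_on_scheduler`, which executes a function inside the scheduler process. The sketch below uses hypothetical helper names and accepts any client-like object, so it illustrates the idea rather than reproducing the PR's implementation:

```python
from typing import List


def _alive_workers(dask_scheduler) -> List[str]:
    # Runs inside the scheduler process, so this view is always current,
    # unlike client.scheduler_info(), which an async client only refreshes
    # via a periodic callback.
    return list(dask_scheduler.workers)


def check_expected_workers(expected: List[str], client) -> None:
    # With a real distributed.Client, run_on_scheduler ships _alive_workers
    # to the scheduler (passing it the scheduler via the dask_scheduler
    # parameter) and returns the result to the client.
    alive = set(client.run_on_scheduler(_alive_workers))
    missing = set(expected) - alive
    if missing:
        raise RuntimeError(f"Expected workers have left: {sorted(missing)}")
```

Because the scheduler answers with its live worker set, the check cannot race against the client's cached copy of scheduler state.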

@trivialfis (Member) commented:

Could you please take a look into the failing Python tests?

@hendrikmakait (Contributor, Author) commented:

> Could you please take a look into the failing Python tests?

Sure, I'll try to reproduce it locally.

@hendrikmakait (Contributor, Author) commented:

@trivialfis: The tests should work now.

@trivialfis trivialfis merged commit f958e32 into dmlc:master Aug 3, 2023