
Raise if expected workers are not alive in xgboost.dask.train #9421

Merged (11 commits) on Aug 3, 2023

Conversation

@hendrikmakait (Contributor) commented on Jul 26, 2023

This PR checks that the workers expected to hold the data from a DaskDMatrix are still alive before training jobs are scheduled on them. This avoids several deadlock scenarios.

Note that this PR does not make xgboost.dask.train resilient against an expected worker leaving after the check.
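The core of such a pre-flight check can be sketched as a small pure helper. This is a minimal illustration with hypothetical names, not the PR's actual code; the alive set is passed in explicitly rather than fetched from a Dask scheduler so the logic stands alone:

```python
from typing import Iterable, Set


def check_workers_are_alive(expected: Iterable[str], alive: Set[str]) -> None:
    """Raise if any worker address we expect to hold data is not alive.

    Failing fast here avoids scheduling training tasks onto workers
    that have already left, which could otherwise deadlock training.
    """
    missing = set(expected) - alive
    if missing:
        raise RuntimeError(f"Missing required workers: {sorted(missing)}")
```

In the real PR the set of alive workers comes from the Dask scheduler; injecting it keeps the check itself trivially testable.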

@hendrikmakait hendrikmakait marked this pull request as ready for review July 26, 2023 12:06
@fjetter left a comment:

LGTM

@trivialfis (Member) left a comment:

Thank you for raising the issue; I have some questions in the comments.

@@ -907,6 +905,13 @@ def _filter_empty(
raise ValueError("None of the workers can provide a valid result.")


def _check_workers_are_alive(workers: List[str], client: "distributed.Client") -> None:
@trivialfis (Member) commented:

I'm not entirely sure what replacing the wait with a check does on some clusters. The wait was there for cluster types like GKE, where a worker can take some time to get in sync with the scheduler, and client.scheduler_info()["workers"] was not particularly reliable.

@hendrikmakait (Contributor, Author) commented:

@trivialfis: You raise a good point here. When using an async client, client.scheduler_info() gets refreshed by a periodic callback, which can lead to a situation where client.scheduler_info() is outdated. I've updated the PR to request the data directly from the scheduler, ensuring that workers is up-to-date. Generally, if a worker is alive and has executed work (e.g., loaded partitions of the DaskDMatrix), the scheduler will be aware of the worker.
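One way to get a fresh worker list directly from the scheduler is distributed's `Client.run_on_scheduler`, which executes a function inside the scheduler process. The sketch below uses hypothetical helper names and accepts any client-like object, so it illustrates the idea rather than reproducing the PR's implementation:

```python
from typing import List


def _alive_workers(dask_scheduler) -> List[str]:
    # Runs inside the scheduler process, so this view is always current,
    # unlike client.scheduler_info(), which an async client only refreshes
    # via a periodic callback.
    return list(dask_scheduler.workers)


def check_expected_workers(expected: List[str], client) -> None:
    # With a real distributed.Client, run_on_scheduler ships _alive_workers
    # to the scheduler (passing it the scheduler via the dask_scheduler
    # parameter) and returns the result to the client.
    alive = set(client.run_on_scheduler(_alive_workers))
    missing = set(expected) - alive
    if missing:
        raise RuntimeError(f"Expected workers have left: {sorted(missing)}")
```

Because the scheduler answers with its live worker set, the check cannot race against the client's cached copy of scheduler state.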

@trivialfis (Member) commented:

Could you please take a look into the failing Python tests?

@hendrikmakait (Contributor, Author) commented:

> Could you please take a look into the failing Python tests?

Sure, I'll try to reproduce it locally.

@hendrikmakait (Contributor, Author) commented:

@trivialfis: The tests should work now.

@trivialfis trivialfis merged commit f958e32 into dmlc:master Aug 3, 2023