
Change training stage from ResultStage to ShuffleMapStage #9423

Merged

Conversation

jinmfeng001
Contributor

The training step runs in barrier mode as a ResultStage, which means that when one training task fails, the whole training stage fails, and the whole application fails, because Spark does not retry a ResultStage.
When there are many training tasks, it is easy for one task to fail due to a cluster node issue, so the application may have a high failure rate.
In this PR, we add a repartition step to turn the training stage into a ShuffleMapStage, so that when the training stage fails, Spark can retry the stage and the whole application won't fail.
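The idea can be sketched in PySpark as follows (a minimal illustration with placeholder names, not the actual xgboost-pyspark code; `run_training`, `train_partition`, and `num_workers` are stand-ins, and `pyspark` is imported lazily inside the function so the sketch can be read without a Spark installation):

```python
def run_training(df, num_workers):
    """Sketch: run barrier training tasks as a ShuffleMapStage.

    Requires an active SparkSession; `df` is a placeholder DataFrame of
    training data.
    """
    result = (
        df.repartition(num_workers)       # one partition per training task
        .rdd.barrier()
        .mapPartitions(train_partition)   # barrier training tasks
        # The extra shuffle below puts the barrier tasks on the map side of
        # a shuffle, so they run as a ShuffleMapStage (which Spark can
        # resubmit on failure) instead of a ResultStage (which it cannot).
        .repartition(1)
        .collect()
    )
    return result


def train_partition(rows):
    # Placeholder for the per-worker xgboost training logic; here it just
    # counts the rows in the partition.
    yield sum(1 for _ in rows)
```

Without the trailing `repartition`, the `collect` would make the barrier tasks the final (result) stage of the job.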

@wbo4958
Contributor

wbo4958 commented Jul 26, 2023

To fix #8336

@wbo4958
Contributor

wbo4958 commented Jul 27, 2023

@jinmfeng001 Per my understanding, xgboost training is deterministic (@trivialfis, please correct me if I'm wrong). If the first training attempt fails, I guess the retry may still fail; the only difference I can think of is that the training tasks may be distributed to different workers.

BTW, have you seen the retry can really help the xgboost training?

@trivialfis
Member

Indeed, there can be many reasons for failure and xgboost doesn't do well on large shared clusters due to the lack of a failure recovery mechanism.

@wbo4958 Please help review the PR. In addition, is there a way to test?

@jinmfeng001
Contributor Author

> Indeed, there can be many reasons for failure and xgboost doesn't do well on large shared clusters due to the lack of a failure recovery mechanism.
>
> @wbo4958 Please help review the PR. In addition, is there a way to test?

In our situation, after changing it to a ShuffleMapStage, we no longer see many application failures.
It's a good question how to test this change. I don't think there's anything we can do to deliberately fail a task. Do you guys know of a way?

@wbo4958
Contributor

wbo4958 commented Jul 27, 2023

@jinmfeng001 would you like to provide the whole log with the retry working?

@jinmfeng001
Contributor Author

We're focusing on resolving failures caused by a hardware fault on a single node (in a big cluster, it's common for one node to fail due to a hardware issue), so if the task is reassigned to another node it may succeed.

In our situation, this helped reduce the failure rate for the whole training application.

@wbo4958
Contributor

wbo4958 commented Jul 27, 2023

@jinmfeng001 Thx for your explanation; just give me some time to learn it.

@wbo4958
Contributor

wbo4958 commented Aug 3, 2023

@jinmfeng001 Thx for your PR. I spent some time learning Spark's retry mechanism, and you're right: Spark won't retry a failed barrier ResultStage.

I don't know whether this PR will introduce a performance issue, since the extra shuffle stage involves disk writes and reads. I will do some tests. Thx for your time.
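For reference, once the training stage is a ShuffleMapStage, the number of consecutive resubmissions is bounded by Spark's `spark.stage.maxConsecutiveAttempts` setting (default 4). A minimal sketch of setting it explicitly (the app name is a placeholder, and `pyspark` is imported lazily so the snippet parses without a Spark installation):

```python
def build_session():
    # Lazy import: only needed when a session is actually built.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("xgboost-stage-retry")  # placeholder name
        # A failed stage is resubmitted at most this many consecutive times
        # before Spark aborts it (and the job); 4 is the default.
        .config("spark.stage.maxConsecutiveAttempts", "4")
        .getOrCreate()
    )
```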

@wbo4958
Contributor

wbo4958 commented Aug 3, 2023

I ran the mortgage test with xgboost-spark-gpu on a Spark standalone cluster with 1 worker and spark-rapids enabled. Here is the result:

|        | 1st    | 2nd    | 3rd    |
| ------ | ------ | ------ | ------ |
| W/O PR | 89.379 | 89.126 | 89.367 |
| W/ PR  | 90.173 | 89.224 | 89.131 |

Seems this PR doesn't introduce much extra overhead.

@wbo4958
Contributor

wbo4958 commented Aug 3, 2023

LGTM
