
Change training stage from ResultStage to ShuffleMapStage #9423

Merged

Conversation

jinmfeng001
Contributor

The training step runs in barrier mode as a ResultStage, which means that when one training task fails, the whole training stage fails, and the whole application fails, because Spark does not retry a ResultStage.
When there are many training tasks, it is easy for one task to fail due to a cluster node issue, so the application may have a high failure rate.
In this PR, we add a repartition step to turn the training stage into a ShuffleMapStage, so that when the training stage fails, Spark can retry the stage and the whole application won't fail.
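The idea can be sketched in PySpark as follows (a minimal illustration with placeholder names, not the actual xgboost-pyspark code; `run_training`, `train_partition`, and `num_workers` are stand-ins, and `pyspark` is imported lazily inside the function so the sketch can be read without a Spark installation):

```python
def run_training(df, num_workers):
    """Sketch: run barrier training tasks as a ShuffleMapStage.

    Requires an active SparkSession; `df` is a placeholder DataFrame of
    training data.
    """
    result = (
        df.repartition(num_workers)       # one partition per training task
        .rdd.barrier()
        .mapPartitions(train_partition)   # barrier training tasks
        # The extra shuffle below puts the barrier tasks on the map side of
        # a shuffle, so they run as a ShuffleMapStage (which Spark can
        # resubmit on failure) instead of a ResultStage (which it cannot).
        .repartition(1)
        .collect()
    )
    return result


def train_partition(rows):
    # Placeholder for the per-worker xgboost training logic; here it just
    # counts the rows in the partition.
    yield sum(1 for _ in rows)
```

Without the trailing `repartition`, the `collect` would make the barrier tasks the final (result) stage of the job.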

@wbo4958
Contributor

wbo4958 commented Jul 26, 2023

To fix #8336

@wbo4958
Contributor

wbo4958 commented Jul 27, 2023

@jinmfeng001 Per my understanding, xgboost training is deterministic (@trivialfis, please correct me if I'm wrong). If the first training attempt fails, I guess the retry may still fail; the only difference I can think of is that the training tasks may be distributed to different workers.

BTW, have you seen the retry can really help the xgboost training?

@trivialfis
Member

Indeed, there can be many reasons for failure and xgboost doesn't do well on large shared clusters due to the lack of a failure recovery mechanism.

@wbo4958 Please help review the PR. In addition, is there a way to test?

@jinmfeng001
Contributor Author

> Indeed, there can be many reasons for failure and xgboost doesn't do well on large shared clusters due to the lack of a failure recovery mechanism.
>
> @wbo4958 Please help review the PR. In addition, is there a way to test?

In our situation, after changing it to a ShuffleMapStage, we no longer see many application failures.
It's a good question how to test this change. I don't think there's anything we can do to deliberately fail a task. Do you guys know of a way?

@wbo4958
Contributor

wbo4958 commented Jul 27, 2023

@jinmfeng001 would you like to provide the whole log with the retry working?

@jinmfeng001
Contributor Author

We're focusing on resolving failures caused by a hardware fault on a single node (in a big cluster, it's common for one node to fail due to a hardware issue), so if the task is reassigned to another node it may succeed.

In our situation, this helped reduce the failure rate for the whole training application.

@wbo4958
Contributor

wbo4958 commented Jul 27, 2023

@jinmfeng001 Thx for your explanation; just give me some time to learn it.

@wbo4958
Contributor

wbo4958 commented Aug 3, 2023

@jinmfeng001 Thx for your PR. I spent some time learning Spark's retry mechanism, and you're right: Spark won't retry a failed barrier ResultStage.

I don't know whether this PR will introduce a performance issue, since the extra shuffle stage involves disk writes and reads. I will do some tests. Thx for your time.
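For reference, once the training stage is a ShuffleMapStage, the number of consecutive resubmissions is bounded by Spark's `spark.stage.maxConsecutiveAttempts` setting (default 4). A minimal sketch of setting it explicitly (the app name is a placeholder, and `pyspark` is imported lazily so the snippet parses without a Spark installation):

```python
def build_session():
    # Lazy import: only needed when a session is actually built.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("xgboost-stage-retry")  # placeholder name
        # A failed stage is resubmitted at most this many consecutive times
        # before Spark aborts it (and the job); 4 is the default.
        .config("spark.stage.maxConsecutiveAttempts", "4")
        .getOrCreate()
    )
```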

@wbo4958
Contributor

wbo4958 commented Aug 3, 2023

I ran the mortgage test with xgboost-spark-gpu on a Spark standalone cluster with 1 worker and spark-rapids enabled. Here is the result:

|        | 1st    | 2nd    | 3rd    |
| ------ | ------ | ------ | ------ |
| W/O PR | 89.379 | 89.126 | 89.367 |
| W/ PR  | 90.173 | 89.224 | 89.131 |

Seems this PR doesn't introduce much extra overhead.

@wbo4958
Contributor

wbo4958 commented Aug 3, 2023

LGTM
