Change training stage from ResultStage to ShuffleMapStage #9423
Conversation
To fix #8336
@jinmfeng001, per my understanding, xgboost training is deterministic (@trivialfis, please correct me if I'm wrong). If the first training attempt fails, I guess a retry may still fail; the only difference I can think of is that the training tasks may be distributed to different workers. BTW, have you seen the retry actually help xgboost training?
Indeed, there can be many reasons for failure, and xgboost doesn't do well on large shared clusters due to the lack of a failure recovery mechanism. @wbo4958, please help review the PR. In addition, is there a way to test this?
In our situation, we changed it to ShuffleMapStage, and we don't see many application failures now.
@jinmfeng001 would you like to provide the full log showing the retry working?
We're focusing on resolving failures caused by a hardware fault on a single node (in a big cluster, it's common for one node to fail due to some hardware issue). So if the task is reassigned to another node, it may succeed. In our situation, this helps reduce the failure rate for the whole training application.
@jinmfeng001 Thx for your explanation, just give me some time to learn it.
@jinmfeng001 Thx for your PR. I spent some time learning Spark's retry mechanism, and you're right: Spark won't retry a failed barrier ResultStage. I don't know whether this PR will introduce a performance issue, since the extra shuffle stage involves disk writes/reads. I will do some tests. Thx for your time.
I ran the mortgage test with xgboost-spark-gpu on a Spark standalone cluster with 1 worker and spark-rapids enabled. Here is the result:
It seems this PR doesn't introduce much extra overhead.
LGTM |
The training step runs in barrier mode and is a ResultStage, which means that when one of the training tasks fails, the training stage fails, and the whole application fails because there is no retry for a ResultStage.
When there are many training tasks, it's likely that one task will fail due to a cluster node issue, so the application may have a high failure rate.
In this PR, we add a repartition step to make the training stage a ShuffleMapStage, so that when the training stage fails, the stage can be retried and the whole Spark application won't fail.
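As a rough illustration of the stage-shape change (this is not the actual xgboost4j-spark code; `trainingRdd` and `trainBooster` are hypothetical placeholders for the training input and the per-partition training step), the idea looks roughly like this:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the stage-shape change; not the actual
// xgboost4j-spark code. `trainBooster` stands in for the real training step.
object BarrierStageSketch {
  def trainBooster(iter: Iterator[Int]): Iterator[Array[Byte]] =
    Iterator(iter.map(_.toByte).toArray)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("barrier-stage-sketch").getOrCreate()
    val trainingRdd: RDD[Int] = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

    // Before: barrier mapPartitions + collect. The barrier training stage is a
    // ResultStage, and a failed barrier ResultStage is not retried by Spark,
    // so one failed task fails the whole application.
    // val boosters = trainingRdd.barrier().mapPartitions(trainBooster).collect()

    // After (the idea in this PR): insert a shuffle boundary (e.g. a repartition)
    // after the barrier training, so the training runs in a ShuffleMapStage
    // that Spark can re-attempt when the stage fails.
    val boosters = trainingRdd
      .barrier()
      .mapPartitions(trainBooster)
      .repartition(1) // shuffle boundary: training becomes a ShuffleMapStage
      .collect()

    println(s"collected ${boosters.length} booster payload(s)")
    spark.stop()
  }
}
```

In this sketch only the (small) training output crosses the shuffle boundary, so the extra shuffle write/read should be modest, which is consistent with the mortgage-test observation above.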