Comet can't use CometShuffleManager on Yarn Cluster #592
@viirya Thank you for your reply. Excuse me, have you tried using Comet on a Yarn cluster? I retested TPCH and removed all the SQLs that could not run. In the end, only 11 SQLs work with the CometShuffleManager (sql2, sql6, sql7, sql12, sql13, sql16, sql18, sql19, sql21, sql22). For the SQLs that could not run, the ERROR-level messages I saw in the logs are basically similar to "24/06/20 16:01:06 ERROR YarnScheduler: Lost executor 9 on apple4: Container from a bad node: container_1718611525172_0074_01_000010 on host: apple4. Exit status: 134. Diagnostics: utputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@apple1:37337)". There are no other clear error messages; I will check later in conjunction with the Yarn cluster logs.
We don't test Comet on Yarn. I'm not sure what data scale you are running with. Can you try increasing memory and trying again? I roughly remember that
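As a starting point for the "increase memory" suggestion, something like the following could be added to the submission. The values here are illustrative, not tuned recommendations, and `spark.comet.memoryOverhead` is assumed to be available in the Comet version in use:

```shell
# Illustrative only: bump executor JVM memory plus JVM and Comet native
# memory overhead before retrying. Sizes are examples, not recommendations.
--conf spark.executor.memory=8g \
--conf spark.executor.memoryOverhead=4g \
--conf spark.comet.memoryOverhead=4g
```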
Exit status 134 is a SIGABRT. It could be caused by an OOM, but also by other reasons like a stack overflow (though that's not likely here).
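The 134 follows the shell/Yarn convention of reporting 128 plus the signal number for a signal-killed process; a quick check:

```python
import signal

# Shell/Yarn convention: a process killed by a signal exits with 128 + signo.
# SIGABRT is signal 6 on Linux, so the reported exit status is 128 + 6 = 134.
exit_status = 128 + int(signal.SIGABRT)
print(exit_status)  # 134
```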
Btw, although we don't test Comet on Yarn, since the shuffle implementation doesn't include any Yarn-specific changes, I don't see why shuffle behavior would differ there. The most likely cause, I think, is a memory issue.
This problem is not a bug.
Describe the bug
When testing the TPCH data, I used Spark 3.4.3 and submitted the Spark SQL to a Yarn cluster. If I use the parameters "--conf spark.comet.exec.shuffle.enabled=true --conf spark.comet.exec.shuffle.mode=auto --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager", the Executor crashes and exits without any specific reason visible. If I do not use these three parameters, it succeeds. Below are some partial exception messages and the spark command:
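An illustrative shape of such a submission is sketched below. The jar path, resource sizes, and application file are placeholders, not the original command; only the three shuffle-related confs are taken from the report above, and the plugin/enable confs follow the usual Comet setup:

```shell
# Sketch only: paths and the application file are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/comet-spark.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=auto \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  tpch_query.py
```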
1. Spark command:
2. Error message:
[2024-06-20 16:01:32.054]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
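To get past the truncated prelaunch.err, the full container logs can be pulled with the standard YARN CLI. The application id here is derived from the container id reported earlier (container_1718611525172_0074_...):

```shell
# Fetch the full aggregated logs for the failed application.
yarn logs -applicationId application_1718611525172_0074 > app_logs.txt

# Narrow to the specific failed container reported in the scheduler error.
yarn logs -applicationId application_1718611525172_0074 \
  -containerId container_1718611525172_0074_01_000010
```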
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response