
Comet can't use CometShuffleManager on Yarn Cluster #592

Closed
dpengpeng opened this issue Jun 21, 2024 · 6 comments

Labels
bug Something isn't working

Comments

dpengpeng commented Jun 21, 2024

Describe the bug

While testing with TPC-H data, I used Spark 3.4.3 and submitted the Spark SQL to a Yarn cluster. If I add the parameters "--conf spark.comet.exec.shuffle.enabled=true --conf spark.comet.exec.shuffle.mode=auto --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager", the executors crash and exit without any specific reason being reported. If I do not use these three parameters, the queries succeed. Below are the Spark command and some partial exception messages:

1. Spark command (see the sketch after the error message below):

2. Error message:

```text
[2024-06-20 16:01:32.054] Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
```
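
For orientation, a hypothetical sketch of how the three configs described above would be passed to spark-submit on YARN. Only the three Comet shuffle `--conf` flags come from the report; the deploy mode, the `spark.plugins` line, the Comet jar path, and the application jar are placeholders and assumptions, not the reporter's actual command:

```shell
# Hypothetical reconstruction -- only the three Comet shuffle confs are taken
# from the report; every path and the plugin line are placeholders/assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/comet-spark-spark3.4_2.12-<version>.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=auto \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  /path/to/tpch-queries.jar
```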

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

dpengpeng added the bug label on Jun 21, 2024
viirya (Member) commented Jun 21, 2024

> 24/06/20 16:01:20 WARN TaskSetManager: Lost task 135.1 in stage 106.1 (TID 2842) (apple5 executor 19): FetchFailed(null, shuffleId=20, mapIndex=-1, mapId=-1, reduceId=135, message=

FetchFailed is a warning message, not the root cause. Usually there is an underlying failure in the shuffle (map) tasks that prevents the reducers from fetching their output; you probably need to look for it in the logs.
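
As a sketch of how the underlying error could be dug out of YARN's aggregated logs (the application and container IDs are placeholders, and YARN log aggregation must be enabled for `yarn logs` to return anything):

```shell
# Placeholders: substitute the real IDs from the Spark UI or the ResourceManager.
yarn logs -applicationId <applicationId> > app.log

# Or pull only the container that exited with status 134:
yarn logs -applicationId <applicationId> -containerId <containerId> > executor.log

# Look for the first fatal error preceding the FetchFailed warnings:
grep -nE "ERROR|SIGABRT|OutOfMemory|Killed" executor.log | head
```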

dpengpeng (Author) commented Jun 21, 2024

> 24/06/20 16:01:20 WARN TaskSetManager: Lost task 135.1 in stage 106.1 (TID 2842) (apple5 executor 19): FetchFailed(null, shuffleId=20, mapIndex=-1, mapId=-1, reduceId=135, message=
>
> FetchFailed is a warning message, not the root cause. Usually there is an underlying failure in the shuffle (map) tasks that prevents the reducers from fetching their output; you probably need to look for it in the logs.

@viirya Thank you for your reply.

Excuse me, have you tried using Comet on a Yarn cluster? I retested TPC-H after removing all the SQLs that could not run. In the end, only 11 SQLs work with the CometShuffleManager (sql2, sql6, sql7, sql12, sql13, sql16, sql18, sql19, sql21, sql22). For the SQLs that could not run, the ERROR-level entries I see in the logs are basically all similar to "24/06/20 16:01:06 ERROR YarnScheduler: Lost executor 9 on apple4: Container from a bad node: container_1718611525172_0074_01_000010 on host: apple4. Exit status: 134. Diagnostics: utputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@apple1:37337)"; there are no other clear error messages. I will check again later together with the Yarn cluster logs.

viirya (Member) commented Jun 21, 2024

We don't test Comet on Yarn.

I'm not sure what data scale you are running at. Can you try increasing memory and running again? I roughly remember that Exit status: 134 happens when a container runs out of memory.

parthchandra (Contributor) commented

> We don't test Comet on Yarn.
>
> I'm not sure what data scale you are running at. Can you try increasing memory and running again? I roughly remember that Exit status: 134 happens when a container runs out of memory.

Exit status 134 is a SIGABRT (128 + signal 6). It could be caused by OOM, but also by other things like a stack overflow (though that's not likely here).

viirya (Member) commented Jun 21, 2024

Btw, although we don't test Comet on Yarn, the shuffle implementation doesn't contain any Yarn-specific changes, so I don't see why the shuffle behavior would differ. The most likely issue, I think, is memory.
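
If memory is the culprit, a minimal sketch of the knobs one might raise when retrying, assuming YARN is killing the executor containers for exceeding their allocation. The sizes are arbitrary examples, and `spark.comet.memoryOverhead` is assumed from the Comet tuning guide, so verify the exact name for the version in use:

```shell
# Example sizes only -- tune to the actual TPC-H scale factor and cluster.
# Comet allocates native (off-heap) memory, which counts against the YARN
# container's overhead rather than the executor heap, so raise both.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.comet.memoryOverhead=4g \
  /path/to/tpch-queries.jar   # plus the Comet shuffle confs from the report
```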

dpengpeng (Author) commented

This problem is not a bug.
