
Comet can't use CometShuffleManager on Yarn Cluster #592

Closed
dpengpeng opened this issue Jun 21, 2024 · 6 comments

Labels
bug Something isn't working

Comments

dpengpeng commented Jun 21, 2024

Describe the bug

While testing with TPC-H data, I used Spark 3.4.3 and submitted the Spark SQL to a Yarn cluster. If I add the parameters "--conf spark.comet.exec.shuffle.enabled=true --conf spark.comet.exec.shuffle.mode=auto --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager", the executors crash and exit without any specific reason being reported. If I do not use these three parameters, the queries succeed. Below are the Spark command and some partial exception messages:

1. Spark command (see the sketch after the error message below):

2. Error message:

```text
[2024-06-20 16:01:32.054] Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
```
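
For orientation, a hypothetical sketch of how the three configs described above would be passed to spark-submit on YARN. Only the three Comet shuffle `--conf` flags come from the report; the deploy mode, the `spark.plugins` line, the Comet jar path, and the application jar are placeholders and assumptions, not the reporter's actual command:

```shell
# Hypothetical reconstruction -- only the three Comet shuffle confs are taken
# from the report; every path and the plugin line are placeholders/assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /path/to/comet-spark-spark3.4_2.12-<version>.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=auto \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  /path/to/tpch-queries.jar
```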

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

dpengpeng added the bug label on Jun 21, 2024
viirya (Member) commented Jun 21, 2024

> 24/06/20 16:01:20 WARN TaskSetManager: Lost task 135.1 in stage 106.1 (TID 2842) (apple5 executor 19): FetchFailed(null, shuffleId=20, mapIndex=-1, mapId=-1, reduceId=135, message=

FetchFailed is a warning message, not the root cause. Usually there is an underlying failure in the shuffle (map) tasks that prevents the reducers from fetching their output; you probably need to look for it in the logs.
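
As a sketch of how the underlying error could be dug out of YARN's aggregated logs (the application and container IDs are placeholders, and YARN log aggregation must be enabled for `yarn logs` to return anything):

```shell
# Placeholders: substitute the real IDs from the Spark UI or the ResourceManager.
yarn logs -applicationId <applicationId> > app.log

# Or pull only the container that exited with status 134:
yarn logs -applicationId <applicationId> -containerId <containerId> > executor.log

# Look for the first fatal error preceding the FetchFailed warnings:
grep -nE "ERROR|SIGABRT|OutOfMemory|Killed" executor.log | head
```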

dpengpeng (Author) commented Jun 21, 2024

> 24/06/20 16:01:20 WARN TaskSetManager: Lost task 135.1 in stage 106.1 (TID 2842) (apple5 executor 19): FetchFailed(null, shuffleId=20, mapIndex=-1, mapId=-1, reduceId=135, message=
>
> FetchFailed is a warning message, not the root cause. Usually there is an underlying failure in the shuffle (map) tasks that prevents the reducers from fetching their output; you probably need to look for it in the logs.

@viirya Thank you for your reply.

Excuse me, have you tried using Comet on a Yarn cluster? I retested TPC-H after removing all the SQLs that could not run. In the end, only 11 SQLs work with the CometShuffleManager (sql2, sql6, sql7, sql12, sql13, sql16, sql18, sql19, sql21, sql22). For the SQLs that could not run, the ERROR-level entries I see in the logs are basically all similar to "24/06/20 16:01:06 ERROR YarnScheduler: Lost executor 9 on apple4: Container from a bad node: container_1718611525172_0074_01_000010 on host: apple4. Exit status: 134. Diagnostics: utputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@apple1:37337)"; there are no other clear error messages. I will check again later together with the Yarn cluster logs.

viirya (Member) commented Jun 21, 2024

We don't test Comet on Yarn.

I'm not sure what data scale you are running at. Can you try increasing memory and running again? I roughly remember that Exit status: 134 happens when a container runs out of memory.

parthchandra (Contributor) commented

> We don't test Comet on Yarn.
>
> I'm not sure what data scale you are running at. Can you try increasing memory and running again? I roughly remember that Exit status: 134 happens when a container runs out of memory.

Exit status 134 is a SIGABRT (128 + signal 6). It could be caused by OOM, but also by other things like a stack overflow (though that's not likely here).

viirya (Member) commented Jun 21, 2024

Btw, although we don't test Comet on Yarn, the shuffle implementation doesn't contain any Yarn-specific changes, so I don't see why the shuffle behavior would differ. The most likely issue, I think, is memory.
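
If memory is the culprit, a minimal sketch of the knobs one might raise when retrying, assuming YARN is killing the executor containers for exceeding their allocation. The sizes are arbitrary examples, and `spark.comet.memoryOverhead` is assumed from the Comet tuning guide, so verify the exact name for the version in use:

```shell
# Example sizes only -- tune to the actual TPC-H scale factor and cluster.
# Comet allocates native (off-heap) memory, which counts against the YARN
# container's overhead rather than the executor heap, so raise both.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.comet.memoryOverhead=4g \
  /path/to/tpch-queries.jar   # plus the Comet shuffle confs from the report
```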

dpengpeng (Author) commented

This problem is not a bug.
