TEZ-4336: ShuffleScheduler should try to report the original exception (when shuffle becomes unhealthy) #155

Merged
merged 1 commit into from
Nov 2, 2021


@abstractdog abstractdog commented Oct 18, 2021

This PR contains two changes:

  1. adds the original cause to InputAttemptFetchFailure
  2. reports hostFailures to the AM
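To make the two changes concrete, here is a minimal sketch of the idea, not the actual Tez code: the class `ShuffleHealthSketch` and its methods are hypothetical stand-ins. It only assumes that the scheduler remembers the last fetch failure cause and a per-host failure count, then attaches both when it builds the "shuffle unhealthy" exception.

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names do not match the real ShuffleScheduler.
class ShuffleHealthSketch {
    // Per-host failure counts, keyed by "host:port" (assumed key format).
    private final Map<String, Integer> hostFailures = new TreeMap<>();
    private int failureCounts = 0;
    // The most recent low-level failure, kept so it can be chained later.
    private Throwable lastCause;

    // Record one failed fetch from the given host together with its cause.
    void copyFailed(String hostPort, Throwable cause) {
        failureCounts++;
        hostFailures.merge(hostPort, 1, Integer::sum);
        lastCause = cause;
    }

    // Build the exception reported upstream: the message carries the host
    // failure map, and the original cause is chained instead of dropped.
    IOException unhealthyException(String srcName) {
        String msg = srcName
            + ": Shuffle failed with too many fetch failures and insufficient progress!"
            + "failureCounts=" + failureCounts
            + ", hostFailures=" + hostFailures;
        return new IOException(msg, lastCause);
    }

    public static void main(String[] args) {
        ShuffleHealthSketch sketch = new ShuffleHealthSketch();
        IOException connectionProblem = new IOException("connection refused (example)");
        sketch.copyFailed("host-a:1", connectionProblem);
        sketch.copyFailed("host-b:2", connectionProblem);
        // Prints the stack trace including a "Caused by:" section.
        sketch.unhealthyException("Map_2").printStackTrace();
    }
}
```

Chaining the cause through the `IOException(String, Throwable)` constructor is what makes the extra "Caused by:" section show up in the client output.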

Before the change, the exception in the Hive client was:

Caused by: java.io.IOException: Map_2: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=45, pendingInputs=5947, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1060)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:798)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:391)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGroupedWithInjectableErrors.callInternal(FetcherOrderedGroupedWithInjectableErrors.java:31)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
	... 7 more
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_2} #1
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map_2: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=45, pendingInputs=5947, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1060)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:798)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:391)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGroupedWithInjectableErrors.callInternal(FetcherOrderedGroupedWithInjectableErrors.java:31)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
	... 7 more

After the change, the original cause is visible and hostFailures are also reported, e.g.:

], TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_2} #3
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map_2: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=133, pendingInputs=5991, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false, hostFailures={ccycloud-5.hive-runtime-perf.root.hwx.site:33418=41, ccycloud-9.hive-runtime-perf.root.hwx.site:36057=41, ccycloud-6.hive-runtime-perf.root.hwx.site:45940=38}
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1062)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:799)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:391)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGroupedWithInjectableErrors.callInternal(FetcherOrderedGroupedWithInjectableErrors.java:31)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
	... 7 more
Caused by: java.io.IOException: FetcherOrderedGroupedWithInjectableErrors tester made failure for host: ccycloud-6.hive-runtime-perf.root.hwx.site, input attempt: 0
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGroupedWithInjectableErrors.setupConnectionInternal(FetcherOrderedGroupedWithInjectableErrors.java:59)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:362)
	... 12 more
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_2} #3
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The full log is uploaded to Jira: https://issues.apache.org/jira/secure/attachment/13035045/TEZ_4336_client_output.txt
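For anyone who wants to pull the per-host counts out of such a message when triaging, a rough sketch follows. It assumes the `hostFailures={host:port=count, ...}` layout seen in the single log line above; that layout is not a documented or stable format, and `HostFailuresParseSketch` is a hypothetical helper, not part of Tez.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the message layout is inferred from one log sample.
class HostFailuresParseSketch {
    private static final Pattern HOST_FAILURES =
        Pattern.compile("hostFailures=\\{([^}]*)\\}");

    // Returns "host:port" -> failure count, or an empty map if the field is absent.
    static Map<String, Integer> parse(String message) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = HOST_FAILURES.matcher(message);
        if (!m.find() || m.group(1).isEmpty()) {
            return result;
        }
        for (String entry : m.group(1).split(",\\s*")) {
            // The key itself contains ':', so split on the last '='.
            int eq = entry.lastIndexOf('=');
            if (eq < 0) {
                continue; // skip anything that does not look like key=value
            }
            result.put(entry.substring(0, eq),
                       Integer.parseInt(entry.substring(eq + 1).trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        String msg = "failureCounts=3, hostFailures={host-a:1=2, host-b:2=1}";
        System.out.println(parse(msg));
    }
}
```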

Here, the root cause was my injected error (I also had the WIP TEZ-4338 on the cluster, which includes this feature):

FetcherOrderedGroupedWithInjectableErrors tester made failure for host

Under real-life circumstances, this exception could be any kind of connection problem.
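Since the original cause is now chained, a caller can walk the chain to reach it. The following is a small illustrative sketch using only `Throwable.getCause()`; `RootCauseSketch` and the exception messages are made up for the example and do not come from Tez.

```java
import java.io.IOException;

// Hypothetical helper showing how a chained cause can be recovered.
class RootCauseSketch {
    // Follow getCause() until the innermost throwable is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Mimics the three levels in the trace above: injected fetch error,
        // "shuffle unhealthy" IOException, and the outer shuffle error.
        IOException injected = new IOException("tester made failure for host: example-host");
        IOException unhealthy = new IOException("Shuffle failed with too many fetch failures", injected);
        RuntimeException shuffleError = new RuntimeException("error in shuffle in Fetcher_O", unhealthy);

        System.out.println(rootCause(shuffleError).getMessage());
    }
}
```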

@tez-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| +0 🆗 | reexec | 16m 58s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 12m 47s | master passed |
| +1 💚 | compile | 0m 35s | master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | compile | 0m 33s | master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | checkstyle | 1m 7s | master passed |
| +1 💚 | javadoc | 0m 44s | master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 0m 31s | master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +0 🆗 | spotbugs | 1m 25s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 💚 | findbugs | 1m 24s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 20s | the patch passed |
| +1 💚 | compile | 0m 21s | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javac | 0m 21s | the patch passed |
| +1 💚 | compile | 0m 19s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | javac | 0m 19s | the patch passed |
| +1 💚 | checkstyle | 0m 15s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | javadoc | 0m 18s | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 0m 17s | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 💚 | findbugs | 0m 51s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 5m 22s | tez-runtime-library in the patch passed. |
| +1 💚 | asflicense | 0m 14s | The patch does not generate ASF License warnings. |
| | | 43m 50s | |

| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-155/1/artifact/out/Dockerfile |
| GITHUB PR | #155 |
| JIRA Issue | TEZ-4336 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux bd0d018440b9 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 58fca8b |
| Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-155/1/testReport/ |
| Max. process+thread count | 2099 (vs. ulimit of 5500) |
| modules | C: tez-runtime-library U: tez-runtime-library |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-155/1/console |
| versions | git=2.25.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@rbalamohan rbalamohan left a comment


+1. LGTM for ordered shuffle scenario.

One thing to check later is the exception code path for the unordered case, adding the same reporting there as needed.

@abstractdog abstractdog merged commit 6863a2d into apache:master Nov 2, 2021
mark-bathori pushed a commit to mark-bathori/tez that referenced this pull request Feb 3, 2022
…uffle becomes unhealthy) (apache#155) (Laszlo Bodor reviewed by Rajesh Balamohan)

(cherry picked from commit 6863a2d)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 11, 2024
…n (when shuffle becomes unhealthy) (apache#155) (Laszlo Bodor reviewed by Rajesh Balamohan)

(cherry picked from commit 6863a2d)
prabhjyotsingh pushed a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
…n (when shuffle becomes unhealthy) (apache#155) (Laszlo Bodor reviewed by Rajesh Balamohan)

(cherry picked from commit 6863a2d)
(cherry picked from commit 436a790)
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
…l exception (when shuffle becomes unhealthy) (apache#155) (Laszlo Bodor reviewed by Rajesh Balamohan) (#16)

(cherry picked from commit 6863a2d)
(cherry picked from commit 436a790)

Co-authored-by: Bodor Laszlo <[email protected]>
shubhluck pushed a commit to acceldata-io/tez that referenced this pull request Nov 21, 2024
…l exception (when shuffle becomes unhealthy) (apache#155) (Laszlo Bodor reviewed by Rajesh Balamohan) (#16)

(cherry picked from commit 6863a2d)
(cherry picked from commit 436a790)

Co-authored-by: Bodor Laszlo <[email protected]>