[SPARK-2403] Catch all errors during serialization in DAGScheduler #1329
Conversation
Can one of the admins verify this patch?
```diff
@@ -768,6 +768,10 @@ class DAGScheduler(
         abortStage(stage, "Task not serializable: " + e.toString)
         runningStages -= stage
         return
+      case e: Throwable => // Other exceptions, such as IllegalArgumentException from Kryo.
```
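For context, this hunk lands in the pre-flight serialization check in `DAGScheduler.submitMissingTasks`. A simplified sketch of that surrounding code (paraphrased, not the verbatim source; the message strings are illustrative): one task is serialized up front so that serialization failures abort the stage instead of killing the scheduler's event loop.

```scala
// Simplified sketch of the pre-flight serialization check (Spark 1.x era).
try {
  // Serialize one task preemptively to surface serialization problems early.
  SparkEnv.get.closureSerializer.newInstance().serialize(tasks.head)
} catch {
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString)
    runningStages -= stage
    return
  // The clause added by this patch, as originally proposed:
  case e: Throwable => // Other exceptions, such as IllegalArgumentException from Kryo.
    abortStage(stage, "Task serialization failed: " + e.toString)
    runningStages -= stage
    return
}
```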
Please catch NonFatal(e) instead. I think we should catch StackOverflowError here (as that is a possible error during serialization), but we should not catch OOMs and other such throwables except to re-throw them.
NB: Despite what the documentation says, NonFatal does indeed seem to catch StackOverflowError:
```scala
scala> NonFatal(new StackOverflowError())
res1: Boolean = true

scala> NonFatal(new OutOfMemoryError())
res2: Boolean = false
```
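The same check as a self-contained snippet, runnable outside the REPL (the object name is arbitrary):

```scala
import scala.util.control.NonFatal

object NonFatalCheck extends App {
  // NonFatal.apply returns true when the throwable is classified as non-fatal.
  println(NonFatal(new StackOverflowError())) // true on Scala 2.10 (see the caveat below)
  println(NonFatal(new OutOfMemoryError()))   // false: OOM is always treated as fatal
}
```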
I suspect you are testing this on 2.10. Looks like a change in 2.11:
scala/scala@6460365#diff-ff42321ce198f97308744271b7e17c76
I think their argument applies to Spark too. It sounds like it is not safe to try to recover from StackOverflowError.
Thanks for the comments! I'll update the pull request in a moment.
Jenkins, ok to test.
Merged build triggered.
Merged build started.
Thanks! I've added the suggested changes.
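Concretely, the revised clause looks roughly like this (a sketch; the exact message text may differ from the committed code):

```scala
// (requires `import scala.util.control.NonFatal`)
// Replaces the `case e: Throwable` clause from the first version of the
// patch. Fatal errors such as OutOfMemoryError now propagate instead of
// being caught; everything non-fatal still aborts the stage.
case NonFatal(e) => // Other exceptions, such as IllegalArgumentException from Kryo.
  abortStage(stage, "Task serialization failed: " + e.toString)
  runningStages -= stage
  return
```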
Merged build triggered.
Merged build started.
LGTM. Regarding the initial problem you observed, did you see the actual exception via the DAGScheduler's OneForOneStrategy failure? Or were there no log messages containing the error?
Merged build finished. All automated tests passed.
Yes, the exception was logged from OneForOneStrategy. See the stack trace in https://issues.apache.org/jira/browse/SPARK-2403. (Well, except I omitted the first line which names OneForOneStrategy. Sorry about that.) But after logging that, the system stalled.
Great, thanks! I just wanted to make sure it was actually printed somewhere, although I understand the behavior was not ideal.
Merged into master and branch-1.0. Thanks!
https://issues.apache.org/jira/browse/SPARK-2403

Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.

I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.

Author: Daniel Darabos <[email protected]>

Closes #1329 from darabos/spark-2403 and squashes the following commits:

3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.

(cherry picked from commit c8a2313)
Signed-off-by: Aaron Davidson <[email protected]>
Merged build finished. All automated tests passed.
https://issues.apache.org/jira/browse/SPARK-2403

Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.

I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.

Author: Daniel Darabos <[email protected]>

Closes apache#1329 from darabos/spark-2403 and squashes the following commits:

3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.
[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes

### What changes were proposed in this pull request?
This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases.

Example: consider this join of three tables:

```sql
SELECT t2id, t3.id as t3id
FROM (
  SELECT t1.id as t1id, t2.id as t2id
  FROM t1, t2
  WHERE t1.id = t2.id
) t12, t3
WHERE t1id = t3.id
```

The plan for this looks like:

```
*(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
+- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
   :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <------------------------------
   :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
   :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
   :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
   :           :     +- *(1) Range (0, 10, step=1, splits=2)
   :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
   :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
   :                 +- *(3) Range (0, 20, step=1, splits=2)
   +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
         +- *(7) Range (0, 30, step=1, splits=2)
```

In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because the AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project.

### Why are the changes needed?
To remove unneeded exchanges.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT added. On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
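A minimal sketch to reproduce the shape of the plan above (assumptions: 5 shuffle partitions to match the `hashpartitioning(..., 5)` in the plan, broadcast joins disabled to force sort-merge joins; the object name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object Spark33399Repro extends App {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("spark-33399-repro")
    .config("spark.sql.shuffle.partitions", "5")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") // force sort-merge joins
    .getOrCreate()

  // The three Range inputs visible in the plan above.
  spark.range(0, 10, 1, 2).toDF("id").createOrReplaceTempView("t1")
  spark.range(0, 20, 1, 2).toDF("id").createOrReplaceTempView("t2")
  spark.range(0, 30, 1, 2).toDF("id").createOrReplaceTempView("t3")

  spark.sql("""
    SELECT t2id, t3.id AS t3id
    FROM (
      SELECT t1.id AS t1id, t2.id AS t2id
      FROM t1, t2
      WHERE t1.id = t2.id
    ) t12, t3
    WHERE t1id = t3.id
  """).explain()
}
```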
[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes (#1092)

* [SPARK-31078][SQL] Respect aliases in output ordering

Currently, in the following scenario, an unnecessary `Sort` node is introduced:

```scala
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  val df = (0 until 20).toDF("i").as("df")
  df.repartition(8, df("i")).write.format("parquet")
    .bucketBy(8, "i").sortBy("i").saveAsTable("t")
  val t1 = spark.table("t")
  val t2 = t1.selectExpr("i as ii")
  t1.join(t2, t1("i") === t2("ii")).explain
}
```

```
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0   <==== UNNECESSARY
   +- *(2) Project [i#8 AS ii#10]
      +- *(2) Filter isnotnull(i#8)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

Notice that `Sort [ii#10 ASC NULLS FIRST], false, 0` is introduced even though the underlying data is already sorted. This is because `outputOrdering` doesn't handle aliases correctly. This PR proposes to fix this issue, to better handle aliases in `outputOrdering`.

Yes, now with the fix, the `explain` prints out the following:

```
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Project [i#8 AS ii#10]
   +- *(2) Filter isnotnull(i#8)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

Tests added.

Closes #27842 from imback82/alias_aware_sort_order.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes

This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases.

Example: consider this join of three tables:

```sql
SELECT t2id, t3.id as t3id
FROM (
  SELECT t1.id as t1id, t2.id as t2id
  FROM t1, t2
  WHERE t1.id = t2.id
) t12, t3
WHERE t1id = t3.id
```

The plan for this looks like:

```
*(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
+- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
   :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <------------------------------
   :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
   :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
   :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
   :           :     +- *(1) Range (0, 10, step=1, splits=2)
   :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
   :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
   :                 +- *(3) Range (0, 20, step=1, splits=2)
   +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
         +- *(7) Range (0, 30, step=1, splits=2)
```

In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because the AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project.

To remove unneeded exchanges.

No

New UT added. On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>

* [CARMEL-6306] Fix ut

* [CARMEL-6306] Fix alias not compatible with ebay skew implementation

Co-authored-by: Terry Kim <[email protected]>
Co-authored-by: Prakhar Jain <[email protected]>
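The normalization idea in isolation, as a toy sketch; these case classes are illustrative stand-ins, not Spark's internal expression types. A Project that aliases `id#996L` to `t1id#1033L` lets the child's `hashpartitioning(id#996L)` be reported as `hashpartitioning(t1id#1033L)`, so the parent join sees its required distribution as already satisfied:

```scala
object AliasNormalizationSketch extends App {
  // Toy model of an attribute in a query plan.
  case class Attr(name: String)

  // aliases: child attribute -> attribute it is exposed as after the Project.
  def normalize(partitioning: Seq[Attr], aliases: Map[Attr, Attr]): Seq[Attr] =
    partitioning.map(attr => aliases.getOrElse(attr, attr))

  val aliases = Map(
    Attr("id#996L")  -> Attr("t1id#1033L"),
    Attr("id#1000L") -> Attr("t2id#1034L"))

  // The child is hash-partitioned on id#996L; after the Project it is
  // equivalently partitioned on t1id#1033L, which is what the join needs.
  println(normalize(Seq(Attr("id#996L")), aliases)) // List(Attr(t1id#1033L))
}
```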
https://issues.apache.org/jira/browse/SPARK-2403
Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.
I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.
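For reference, a hypothetical minimal reproduction of the hang (class and object names are illustrative, and the exact trigger depends on the serializer configuration; the idea is that with registration required, Kryo raises IllegalArgumentException rather than NotSerializableException during serialization):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A class we "forget" to register with Kryo.
case class Unregistered(x: Int)

object Spark2403Repro extends App {
  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("spark-2403-repro")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true") // unregistered classes now throw

  val sc = new SparkContext(conf)

  // Before this patch, an IllegalArgumentException raised during serialization
  // escaped the NotSerializableException-only catch in DAGScheduler and the
  // job stalled; with the patch the stage is aborted with the error instead.
  sc.parallelize(1 to 10).map(Unregistered(_)).collect()
}
```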