[SPARK-2423] Clean up SparkSubmit for readability #1349
Conversation
This also moves the code block for python applications back to the right place.
Merged build started.
Merged build triggered.
* Main gateway of launching a Spark application.
*
* This script handles setting up the classpath with relevant Spark dependencies and provides
* a layer over the different cluster managers and deploy modes that Spark supports.
I think saying "this script" is a little confusing. If the class encapsulated all the logic behind the script, it might make a little more sense, but the script has some logic of its own.
I agree. I'll call it a program or something
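For reference, the reworded header could read roughly as follows. This is a sketch based on the review thread and the later "script -> program" squash commit, not necessarily the exact merged text:

```scala
/**
 * Main gateway of launching a Spark application.
 *
 * This program handles setting up the classpath with relevant Spark dependencies
 * and provides a layer over the different cluster managers and deploy modes that
 * Spark supports.
 */
object SparkSubmit {
  // ...
}
```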
These changes seem reasonable to me.
Merged build finished. All automated tests passed.
All automated tests passed.
An important goal of this PR is to keep the functionality the same. I have reviewed my own changes closely to verify that the logic is preserved, but it would be best if others can also take a close look to make sure I didn't miss any case. I think @sryza and @pwendell are good candidates (though it looks like one of you already did). I will do an end-to-end sanity test on each of the platforms to make sure this doesn't break anything.
I found a small bug by testing on YARN. In the yarn code block, we set the deploy mode to "cluster" when the master is "yarn-cluster" and the deploy mode is not specified. However, by then we have already passed the error checks that prevent users from launching python and shell applications in cluster mode, so a shell or python application run with a "yarn-cluster" master slips through. This is fixed by re-ordering the two code blocks, as sketched below.
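To make the ordering issue concrete, here is a minimal, self-contained Scala sketch. The names and structure are illustrative only, not the actual SparkSubmit code; the point is that the yarn-cluster normalization must run before the cluster-mode error checks:

```scala
object DeployModeCheckOrder {
  def resolve(master: String, deployModeOpt: Option[String], isShell: Boolean): String = {
    // Normalize first: "--master yarn-cluster" with no explicit deploy mode
    // implies cluster deploy mode.
    val deployMode = (master, deployModeOpt) match {
      case ("yarn-cluster", None) => "cluster"
      case (_, Some(mode))        => mode
      case _                      => "client"
    }
    // With the old ordering this check ran before the normalization above,
    // saw no explicit deploy mode, and let yarn-cluster shells slip past.
    if (isShell && deployMode == "cluster")
      sys.error("Cluster deploy mode is not applicable to Spark shells.")
    deployMode
  }

  def main(args: Array[String]): Unit = {
    println(resolve("yarn-client", None, isShell = true))  // client: allowed
    // resolve("yarn-cluster", None, isShell = true)       // now fails fast, as intended
  }
}
```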
QA tests have started for PR 1349. This patch merges cleanly.
As of this comment, I have tested these changes in local, standalone and yarn modes, running with additional python files and jars and various configs, and everything behaves as expected. I do not have access to a mesos cluster, but there is basically no mesos-specific code in spark submit, so I believe it's OK. As expected, standalone cluster mode still doesn't work. This is to be fixed in a future PR that builds on top of this one.
QA tests have started for PR 1349. This patch merges cleanly.
QA results for PR 1349:
QA results for PR 1349:
Thanks Andrew, looks good!
It is currently non-trivial to trace through how different combinations of cluster managers (e.g. yarn) and deploy modes (e.g. cluster) are processed in SparkSubmit. Moving forward, it will be easier to extend SparkSubmit if we first re-organize the code by grouping related logic together. This is a precursor to fixing standalone-cluster mode, which is currently broken (SPARK-2260).

Author: Andrew Or <[email protected]>

Closes apache#1349 from andrewor14/submit-cleanup and squashes the following commits:

8f99200 [Andrew Or] script -> program (minor)
30f2e65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-cleanup
fe484a1 [Andrew Or] Move deploy mode checks after yarn code
7167824 [Andrew Or] Re-order config options and update comments
0b01ff8 [Andrew Or] Clean up SparkSubmit for readability
[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes

### What changes were proposed in this pull request?

This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases.

Example: consider this join of three tables:

```sql
SELECT t2id, t3.id as t3id
FROM (
  SELECT t1.id as t1id, t2.id as t2id
  FROM t1, t2
  WHERE t1.id = t2.id
) t12, t3
WHERE t1id = t3.id
```

The plan for this looks like:

```
*(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
+- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
   :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <------------------------------
   :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
   :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
   :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
   :           :     +- *(1) Range (0, 10, step=1, splits=2)
   :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
   :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
   :                 +- *(3) Range (0, 20, step=1, splits=2)
   +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
         +- *(7) Range (0, 30, step=1, splits=2)
```

In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because the AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project.

### Why are the changes needed?

To remove unneeded exchanges.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT added. On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
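As a rough illustration of the normalization (a toy sketch, not Spark's actual AliasAwareOutputPartitioning trait; all names here are made up), the idea is to rewrite the child's partitioning expressions through the Project's aliases so the partitioning survives the rename:

```scala
object AliasNormalizationSketch {
  // A Project output column of the form "childColumn AS outputColumn".
  final case class Alias(childColumn: String, outputColumn: String)

  // Rewrite partitioning columns through the aliases; columns that are
  // simply passed through keep their names.
  def normalizePartitioning(partitionCols: Seq[String], aliases: Seq[Alias]): Seq[String] = {
    val renamed = aliases.map(a => a.childColumn -> a.outputColumn).toMap
    partitionCols.map(c => renamed.getOrElse(c, c))
  }

  def main(args: Array[String]): Unit = {
    // The child is hash-partitioned on "id" and the Project emits "id AS t1id":
    // after normalization the output is still hash-partitioned, now on "t1id",
    // so a parent requiring hashpartitioning(t1id) needs no extra Exchange.
    println(normalizePartitioning(Seq("id"), Seq(Alias("id", "t1id"))))  // List(t1id)
  }
}
```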
[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes (#1092)

* [SPARK-31078][SQL] Respect aliases in output ordering

Currently, in the following scenario, an unnecessary `Sort` node is introduced:

```scala
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  val df = (0 until 20).toDF("i").as("df")
  df.repartition(8, df("i")).write.format("parquet")
    .bucketBy(8, "i").sortBy("i").saveAsTable("t")
  val t1 = spark.table("t")
  val t2 = t1.selectExpr("i as ii")
  t1.join(t2, t1("i") === t2("ii")).explain
}
```

```
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0    <==== UNNECESSARY
   +- *(2) Project [i#8 AS ii#10]
      +- *(2) Filter isnotnull(i#8)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

Notice that `Sort [ii#10 ASC NULLS FIRST], false, 0` is introduced even though the underlying data is already sorted. This is because `outputOrdering` doesn't handle aliases correctly. This PR proposes to fix this issue by better handling aliases in `outputOrdering`. With the fix, `explain` prints out the following:

```
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Project [i#8 AS ii#10]
   +- *(2) Filter isnotnull(i#8)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8
```

Tests added.

Closes #27842 from imback82/alias_aware_sort_order.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes

This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases. The example query, the plan, and the analysis are the same as in the SPARK-33399 commit message quoted above: the marked exchange could have been avoided as the data is already partitioned on "t1.id", which the AliasAwareOutputPartitioning class missed because it handled aliases only for HashPartitioning; this change normalizes all output partitioning based on aliasing happening in Project. It introduces no user-facing change. A new UT was added, and on TPCDS 1000 scale this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>

* [CARMEL-6306] Fix ut

* [CARMEL-6306] Fix alias not compatible with ebay skew implementation

Co-authored-by: Terry Kim <[email protected]>
Co-authored-by: Prakhar Jain <[email protected]>
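The sort-order half of the change follows the same pattern as the partitioning sketch above. A toy sketch (again illustrative, not Spark's actual SortOrder machinery):

```scala
object SortOrderNormalizationSketch {
  final case class SortCol(name: String, ascending: Boolean = true)

  // Rewrite the ordering columns through the Project's aliases, mirroring
  // the partitioning case.
  def normalizeOrdering(ordering: Seq[SortCol], aliases: Map[String, String]): Seq[SortCol] =
    ordering.map(s => s.copy(name = aliases.getOrElse(s.name, s.name)))

  def main(args: Array[String]): Unit = {
    // The child is sorted on "i" and the Project emits "i AS ii": the output
    // is still sorted, now on "ii", so no extra Sort node is required.
    println(normalizeOrdering(Seq(SortCol("i")), Map("i" -> "ii")))  // List(SortCol(ii,true))
  }
}
```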
It is currently non-trivial to trace through how different combinations of cluster managers (e.g. yarn) and deploy modes (e.g. cluster) are processed in SparkSubmit. Moving forward, it will be easier to extend SparkSubmit if we first re-organize the code by grouping related logic together.
This is a precursor to fixing standalone-cluster mode, which is currently broken (SPARK-2260).