Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning #6256

Closed
wants to merge 5 commits into from

Conversation

andrewor14
Copy link
Contributor

According to @yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

@andrewor14
Copy link
Contributor Author

Also cc/ @pwendell

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 19, 2015

Test build #33060 has started for PR 6256 at commit e92dff2.

@andrewor14
Copy link
Contributor Author

@yhuai can you try this and see if you can notice the speedup? (Does the closure get serialized correctly?)

@liancheng
Copy link
Contributor

@andrewor14 I can do a local microbenchmark for this.

@andrewor14 andrewor14 force-pushed the sql-partition-speed-up branch from e92dff2 to f7fe143 Compare May 19, 2015 06:31
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 19, 2015

Test build #33061 has started for PR 6256 at commit f7fe143.

@SparkQA
Copy link

SparkQA commented May 19, 2015

Test build #33060 has finished for PR 6256 at commit e92dff2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33060/
Test PASSed.

@SparkQA
Copy link

SparkQA commented May 19, 2015

Test build #33061 has finished for PR 6256 at commit f7fe143.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33061/
Test PASSed.

@liancheng
Copy link
Contributor

Used the same micro benchmark configuration and code in #6225. Removing the clean() call does help, and leads to a ~6% performance gain.

(Not able to upload jvisualvm screenshots right now, will fill them in later.)

@andrewor14
Copy link
Contributor Author

Hm, 6% seems much lower than what I expected. Maybe it's not super worth it.

@yhuai
Copy link
Contributor

yhuai commented May 19, 2015

@andrewor14 Cheng's test did not include my pr (#6252). I just tested this one and #6252. The compilation time is down from 12.6s to 8.3s. My test table has 5000 partitions.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 19, 2015

Test build #33092 has started for PR 6256 at commit 523f042.

@andrewor14
Copy link
Contributor Author

I see. From @liancheng's PR description at #6225, the 6% speed up is from the 53s, which is roughly 3.2s. This is more or less consistent with what @yhuai observed after applying this patch to his PR, where the speed up is roughly 4.3s.

In other words the real speed up is about 34%!

@andrewor14
Copy link
Contributor Author

After some more offline discussion with @yhuai we decided to remove the unnecessary calls to Utils.getCallSite as well, which collectively took up to another 4s in his 12.6s. If we skip all of these calls for internal RDDs we may notice another significant speedup.

@SparkQA
Copy link

SparkQA commented May 19, 2015

Test build #33092 has finished for PR 6256 at commit 523f042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33092/
Test PASSed.

Andrew Or added 2 commits May 20, 2015 11:34
…peed-up

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 20, 2015

Test build #33159 has started for PR 6256 at commit 10f7e3e.

@SparkQA
Copy link

SparkQA commented May 20, 2015

Test build #33159 has finished for PR 6256 at commit 10f7e3e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33159/
Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 20, 2015

Test build #33161 has started for PR 6256 at commit a82b451.

@SparkQA
Copy link

SparkQA commented May 20, 2015

Test build #33161 has finished for PR 6256 at commit a82b451.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MultilabelMetrics(JavaModelWrapper):

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33161/
Test PASSed.

@yhuai
Copy link
Contributor

yhuai commented May 21, 2015

LGTM. I am going to merge it to master and branch 1.4.

asfgit pushed a commit that referenced this pull request May 21, 2015
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

Author: Andrew Or <[email protected]>

Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning

(cherry picked from commit 5287eec)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 5287eec May 21, 2015
@andrewor14 andrewor14 deleted the sql-partition-speed-up branch May 21, 2015 22:02
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

Author: Andrew Or <[email protected]>

Closes apache#6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

Author: Andrew Or <[email protected]>

Closes apache#6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

Author: Andrew Or <[email protected]>

Closes apache#6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants