-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning #6256
Conversation
Also cc/ @pwendell |
Merged build triggered. |
Merged build started. |
Test build #33060 has started for PR 6256 at commit |
@yhuai can you try this and see if you can notice the speedup? (Does the closure get serialized correctly?) |
@andrewor14 I can do a local microbenchmark for this. |
e92dff2
to
f7fe143
Compare
Merged build triggered. |
Merged build started. |
Test build #33061 has started for PR 6256 at commit |
Test build #33060 has finished for PR 6256 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Test build #33061 has finished for PR 6256 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Used the same micro benchmark configuration and code in #6225. Removing the (Not able to upload |
Hm, 6% seems much lower than what I expected. Maybe it's not super worth it. |
@andrewor14 Cheng's test did not include my pr (#6252). I just tested this one and #6252. The compilation time is down from 12.6s to 8.3s. My test table has 5000 partitions. |
Merged build triggered. |
Merged build started. |
Test build #33092 has started for PR 6256 at commit |
I see. From @liancheng's PR description at #6225, the 6% speed up is from the In other words the real speed up is about 34%! |
After some more offline discussion with @yhuai we decided to remove the unnecessary calls to |
Test build #33092 has finished for PR 6256 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
…peed-up Conflicts: sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala
Merged build triggered. |
Merged build started. |
Test build #33159 has started for PR 6256 at commit |
Test build #33159 has finished for PR 6256 at commit
|
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build triggered. |
Merged build started. |
Test build #33161 has started for PR 6256 at commit |
Test build #33161 has finished for PR 6256 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
LGTM. I am going to merge it to master and branch 1.4. |
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning. Author: Andrew Or <[email protected]> Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits: a82b451 [Andrew Or] Fix style 10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures 17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up 523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too f7fe143 [Andrew Or] Avoid unnecessary closure cleaning (cherry picked from commit 5287eec) Signed-off-by: Yin Huai <[email protected]>
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning. Author: Andrew Or <[email protected]> Closes apache#6256 from andrewor14/sql-partition-speed-up and squashes the following commits: a82b451 [Andrew Or] Fix style 10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures 17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up 523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning. Author: Andrew Or <[email protected]> Closes apache#6256 from andrewor14/sql-partition-speed-up and squashes the following commits: a82b451 [Andrew Or] Fix style 10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures 17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up 523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning. Author: Andrew Or <[email protected]> Closes apache#6256 from andrewor14/sql-partition-speed-up and squashes the following commits: a82b451 [Andrew Or] Fix style 10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures 17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up 523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
According to @yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.