A question about the RDD join in Chapter 3 #23

Open
hangjianglaoweng opened this issue Apr 9, 2021 · 2 comments

Comments

@hangjianglaoweng

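Is the RDD join the same as the Spark SQL join?
Some material I have read says: "Spark SQL currently supports three join algorithms: Shuffle Hash Join, Broadcast Hash Join, and Sort Merge Join." One example is this article:
https://segmentfault.com/a/1190000021033287
The joins it describes differ substantially from the RDD join explained in the book. Are the RDD join and the Spark SQL join different implementations? And if they are the same implementation, is the article wrong?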

@JerryLead
Owner

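@hangjianglaoweng This book mainly covers joins built on the RDD interface. The join it describes is similar to the Shuffle Hash Join in Spark SQL. Spark SQL targets high-performance SQL query analysis, so it applies many optimizations to how SQL statements are executed. Broadcast Hash Join and Sort Merge Join are two of those optimizations, and they run more efficiently in certain special join scenarios. For an analysis of the Spark SQL internals, you can read the book 《SparkSQL内核剖析》, written by a junior colleague from my lab.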
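
To make the three strategies named in this thread concrete, here is a minimal plain-Python sketch of each one. It is not Spark's code: the function names, the `num_partitions` parameter, and the in-memory lists of `(key, value)` pairs are all assumptions made for illustration, and the "shuffle" is only simulated by bucketing records by key hash. All three produce the same inner-join result and differ only in how matching records are brought together.

```python
from collections import defaultdict


def shuffle_hash_join(left, right, num_partitions=4):
    """Bucket both sides by key hash (a stand-in for the shuffle),
    then build a hash table per bucket and probe it."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for k, v in left:
        left_parts[hash(k) % num_partitions].append((k, v))
    for k, w in right:
        right_parts[hash(k) % num_partitions].append((k, w))

    out = []
    for p in range(num_partitions):
        # Build side: hash table over this bucket's right-hand records.
        table = defaultdict(list)
        for k, w in right_parts[p]:
            table[k].append(w)
        # Probe side: look up each left-hand record in the table.
        for k, v in left_parts[p]:
            for w in table.get(k, []):
                out.append((k, (v, w)))
    return out


def broadcast_hash_join(large, small):
    """No bucketing of the large side: the small side becomes one hash
    table that every task would receive a full copy of."""
    table = defaultdict(list)
    for k, w in small:
        table[k].append(w)
    return [(k, (v, w)) for k, v in large for w in table.get(k, [])]


def sort_merge_join(left, right):
    """Sort both sides by key, then advance two cursors in lockstep."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, v in left[i:i_end]:
                for _, w in right[j:j_end]:
                    out.append((lk, (v, w)))
            i, j = i_end, j_end
    return out
```

In this sketch the broadcast variant avoids redistributing the large side, which is why it only suits cases where one side is small enough to copy everywhere, and the sort-merge variant never needs a whole bucket's hash table in memory at once.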

@hangjianglaoweng
Author

@JerryLead OK, thanks for the answer.
