Chapter 6 "Shuffle Mechanism": Errata and Suggested Revisions #7
@hangjianglaoweng Thanks for pointing this out. At first glance, the current code implementation appears to use linear probing. However, the comment in AppendOnlyMap says it uses quadratic probing ("This implementation uses quadratic probing with a power-of-2 hash table size, which is guaranteed to explore all spaces for each key", see http://en.wikipedia.org/wiki/Quadratic_probing). If you want to dig deeper, you could open an issue with Spark and ask whether the community actually intends quadratic or linear probing.
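For readers who want to see the mechanics, here is a minimal, self-contained sketch of an AppendOnlyMap-style open hash table: power-of-2 capacity, keys and values stored side by side in one array, a probe step that grows by 1 on each collision (cumulative offsets 1, 3, 6, 10, ..., i.e. quadratic probing via triangular numbers), and doubling growth. The class name, load factor, and hash mixing are illustrative choices, not Spark's actual implementation.

```scala
import scala.util.hashing.byteswap32

// Sketch only: an AppendOnlyMap-like table, not Spark's real code.
class ProbingMapSketch[K, V](initialCapacity: Int = 64) {
  private var capacity = initialCapacity              // must be a power of 2
  private var mask = capacity - 1
  private var size = 0
  private var data = new Array[AnyRef](2 * capacity)  // key at 2*pos, value at 2*pos+1

  def update(key: K, value: V): Unit = {
    var pos = byteswap32(key.hashCode) & mask
    var delta = 1
    while (data(2 * pos) != null && data(2 * pos) != key) {
      pos = (pos + delta) & mask                       // probe the next slot
      delta += 1                                       // step grows by 1 each time
    }
    if (data(2 * pos) == null) size += 1
    data(2 * pos) = key.asInstanceOf[AnyRef]
    data(2 * pos + 1) = value.asInstanceOf[AnyRef]
    if (size > capacity * 0.7) growTable()             // keep the load factor bounded
  }

  def get(key: K): Option[V] = {
    var pos = byteswap32(key.hashCode) & mask
    var delta = 1
    while (data(2 * pos) != null) {
      if (data(2 * pos) == key) return Some(data(2 * pos + 1).asInstanceOf[V])
      pos = (pos + delta) & mask
      delta += 1
    }
    None
  }

  // "If it cannot hold the data, double the capacity": allocate a table of
  // twice the size and re-insert every existing (key, value) pair.
  private def growTable(): Unit = {
    val oldData = data
    capacity *= 2
    mask = capacity - 1
    data = new Array[AnyRef](2 * capacity)
    size = 0
    var i = 0
    while (i < oldData.length / 2) {
      if (oldData(2 * i) != null) {
        update(oldData(2 * i).asInstanceOf[K], oldData(2 * i + 1).asInstanceOf[V])
      }
      i += 1
    }
  }
}
```

With a power-of-2 table, this triangular-increment probe sequence is guaranteed to visit every slot, which is the property the AppendOnlyMap comment refers to.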
Question 1: the Array growth mechanism ==> Lijie: "If the Array cannot hold the data, it is expanded first" means that when it cannot hold the data, the array is expanded to twice its original size, after which the new record can be stored.

Question 2: the Map growth mechanism.

Question 3: map-side combine() is needed, with or without sorting by Key => Lijie: the book says "if sorting by Key is not required, as shown in the upper part of Figure 6.7, records are sorted only by partitionId". This is indeed valid: when sorting by key is not required, there is no real need to sort by key/hash(Key). For example, if no spill is needed, we can combine directly in memory, sort by partition id, and output the data. If a spill is needed, we can sort by key/hash(key) during the spill to make the later merge easier. The English comment also says "possibly second by key or hash key", i.e. after sorting by partition id the records may additionally be sorted by key/hash(key), but not necessarily.
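To make the last point concrete, here is a small sketch of the two orderings being discussed: sorting by partition id only (when no key ordering is needed) versus sorting by partition id and then by key hash (useful when spilled runs must later be merge-combined). It is modeled loosely on the idea behind ExternalSorter's comparators; the object and method names here are illustrative, not Spark's API.

```scala
// Illustrative comparators for ((partitionId, key), value) records.
// Sketches only, not Spark's actual code.
object ShuffleOrderings {

  // Order by partition id only: enough when no key ordering is required
  // (e.g. combine in memory, then emit records grouped by partition).
  def partitionComparator[K]: Ordering[(Int, K)] = new Ordering[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int =
      Integer.compare(a._1, b._1)
  }

  // Order by partition id, then by key hash: useful when spilled files must
  // later be merge-combined, so that equal keys end up next to each other.
  def partitionKeyHashComparator[K]: Ordering[(Int, K)] = new Ordering[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val byPartition = Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition
      else Integer.compare(a._2.hashCode(), b._2.hashCode())
    }
  }
}

// Usage: sort a small in-memory buffer of ((partitionId, key), value) pairs
// both ways and compare the results.
object ShuffleOrderingsDemo extends App {
  val buffer = Seq(((1, "b"), 10), ((0, "c"), 20), ((1, "a"), 30))
  println(buffer.sortBy(_._1)(ShuffleOrderings.partitionComparator[String]))
  println(buffer.sortBy(_._1)(ShuffleOrderings.partitionKeyHashComparator[String]))
}
```

Note that the second comparator orders by the key's hash rather than by the key itself, which is sufficient for merging combiners but does not produce a total key order.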
@JerryLead Thanks, Lijie, for the reply. Points 1 and 2 were gaps in my understanding. As for point 3, I checked the code to confirm it.
Regarding the third point,
OK, understood, thanks!
See SPARK-4690: the code uses quadratic probing.