-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL. #14310
Conversation
This is an undocumented parameter that controls how cluster-by statistics are merged. In PARALLEL mode, statistics are gathered from workers all at once. In SEQUENTIAL mode, statistics are gathered time chunk by time chunk. This improves accuracy for jobs with many time chunks, and reduces memory usage. The main downside of SEQUENTIAL is that it can take longer, but in most situations I've seen, PARALLEL is only really usable in cases where the sketches are small enough that SEQUENTIAL would also run relatively quickly. So it seems like SEQUENTIAL is a better default.
Segments cuts for sequential would be atleast equal to segment cuts in parallel mode. I have seen cases where the job runs 30% faster when changing modes from parallel to sequential when number of workers was 1000. |
We should also change
This can be done in another PR as well |
Ah, I'll do it in this patch, since it needs updates anyway due to one of the test cases failing. Currently, the test case |
@gianm Quite possible we missed this case since we throw InsertCannotByEmptyFault only after getting the partition boundaries.
The fix would be here WorkerSketcherFetcher#235 . Need to check if CompleteKeyStatisticsInformation is empty. |
TY, I added a block there that registers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes @gianm
…#14310) * MSQ: Change default clusterStatisticsMergeMode to SEQUENTIAL. This is an undocumented parameter that controls how cluster-by statistics are merged. In PARALLEL mode, statistics are gathered from workers all at once. In SEQUENTIAL mode, statistics are gathered time chunk by time chunk. This improves accuracy for jobs with many time chunks, and reduces memory usage. The main downside of SEQUENTIAL is that it can take longer, but in most situations I've seen, PARALLEL is only really usable in cases where the sketches are small enough that SEQUENTIAL would also run relatively quickly. So it seems like SEQUENTIAL is a better default. * Switch off-test from SEQUENTIAL to PARALLEL. * Fix sequential merge for situations where there are no time chunks at all. * Add a couple more tests.
This is an undocumented parameter that controls how cluster-by statistics are merged. In PARALLEL mode, statistics are gathered from workers all at once. In SEQUENTIAL mode, statistics are gathered time chunk by time chunk. This improves accuracy for jobs with many time chunks, and reduces memory usage.
The main downside of SEQUENTIAL is that it can take longer, but in most situations I've seen, PARALLEL is only really usable in cases where the sketches are small enough that SEQUENTIAL would also run relatively quickly. So it seems like SEQUENTIAL is a better default.