-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] Memory Limited Sort (Externalized / Spill) #1568
Comments
8 tasks
cc @yjshen here are some thoughts on the next steps for sorting -- what do you think? Are you planning to do any/all of this? Is there something in particular I could help the most with? |
4 tasks
5 tasks
Add #1611 to tracking spills in sort metrics. |
I think this issue is mostly done. We can track additional work items as follow on epics, etc |
alamb
changed the title
Memory Limited Sort (Externalized / Spill)
[EPIC] Memory Limited Sort (Externalized / Spill)
Nov 28, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support Sorting "arbitrarily" large inputs (that don't fit in RAM or within some user defined budget)
This ticket concerns the memory potentially used the Sort operator -- it doesn't cover other potential targets (e.g. sort based aggregation, for example). That will be covered by other tasks tracked by #587
Describe the solution you'd like
N-way Merge of Sorted Streams
to build on top ofFor the Sort operator, I think the best behavior would be:
SortExec
), if the memory budget allows@yjshen 's PR #1526 has most of the individual pieces of this logic, but they aren't hooked up yet (that is if you run
SELECT .. FROM foo ORDER BY x
) it will still exhaust memory.Some ideas of how to break this task down
SortExec
code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571SortPreservingMergeStream
(which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572spill_count
as well asspill_bytes
in sort metrics #1611Describe alternatives you've considered
Use two implementations of Sort (one in memory and one that can spill). This would ensure no performance regressions for the non-spilling case but would require users to decide which one to use.
Context
This is follow on work from the great PR from @yjshen in #1526 and part of the story of limiting memory used by DataFusion #587
The text was updated successfully, but these errors were encountered: