-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generify SortPreservingMerge (#5882) (#5879) #5886
Conversation
/// Will then drop any batches for which all rows have been yielded to the output | ||
/// | ||
/// Returns `None` if no pending rows | ||
pub fn build_record_batch(&mut self) -> Result<Option<RecordBatch>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is moved pretty much verbatim from sort_preserving_merge.rs
type CursorStream<C> = Box<dyn PartitionedStream<Output = Result<(C, RecordBatch)>>>; | ||
|
||
#[derive(Debug)] | ||
struct SortPreservingMergeStream<C> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ported from sort_preserving_merge.rs but tweaked heavily to make it slightly easier to follow (hopefully)
|
||
pub use cursor::SortKeyCursor; | ||
pub use index::RowIndex; | ||
|
||
pub(crate) struct SortedStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This abstraction was a tad pointless, it was only used by ExternalSorter
to propagate the size of the in-memory sorted batches. I adjusted it to just call init_mem_used itself, and created #5885 to track the broader pre-existing issue of memory accounting within merge streams
I have confirmed this has no discernible impact on the existing benchmarks, nor the benchmarks added in #5881 |
cc @yjshen and @jaylmiller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tustvold -- this looks like an improvement to me. I plan to run the sort benchmarks against this branch as well given how important sorting is to so many usecases.
I think we should leave this one open for a few days to allow others who might be interested to comment
cc @Dandandan in case you know of others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the performance benchmarks on this branch (2023-04-06-sorting.txt) and they confirm what @tustvold reported (no descernable change)
5f7a3d6#diff-95c601f2909c85052924354d8d161a9f7bf539c47b59abb1d525a6d217b4e402R60-R61 shows how this can be used to specialize merge for a single primitive column |
Which issue does this PR close?
Part of #5882
Relates to #5879
Rationale for this change
In order to be able to special case cursors (#5882), we need to first decouple SortPreservingMerge from SortKeyCursor.
What changes are included in this PR?
Cursor
and accept a type-erasedCursorStream
Are these changes tested?
Are there any user-facing changes?
No