Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move SortKeyCursor and RowIndex into modules, add sort_key_cursor test #2645

Merged
merged 7 commits into from
May 31, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 29, 2022

Which issue does this PR close?

re : #2427

Rationale for this change

SortKeyCursor::compare is the bottleneck for merge performance (details on #2427 (comment))

As I prepare to improve the performance of this code (and make it more complicated), I want it more isolated and independent so I can:

  1. Understand the code better
  2. Validate my approach
  3. Ensure I don't introduce performance regressions

What changes are included in this PR?

  1. Move SortKeyCursor into its own module
  2. Move RowIndex into its own module
  3. add sort_key_cursor test

Are there any user-facing changes?

No

Does this PR break compatibility with Ballista?

no

@github-actions github-actions bot added core Core DataFusion crate datafusion Changes in the datafusion crate labels May 29, 2022
use std::cmp::Ordering;
use std::sync::Arc;

/// A `SortKeyCursor` is created from a `RecordBatch`, and a set of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is moved, and I added stream_id() and batch_idx() accessors

@@ -0,0 +1,42 @@
/// A `RowIndex` identifies a specific row in a logical stream.
///
/// Each stream is identified by an `stream_idx` and is formed from a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is moved into a new module, and I added some comments / ascii art to help my future self not have to read the code the next time

pub mod sort;
pub mod sort_preserving_merge;

/// A `SortKeyCursor` is created from a `RecordBatch`, and a set of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is just moved

@@ -0,0 +1,135 @@
//! Contains tests for SortKeyCursor
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are new -- I plan to expand them over time

@alamb alamb force-pushed the alamb/sort_key branch from 5f540dc to dbc780a Compare May 29, 2022 12:17
@alamb alamb marked this pull request as ready for review May 29, 2022 12:17
@alamb alamb requested a review from tustvold May 30, 2022 11:32
@alamb
Copy link
Contributor Author

alamb commented May 30, 2022

FYI @yjshen @richox

datafusion/core/src/physical_plan/sorts/index.rs Outdated Show resolved Hide resolved
use datafusion_physical_expr::expressions::col;

/// Compares [`RowIndex`]es with a vector of strings, the result of
/// pretty formatting the [`RowIndex`]es. This is a macro so errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean unwrap() will give you a stack trace... I'm not a massive fan of macros as they are hard to read, and slow to compile...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to function in 0754ee5


/// Runs the two cursors to completion, sorting them, and
/// returning the sorted order of rows that would have produced
fn run(cursor1: &mut SortKeyCursor, cursor2: &mut SortKeyCursor) -> Vec<RowIndex> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code doesn't seem to be being used by SortPreservingMerge, is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, yes -- I want to unit test SoryKeyCursor. SortPreservingMerge already has reasonable test coverage in my mind

let mut cursor2 =
SortKeyCursor::new(2, 0, &batch2, &sort_key, Arc::clone(&sort_options)).unwrap();

let expected = vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you were feeling fancy you could test stability, i.e. ties are broken by the lower stream idx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will do

@alamb alamb merged commit acb245a into apache:master May 31, 2022
@alamb alamb deleted the alamb/sort_key branch August 8, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants