Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split out arrow-ord (#2594) #3299

Merged
merged 5 commits into from
Dec 8, 2022
Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Dec 8, 2022

Which issue does this PR close?

Part of #2594

Rationale for this change

What changes are included in this PR?

  • Makes LexicographicalComparator public and tweaks its interface slightly
  • Moves ArrowNumericType to arrow-array
  • Moves ordering kernels to arrow-ord
  • Moves some tests left over from Split out arrow-string (#2594) #3295

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 8, 2022
- arrow-ord/**
- arrow-schema/**
- arrow-select/**
- arrow-string/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to add this in #3295

@tustvold tustvold marked this pull request as draft December 8, 2022 17:51
@tustvold tustvold force-pushed the split-out-arrow-ord branch from bbb07b7 to e844da8 Compare December 8, 2022 17:55
@@ -36,7 +36,6 @@ on:
- arrow-ipc/**
- arrow-csv/**
- arrow-json/**
- arrow-string/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet doesn't actually depend on any of these kernels, and so this can be removed

@tustvold tustvold force-pushed the split-out-arrow-ord branch from db961ea to 540af48 Compare December 8, 2022 17:58
@@ -35,7 +35,6 @@ on:
- arrow-ipc/**
- arrow-schema/**
- arrow-select/**
- arrow-string/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow-flight doesn't depend on this crate and so this can be removed

@@ -3804,44 +3860,6 @@ mod tests {
gt_eq_utf8_scalar,
vec![false, false, true, true]
);
test_flag_utf8!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are moved to arrow-string

@@ -3881,8 +3899,7 @@ mod tests {
);
assert_eq!(eq_dyn_scalar(&array, 8).unwrap(), expected);

let array: ArrayRef = Arc::new(array);
let array = crate::compute::cast(&array, &DataType::Float64).unwrap();
let array = array.unary::<_, Float64Type>(|x| x as f64);
Copy link
Contributor Author

@tustvold tustvold Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me happy that it is now really easy to define your own kernels. This avoids needing a dev dependency on arrow-cast

@@ -174,8 +180,8 @@ jobs:
run: cargo clippy -p arrow-data --all-targets --all-features -- -D warnings
- name: Clippy arrow-schema with all features
run: cargo clippy -p arrow-schema --all-targets --all-features -- -D warnings
- name: Clippy arrow-array with all features
run: cargo clippy -p arrow-array --all-targets --all-features -- -D warnings
- name: Clippy arrow-array with all features except SIMD
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created rust-lang/cargo#11467 to track making this less error prone

@tustvold tustvold marked this pull request as ready for review December 8, 2022 18:19
@alamb alamb added api-change Changes to the arrow API and removed api-change Changes to the arrow API labels Dec 8, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- again I think it is worth being overly conservative and running benchmarks again to ensure we didn't cause any performance regressions accidentally

[package]
name = "arrow-ord"
version = "28.0.0"
description = "Ordering kernels for arrow arrays"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would "comparison kernels" be a more accurate description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are sort kernels in there, so it is kind of kernels that relate to the ordering of elements? Maybe? I don't really feel strongly, was just trying to justify why it is arrow-ord and not arrow-cmp or something 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine 👍

@tustvold
Copy link
Contributor Author

tustvold commented Dec 8, 2022

Benchmarks just show noise

eq Float32              time:   [9.1657 µs 9.1710 µs 9.1772 µs]
                        change: [-0.9911% -0.7119% -0.4195%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  9 (9.00%) high severe

eq scalar Float32       time:   [7.2559 µs 7.2576 µs 7.2595 µs]
                        change: [-0.3298% -0.1205% +0.0054%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

neq Float32             time:   [9.1818 µs 9.1866 µs 9.1919 µs]
                        change: [-0.6647% -0.5687% -0.4811%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

neq scalar Float32      time:   [7.3054 µs 7.3089 µs 7.3127 µs]
                        change: [+0.0288% +0.2996% +0.6043%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

lt Float32              time:   [16.261 µs 16.279 µs 16.297 µs]
                        change: [-0.8641% -0.5817% -0.2890%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

lt scalar Float32       time:   [10.497 µs 10.505 µs 10.513 µs]
                        change: [+0.0614% +0.2876% +0.4345%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

lt_eq Float32           time:   [16.305 µs 16.317 µs 16.327 µs]
                        change: [-0.8036% -0.7390% -0.6749%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [10.500 µs 10.503 µs 10.507 µs]
                        change: [+0.2467% +0.2874% +0.3253%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [16.283 µs 16.290 µs 16.297 µs]
                        change: [-0.8649% -0.7883% -0.7155%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

gt scalar Float32       time:   [10.489 µs 10.496 µs 10.502 µs]
                        change: [-0.2633% -0.0192% +0.1379%] (p = 0.89 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [16.319 µs 16.328 µs 16.340 µs]
                        change: [-0.4586% -0.2976% -0.0387%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

gt_eq scalar Float32    time:   [10.494 µs 10.500 µs 10.507 µs]
                        change: [-68.313% -64.521% -59.638%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  8 (8.00%) low severe
  6 (6.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

eq MonthDayNano         time:   [74.482 µs 76.306 µs 78.269 µs]
                        change: [-3.9949% -1.5024% +0.8764%] (p = 0.23 > 0.05)
                        No change in performance detected.

eq scalar MonthDayNano  time:   [55.047 µs 55.123 µs 55.194 µs]
                        change: [+0.7321% +0.9766% +1.1585%] (p = 0.00 < 0.05)
                        Change within noise threshold.

like_utf8 scalar equals time:   [261.41 µs 261.48 µs 261.56 µs]
                        change: [-3.3439% -3.2519% -3.1792%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

like_utf8 scalar contains
                        time:   [2.2575 ms 2.2582 ms 2.2590 ms]
                        change: [+4.1798% +4.2351% +4.2920%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

like_utf8 scalar ends with
                        time:   [263.60 µs 263.73 µs 263.87 µs]
                        change: [-2.4051% -2.2131% -2.0984%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

like_utf8 scalar starts with
                        time:   [263.06 µs 263.12 µs 263.19 µs]
                        change: [-7.0292% -6.9977% -6.9663%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking like_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, enable flat sampling, or reduce sample count to 60.
like_utf8 scalar complex
                        time:   [1.2523 ms 1.2529 ms 1.2535 ms]
                        change: [+1.1527% +1.3934% +1.6415%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

nlike_utf8 scalar equals
                        time:   [261.63 µs 261.71 µs 261.81 µs]
                        change: [-3.4367% -3.2447% -3.1349%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

nlike_utf8 scalar contains
                        time:   [2.2397 ms 2.2405 ms 2.2412 ms]
                        change: [+5.2539% +5.3153% +5.3741%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

nlike_utf8 scalar ends with
                        time:   [276.56 µs 276.62 µs 276.69 µs]
                        change: [+1.4265% +1.4781% +1.5193%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

nlike_utf8 scalar starts with
                        time:   [276.63 µs 276.68 µs 276.73 µs]
                        change: [-2.0428% -1.8609% -1.7543%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

Benchmarking nlike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, enable flat sampling, or reduce sample count to 60.
nlike_utf8 scalar complex
                        time:   [1.2530 ms 1.2536 ms 1.2542 ms]
                        change: [+2.0319% +2.1878% +2.4058%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

ilike_utf8 scalar equals
                        time:   [2.3541 ms 2.3551 ms 2.3562 ms]
                        change: [+0.6097% +0.6557% +0.7048%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

ilike_utf8 scalar contains
                        time:   [4.2709 ms 4.2724 ms 4.2739 ms]
                        change: [+2.9535% +3.0047% +3.0547%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

ilike_utf8 scalar ends with
                        time:   [2.3666 ms 2.3678 ms 2.3691 ms]
                        change: [+1.7432% +1.8138% +1.8860%] (p = 0.00 < 0.05)
                        Performance has regressed.

ilike_utf8 scalar starts with
                        time:   [2.3430 ms 2.3435 ms 2.3441 ms]
                        change: [-0.0954% -0.0564% -0.0185%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

Benchmarking ilike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
ilike_utf8 scalar complex
                        time:   [1.8554 ms 1.8562 ms 1.8571 ms]
                        change: [-2.8581% -2.8009% -2.7452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

nilike_utf8 scalar equals
                        time:   [2.3854 ms 2.3862 ms 2.3872 ms]
                        change: [+0.2938% +0.3440% +0.3950%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

nilike_utf8 scalar contains
                        time:   [4.3494 ms 4.3504 ms 4.3516 ms]
                        change: [+2.2445% +2.2771% +2.3107%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

nilike_utf8 scalar ends with
                        time:   [2.3997 ms 2.4004 ms 2.4011 ms]
                        change: [+1.0415% +1.0851% +1.1278%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

nilike_utf8 scalar starts with
                        time:   [2.3974 ms 2.3979 ms 2.3983 ms]
                        change: [+1.0685% +1.1003% +1.1324%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

Benchmarking nilike_utf8 scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 50.
nilike_utf8 scalar complex
                        time:   [1.9500 ms 1.9509 ms 1.9519 ms]
                        change: [+1.2983% +1.4470% +1.5487%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking egexp_matches_utf8 scalar starts with: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
egexp_matches_utf8 scalar starts with
                        time:   [1.2694 ms 1.2701 ms 1.2709 ms]
                        change: [+1.1513% +1.3886% +1.5364%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

Benchmarking egexp_matches_utf8 scalar ends with: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
egexp_matches_utf8 scalar ends with
                        time:   [1.2245 ms 1.2249 ms 1.2254 ms]
                        change: [-0.2995% -0.0241% +0.2666%] (p = 0.87 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

dict eq string          time:   [367.35 µs 367.60 µs 367.87 µs]
                        change: [+0.1617% +0.4220% +0.6674%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) high mild
  4 (4.00%) high severe

@tustvold
Copy link
Contributor Author

tustvold commented Dec 8, 2022

Sort kernels also show just noise

sort 2^10               time:   [104.00 µs 104.03 µs 104.06 µs]
                        change: [-1.2425% -1.1335% -0.9317%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

sort 2^12               time:   [483.66 µs 483.82 µs 484.01 µs]
                        change: [-1.5556% -1.4271% -1.2339%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

sort nulls 2^10         time:   [94.902 µs 94.929 µs 94.963 µs]
                        change: [-2.0767% -1.9419% -1.7205%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

sort nulls 2^12         time:   [433.50 µs 434.22 µs 435.22 µs]
                        change: [+5.1418% +6.5300% +8.1729%] (p = 0.00 < 0.05)
                        Performance has regressed.

bool sort 2^12          time:   [227.31 µs 227.36 µs 227.42 µs]
                        change: [-4.8764% -4.5966% -4.3364%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

bool sort nulls 2^12    time:   [254.96 µs 255.01 µs 255.07 µs]
                        change: [-1.0141% -0.8698% -0.6497%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

dict string 2^12        time:   [37.305 µs 37.338 µs 37.378 µs]
                        change: [+0.9199% +1.1477% +1.4259%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

sort 2^12 limit 10      time:   [60.756 µs 60.768 µs 60.783 µs]
                        change: [+1.9449% +2.2653% +2.5586%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

sort 2^12 limit 100     time:   [68.230 µs 68.263 µs 68.300 µs]
                        change: [-1.3582% -1.0502% -0.7583%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

sort 2^12 limit 1000    time:   [185.30 µs 185.34 µs 185.40 µs]
                        change: [-0.0894% +0.0393% +0.2371%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

sort 2^12 limit 2^12    time:   [486.93 µs 487.04 µs 487.17 µs]
                        change: [-1.1167% -0.9951% -0.8251%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

sort nulls 2^12 limit 10
                        time:   [118.80 µs 118.85 µs 118.90 µs]
                        change: [-1.1975% -0.9068% -0.6244%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

sort nulls 2^12 limit 100
                        time:   [120.90 µs 120.96 µs 121.01 µs]
                        change: [-1.2508% -0.9623% -0.6897%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

sort nulls 2^12 limit 1000
                        time:   [137.32 µs 137.37 µs 137.42 µs]
                        change: [-0.8628% -0.8102% -0.7603%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

sort nulls 2^12 limit 2^12
                        time:   [421.48 µs 421.59 µs 421.71 µs]
                        change: [-0.4674% -0.2091% +0.0443%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

@tustvold tustvold merged commit 7b3e94f into apache:master Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants