-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance for arithmetic kernels with simd
feature enabled (except for division/modulo)
#1221
Conversation
…ization generates better code
@@ -153,6 +153,7 @@ impl Buffer { | |||
/// | |||
/// Note that this should be used cautiously, and the returned pointer should not be | |||
/// stored anywhere, to avoid dangling pointers. | |||
#[inline] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This did not get inlined in one of my benchmarks, which is weird for such a short method. I think it only gets called by a chain of methods marked as inline
, so maybe that confused the compiler.
Codecov Report
@@ Coverage Diff @@
## master #1221 +/- ##
==========================================
- Coverage 82.71% 82.70% -0.01%
==========================================
Files 175 175
Lines 51711 51711
==========================================
- Hits 42771 42770 -1
- Misses 8940 8941 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jhorstmann ! I have always wondered how much faster the simd
specific kernels actually were (given LLVM's ability to autovectorize)
Can you please provide the benchmark results (and also ideally how you ran them)?
I would like to run them on a hosted server machine
Hi @alamb, the benchmark results and my analysis are in the issue. For the removed kernels, the auto-vectorized version was about 40% faster on my laptop. The command was
(and the same without the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code of this PR looks great ❤️ (deleting code quite nice)
I compared the performance on a "cascade lake" class server machine with the following commands:
# Checkout only the changes in benchmark code
git chekcout bc751794b7a82bb2639b63d8065dfb25a951f458
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features simd --bench arithmetic_kernels
Then I ran the results on the code in this PR
# Check out the tip of this branch
git checkout ac6a599873e0547c0932726562f87ec210595cab
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features simd --bench arithmetic_kernels
My results are consistent with @jhorstmann's :tada
add time: [11.158 us 11.194 us 11.239 us]
change: [-46.244% -46.094% -45.905%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
1 (1.00%) high mild
3 (3.00%) high severe
subtract time: [12.028 us 12.052 us 12.087 us]
change: [-23.628% -23.089% -22.472%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) high mild
9 (9.00%) high severe
multiply time: [12.048 us 12.066 us 12.087 us]
change: [-25.195% -24.787% -24.365%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
divide time: [22.329 us 22.353 us 22.379 us]
change: [+24.502% +24.980% +25.378%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
divide_unchecked time: [14.145 us 14.150 us 14.156 us]
change: [-19.806% -19.416% -19.086%] (p = 0.00 < 0.05)
Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
22 (22.00%) high mild
2 (2.00%) high severe
divide_scalar time: [11.353 us 11.357 us 11.362 us]
change: [-33.601% -33.478% -33.317%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
modulo time: [377.00 us 377.28 us 377.57 us]
change: [+1.2601% +1.6124% +1.9082%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
Benchmarking modulo_scalar: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
modulo_scalar time: [1.4036 ms 1.4043 ms 1.4052 ms]
change: [-14.915% -14.615% -14.243%] (p = 0.00 < 0.05)
Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
5 (5.00%) high mild
12 (12.00%) high severe
add_nulls time: [11.334 us 11.381 us 11.454 us]
change: [-25.846% -25.633% -25.348%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
5 (5.00%) high severe
divide_nulls time: [19.081 us 19.109 us 19.135 us]
change: [-1.4613% -0.9700% -0.5291%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 27 outliers among 100 measurements (27.00%)
11 (11.00%) low severe
1 (1.00%) low mild
13 (13.00%) high mild
2 (2.00%) high severe
divide_nulls_unchecked time: [13.564 us 13.569 us 13.575 us]
change: [-24.266% -23.988% -23.792%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
divide_scalar_nulls time: [11.252 us 11.257 us 11.264 us]
change: [-33.624% -33.527% -33.367%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) high mild
5 (5.00%) high severe
modulo_nulls time: [678.82 us 679.17 us 679.60 us]
change: [-0.7578% -0.4909% -0.2869%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
Benchmarking modulo_scalar_nulls: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.1s, enable flat sampling, or reduce sample count to 60.
modulo_scalar_nulls time: [1.0158 ms 1.0162 ms 1.0167 ms]
change: [-15.660% -14.722% -14.057%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
Thanks @jhorstmann -- this is a really nice find and piece of work |
simd
feature enabled (except for division/modulo)
Which issue does this PR close?
Removes explicit simd arithmetic kernels, except for division/modulo, as compiler autovectorization actually generates better code. Also adds a new
divide_unchecked
kernel for floating point division, which does not need to return an error for division by zero. The arithmetic benchmarks are changed to use a bigger batch size because otherwise it measured mostly allocation and branch overhead. Benchmark results on my laptopCloses #1182.
Rationale for this change
Simplifies the code and increases the performance up to 2x when using the
simd
feature of this crate.rustc
andllvm
are quite good at generating optimized simd code.What changes are included in this PR?
Are there any user-facing changes?
No.
ArrowFloatNumericType::pow
is now unused though and could be removed. The trait should be kept since some kernels might make only sense on floating point data.