Improve performance for arithmetic kernels with `simd` feature enabled (except for division/modulo) #1221

jhorstmann · 2022-01-22T15:14:14Z

Which issue does this PR close?

Removes explicit simd arithmetic kernels, except for division/modulo, as compiler autovectorization actually generates better code. Also adds a new divide_unchecked kernel for floating point division, which does not need to return an error for division by zero. The arithmetic benchmarks are changed to use a bigger batch size because otherwise it measured mostly allocation and branch overhead. Benchmark results on my laptop

Closes #1182.

Rationale for this change

Simplifies the code and increases the performance up to 2x when using the simd feature of this crate. rustc and llvm are quite good at generating optimized simd code.

What changes are included in this PR?

Are there any user-facing changes?

No. ArrowFloatNumericType::pow is now unused though and could be removed. The trait should be kept since some kernels might make only sense on floating point data.

…ization generates better code

jhorstmann · 2022-01-22T15:16:43Z

arrow/src/buffer/immutable.rs

@@ -153,6 +153,7 @@ impl Buffer {
    ///
    /// Note that this should be used cautiously, and the returned pointer should not be
    /// stored anywhere, to avoid dangling pointers.
+    #[inline]


This did not get inlined in one of my benchmarks, which is weird for such a short method. I think it only gets called by a chain of methods marked as inline, so maybe that confused the compiler.

codecov-commenter · 2022-01-22T15:28:07Z

Codecov Report

Merging #1221 (ac6a599) into master (0377aae) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1221      +/-   ##
==========================================
- Coverage   82.71%   82.70%   -0.01%     
==========================================
  Files         175      175              
  Lines       51711    51711              
==========================================
- Hits        42771    42770       -1     
- Misses       8940     8941       +1

Impacted Files	Coverage Δ
arrow/src/buffer/immutable.rs	`98.92% <ø> (ø)`
arrow/src/compute/kernels/arithmetic.rs	`90.86% <100.00%> (ø)`
arrow/src/datatypes/datatype.rs	`66.38% <0.00%> (-0.43%)`	⬇️
arrow/src/datatypes/field.rs	`53.79% <0.00%> (-0.31%)`	⬇️
parquet_derive/src/parquet_field.rs	`66.43% <0.00%> (+0.22%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0377aae...ac6a599. Read the comment docs.

alamb

Thanks @jhorstmann ! I have always wondered how much faster the simd specific kernels actually were (given LLVM's ability to autovectorize)

Can you please provide the benchmark results (and also ideally how you ran them)?

I would like to run them on a hosted server machine

jhorstmann · 2022-01-23T14:41:55Z

Hi @alamb, the benchmark results and my analysis are in the issue. For the removed kernels, the auto-vectorized version was about 40% faster on my laptop. The command was

RUSTFLAGS="-C target-cpu=skylake" cargo +nightly bench --features simd --bench arithmetic_kernels

(and the same without the simd feature). Results on a server class machine and with target-cpu=skx should be even better.

alamb

The code of this PR looks great ❤️ (deleting code quite nice)

I compared the performance on a "cascade lake" class server machine with the following commands:

# Checkout only the changes in benchmark code
git chekcout bc751794b7a82bb2639b63d8065dfb25a951f458
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features simd --bench arithmetic_kernels

Then I ran the results on the code in this PR

# Check out the tip of this branch
git checkout ac6a599873e0547c0932726562f87ec210595cab
RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features simd --bench arithmetic_kernels

My results are consistent with @jhorstmann's :tada

add                     time:   [11.158 us 11.194 us 11.239 us]                 
                        change: [-46.244% -46.094% -45.905%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

subtract                time:   [12.028 us 12.052 us 12.087 us]                      
                        change: [-23.628% -23.089% -22.472%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

multiply                time:   [12.048 us 12.066 us 12.087 us]                      
                        change: [-25.195% -24.787% -24.365%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

divide                  time:   [22.329 us 22.353 us 22.379 us]                    
                        change: [+24.502% +24.980% +25.378%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

divide_unchecked        time:   [14.145 us 14.150 us 14.156 us]                              
                        change: [-19.806% -19.416% -19.086%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  22 (22.00%) high mild
  2 (2.00%) high severe

divide_scalar           time:   [11.353 us 11.357 us 11.362 us]                           
                        change: [-33.601% -33.478% -33.317%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

modulo                  time:   [377.00 us 377.28 us 377.57 us]                   
                        change: [+1.2601% +1.6124% +1.9082%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

Benchmarking modulo_scalar: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
modulo_scalar           time:   [1.4036 ms 1.4043 ms 1.4052 ms]                           
                        change: [-14.915% -14.615% -14.243%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe

add_nulls               time:   [11.334 us 11.381 us 11.454 us]                       
                        change: [-25.846% -25.633% -25.348%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

divide_nulls            time:   [19.081 us 19.109 us 19.135 us]                          
                        change: [-1.4613% -0.9700% -0.5291%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 27 outliers among 100 measurements (27.00%)
  11 (11.00%) low severe
  1 (1.00%) low mild
  13 (13.00%) high mild
  2 (2.00%) high severe

divide_nulls_unchecked  time:   [13.564 us 13.569 us 13.575 us]                                    
                        change: [-24.266% -23.988% -23.792%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

divide_scalar_nulls     time:   [11.252 us 11.257 us 11.264 us]                                 
                        change: [-33.624% -33.527% -33.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

modulo_nulls            time:   [678.82 us 679.17 us 679.60 us]                         
                        change: [-0.7578% -0.4909% -0.2869%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking modulo_scalar_nulls: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.1s, enable flat sampling, or reduce sample count to 60.
modulo_scalar_nulls     time:   [1.0158 ms 1.0162 ms 1.0167 ms]                                 
                        change: [-15.660% -14.722% -14.057%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

alamb · 2022-01-24T19:50:44Z

Thanks @jhorstmann -- this is a really nice find and piece of work

jhorstmann added 2 commits January 22, 2022 16:07

Extend arithmetic benchmarks

bc75179

Remove explicit simd arithmetic except for div/mod because autovector…

551be27

…ization generates better code

github-actions bot added the arrow Changes to the arrow crate label Jan 22, 2022

jhorstmann commented Jan 22, 2022

View reviewed changes

Remove unneeded return keywords

ac6a599

alamb reviewed Jan 23, 2022

View reviewed changes

alamb added the performance label Jan 24, 2022

alamb approved these changes Jan 24, 2022

View reviewed changes

alamb merged commit 2d6352b into apache:master Jan 24, 2022

alamb changed the title ~~Remove explicit simd arithmetic kernels except for division/modulo~~ Improve performance for arithmetic kernels with simd feature enabled (except for division/modulo) Feb 3, 2022

alamb mentioned this pull request Apr 1, 2022

packed_simd requires nightly #54

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance for arithmetic kernels with `simd` feature enabled (except for division/modulo) #1221

Improve performance for arithmetic kernels with `simd` feature enabled (except for division/modulo) #1221

jhorstmann commented Jan 22, 2022 •

edited by alamb

Loading

jhorstmann Jan 22, 2022

codecov-commenter commented Jan 22, 2022 •

edited

Loading

alamb left a comment

jhorstmann commented Jan 23, 2022

alamb left a comment

alamb commented Jan 24, 2022

Improve performance for arithmetic kernels with simd feature enabled (except for division/modulo) #1221

Improve performance for arithmetic kernels with simd feature enabled (except for division/modulo) #1221

Conversation

jhorstmann commented Jan 22, 2022 • edited by alamb Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

jhorstmann Jan 22, 2022

Choose a reason for hiding this comment

codecov-commenter commented Jan 22, 2022 • edited Loading

Codecov Report

alamb left a comment

Choose a reason for hiding this comment

jhorstmann commented Jan 23, 2022

alamb left a comment

Choose a reason for hiding this comment

alamb commented Jan 24, 2022

Improve performance for arithmetic kernels with `simd` feature enabled (except for division/modulo) #1221

Improve performance for arithmetic kernels with `simd` feature enabled (except for division/modulo) #1221

jhorstmann commented Jan 22, 2022 •

edited by alamb

Loading

codecov-commenter commented Jan 22, 2022 •

edited

Loading