Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the substring kernel by about 2x #1512

Merged
merged 6 commits into from
Apr 7, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Mar 31, 2022

Which issue does this PR close?

Closes #1511.

Rationale for this change

Improve the performance and do some clean-up work.

What changes are included in this PR?

  1. calculate new_offsets first to estimate the memory needed for buffer
  2. clean up the code: remove redundant comparisons.

Benchmark

chip: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

substring               time:   [19.806 ms 19.824 ms 19.844 ms]                      
                        change: [-44.909% -44.829% -44.743%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

About 2X speed-up

Are there any user-facing changes?

No.

Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 31, 2022
@HaoYang670 HaoYang670 changed the title Speed up substring Speed up the substring kernel Mar 31, 2022
offsets
.windows(2)
.map(|pair| (pair[1] + start).max(pair[0]))
.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to do new_starts/new_length in one iteration over offsets and use unzip to materialize into two Vecs?

Copy link
Contributor

@Dandandan Dandandan Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, I think actually materializing new_length to a Vec is not needed. It probably is faster (and saves allocating this Vec) to compute the length again when calculating new_values.

Copy link
Contributor Author

@HaoYang670 HaoYang670 Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove .collect, actually. But you know, closure values in if-else branch can get no two closures, even if identical, have the same type error. And I did not find a good solution except adding .collect
rust-lang/rust#87961

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is not possible in Rust (I think only by Boxing or manually inlining / macros).

Calculating the length in two places with something like let length = *end - *start within both the new_offsets new_values doesn´t seem too bad in terms of duplication?

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is not possible in Rust (I think only by Boxing or manually inlining / macros).

Calculating the length in two places with something like let length = *end - *start within both the new_offsets new_values doesn´t seem too bad in terms of duplication?

One way is to allow the closure not to capture the environment by adding more parameters, and the compiler will downcast closure to function. This works for calculating new_starts, but doesn't work for new_length where we must capture Some(length) = length

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My later comment mentioned not allocating the new vec but not allocating the array at all (by moving the calculation to the iteration in new_values). Is that something you want to try?

Thinking about it, I think actually materializing new_length to a Vec is not needed. It probably is faster (and saves allocating this Vec) to compute the length again when calculating new_values.

let length_i: OffsetSize = offsets[i + 1] - offsets[i];
// compute where we should start slicing this entry
let start = offsets[i]
+ if start >= OffsetSize::zero() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to compare start with zero multiple time. So I move this outside the for loop

let start = start.max(offsets[i]).min(offsets[i + 1]);
// compute the length of the slice
let length: OffsetSize = length
.unwrap_or(length_i)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason! we don't need to unwrap length in each loop. So move this outside the loop.

Comment on lines 88 to 91
.map(|(start, length)| {
(start.to_usize().unwrap(), length.to_usize().unwrap())
})
.map(|(start, length)| &data[start..start + length])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.map(|(start, length)| {
(start.to_usize().unwrap(), length.to_usize().unwrap())
})
.map(|(start, length)| &data[start..start + length])
.map(|(start, length)| {
let start = start.to_usize().unwrap();
let length = length.to_usize().unwrap();
&data[start..start + length]
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@codecov-commenter
Copy link

codecov-commenter commented Apr 1, 2022

Codecov Report

Merging #1512 (f66990a) into master (15c87ae) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
+ Coverage   82.72%   82.75%   +0.02%     
==========================================
  Files         189      190       +1     
  Lines       54561    54707     +146     
==========================================
+ Hits        45137    45274     +137     
- Misses       9424     9433       +9     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 98.31% <100.00%> (+0.07%) ⬆️
arrow/src/array/equal/utils.rs 74.45% <0.00%> (-3.24%) ⬇️
parquet/src/encodings/encoding.rs 93.37% <0.00%> (-0.19%) ⬇️
arrow/src/array/builder.rs 86.68% <0.00%> (-0.05%) ⬇️
arrow/src/array/equal/union.rs 100.00% <0.00%> (ø)
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.11%) ⬆️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️
arrow/src/array/equal/mod.rs 93.98% <0.00%> (+0.65%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15c87ae...f66990a. Read the comment docs.

Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
@HaoYang670
Copy link
Contributor Author

HaoYang670 commented Apr 1, 2022

Hi @Dandandan ! These are some changes since your last review:

  1. Use dynamic Fn trait with Box, so that there is no repeated code.
  2. Give up using purely functional way. Because it is not suitable in this context. mut vec and for each are more straightforward.
  3. We have to materialize new_offset, new_length(new_end now) and new_start before calculating new_value. Because we need new_offset to estimate the size of array. And new_offset depends on new_length which depends on new_start.
  4. dyn Fn may introduce just a little performance penalty. I have updated the benchmark.

length_i + start
};
let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
>= zero
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cargo fmt likes this weird format.

length_so_far += length;
// start and end offsets for each substring
let mut new_starts_ends: Vec<(OffsetSize, OffsetSize)> =
Vec::with_capacity(array.len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't materialize new_starts_ends here, we have to calculate them twice when calculating new_values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a time-space trade-off, but not the major factor affecting performance


new_offsets.push(length_so_far);
offsets.windows(2).for_each(|pair| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely functional style is not appropriate because new_offsets and new_starts_ends have different number of elements. After trying some different ways, I find that mut vec + for each is more readable and efficient.

@HaoYang670 HaoYang670 requested a review from Dandandan April 2, 2022 09:13
@HaoYang670
Copy link
Contributor Author

Hi @Dandandan ! What's your opinion?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @HaoYang670

I spent some time reading this code but I ran out of time to properly review it. I am worried there are corner cases in here that might not be covered by tests -- did you review test coverage and is this properly covered?

I think some comments might help me understand the code more quickly as well.

I will try and find some more time later this week or next to review it more

@HaoYang670
Copy link
Contributor Author

Thank you very much @alamb. I have reviewed the Codedev report and all code branches have been covered.

@alamb alamb changed the title Speed up the substring kernel Speed up the substring kernel by about 2x Apr 7, 2022
@alamb
Copy link
Contributor

alamb commented Apr 7, 2022

Thanks @HaoYang670 and @Dandandan

@alamb alamb merged commit 497521f into apache:master Apr 7, 2022
@HaoYang670 HaoYang670 deleted the speed_up_substring branch April 7, 2022 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up the substring kernel
4 participants