-
-
Notifications
You must be signed in to change notification settings - Fork 699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up searches by removing repeated memsets coming from vec.resize() #2327
base: main
Are you sure you want to change the base?
Conversation
Also, reserve exactly the size needed, which is surprisingly needed to get the full speedup of ~5% on a good fraction of the queries.
positions.reserve_exact(id_range.len()); | ||
#[allow(clippy::uninit_vec)] | ||
unsafe { | ||
positions.set_len(id_range.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure this is undefined behaviour due to the validity requirements on integers which newly allocated memory does not fulfil.
If you want to apply this optimization, get_batch_u32s
needs to work with the return value of Vec::spare_capacity_mut
to write
into the MaybeUninit<u32>
values and only call set_len
after all of these values have been initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks for pointing this out, I see that in rust-lang/unsafe-code-guidelines#439 it was decided that the mere existence of an uninitialized integer is undefined behavior. There is also an interesting discussion on that here where it was controversially concluded that it was UB.
The trick is the same as in quickwit-oss/quickwit#4712, so we should probably decide whether we do neither or both. @fulmicoton
For that quickwit PR, I see that read_buf (see rfc, function doc) could be used, but it's not stable unfortunately.
For this PR, note that this vector ends up being filled by AVX2 instructions, so what you propose does not seem possible: we would either to take a performance hit by storing the output of the AVX2 computation to a temporary variable and write() it into the uninitialized vector, or use MaybeUninit.as_mut_ptr which would lead us back to UB. Or do you have an efficient suggestion to make it work with the AVX instructions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we should probably decide whether we do neither or both. @fulmicoton
I'd say we should do it for one and not for the other.
For the Read side, the one here is simply not needed. We also own the bitpacking crate. We can just add a method that takes a MaybeUninit
.
For the Read trait, this is well known problem that has a workaround but it is not stabilized yet.
The problem is that if a Read implementation were to read the buffer we would be in big troubles.
We put that in a place where the Read impl is well defined so I think we are ok.
I think there is a problem in the benchmark. That code is not used in most of the queries showing an improvements. |
Also, reserve exactly the size needed, which is surprisingly needed to get the full speedup of ~5% on a good fraction of the queries.
The memset (accounting for 4% of the CPU cycles) in the range_numeric query is indeed gone with this PR, see flame graphs:
Overall, it makes things a few percent faster on many queries:
Note: I made sure to use the same version of tantivy for the baseline and for this proposed change. The only difference between the two columns is this PR.
It would be possible to have this speedup without using unsafe by never truncating it (as is done in filter_vec_in_place()) and instead passing the number of final elements in the vector back to VecCursor.