-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Help LLVM vectorize comparison kernel ~50-80% faster #2646
Conversation
52032d3
to
6e56f3c
Compare
let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) }; | ||
let mut buffer = MutableBuffer::new((left.len() + 7) / 8); | ||
|
||
let chunks = left.len() / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we change the implementation of from_trusted_len_iter_bool
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that LLVM isn't able to elide the conditional due to Iterator::next, and this prevents it from working correctly. I couldn't find a way around this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add some comment about why we don't use iter
here, to avoid other developers bringing the iter
back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new collect_bools method with docs on why it exists
.map(|i| unsafe { op(left.value_unchecked(i), right.value_unchecked(i)) }); | ||
// same size as $left.len() and $right.len() | ||
let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) }; | ||
let mut buffer = MutableBuffer::new((left.len() + 7) / 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the ceil
function in bit_util.rs
, you can use it.
let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) }; | ||
let mut buffer = MutableBuffer::new((left.len() + 7) / 8); | ||
|
||
let chunks = left.len() / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add some comment about why we don't use iter
here, to avoid other developers bringing the iter
back.
Latest numbers
|
.map(|i| unsafe { op(left.value_unchecked(i), right.value_unchecked(i)) }); | ||
// same size as $left.len() and $right.len() | ||
let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) }; | ||
let buffer = MutableBuffer::collect_bool(left.len(), |i| unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the safety comments?
Looks good! Happy about the scalar perf too. One remaining comment on the (un)safety remarks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @tustvold 🚀 |
Benchmark runs are scheduled for baseline = 6d86472 and contender = b46fc92. b46fc92 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
result.push_unchecked(byte_accum); | ||
} | ||
result | ||
Self::collect_bool(len, |_| iterator.next().unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold one last thought, I wonder if using unwrap_unchecked
here would result in the same performance (as it removes the condition).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave this a go and honestly I don't really understand the results. The scalar comparison is on par with this PR, but the non-scalar variants are worse by 50% compared to master... My 2 cents is that not using iterators for performance critical code makes for code that is easier to reason about, both by humans and LLVM, and so even if we can somehow finagle from_trusted_len_iter_bool to perform the same, I'm more comfortable with a simple loop 😅
Which issue does this PR close?
Closes #.
Rationale for this change
I was messing around in godbolt and realised these loops were not getting vectorised correctly
Default flags
Target CPU
Once this is in, I will give the same treatment to the scalar kernels.
As an added bonus this likely reduces the amount of generated LLVM IR, and so may help very slightly with compile times.
What changes are included in this PR?
Are there any user-facing changes?