Optimize vec::retain
performance
#91527
Conversation
Add `into_iter().filter().collect()` as a comparison point since it was reported to be faster than `retain`. Remove clone inside benchmark loop to reduce allocator noise.
This simply moves the loops into the inner function, which leads to better results.

```
old:
test vec::bench_retain_100000       ... bench: 203,828 ns/iter (+/- 2,101)
test vec::bench_retain_iter_100000  ... bench:  63,324 ns/iter (+/- 12,305)
test vec::bench_retain_whole_100000 ... bench:  42,989 ns/iter (+/- 291)

new:
test vec::bench_retain_100000       ... bench:  42,180 ns/iter (+/- 451)
test vec::bench_retain_iter_100000  ... bench:  65,167 ns/iter (+/- 11,971)
test vec::bench_retain_whole_100000 ... bench:  33,736 ns/iter (+/- 12,404)
```
(rust-highfive has picked a reviewer for you, use r? to override)
retain is used in some places in the compiler, some are fairly warm code, let's see if it has an impact. @bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 67180ef with merge 7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5...
Testing this PR on the system from my report in #91497 (Ryzen 5900X with Linux).
New:
Looking much better.
Interesting, I don't even know why LLVM can't optimize this code, but the benchmark is very convincing.
Performs similarly to what I came up with before seeing that you claimed the issue:

```rust
#[unstable(feature = "vec_retain_mut", issue = "90829")]
pub fn retain_mut<F>(&mut self, mut f: F)
where
    F: FnMut(&mut T) -> bool,
{
    let original_len = self.len();
    // Avoid double drop if the drop guard is not executed,
    // since we may make some holes during the process.
    unsafe { self.set_len(0) };

    // Vec: [Kept, Kept, Hole, Hole, Hole, Hole, Unchecked, Unchecked]
    //      |<-            processed len          ->| ^- next to check
    //                  |<-    deleted cnt      ->|
    //      |<-                 original_len                       ->|
    // Kept: Elements which predicate returns true on.
    // Hole: Moved or dropped element slot.
    // Unchecked: Unchecked valid elements.
    //
    // This drop guard will be invoked when the predicate or `drop` of an element panics.
    // It shifts unchecked elements to cover holes and `set_len` to the correct length.
    // In cases when the predicate and `drop` never panic, it will be optimized out.
    struct BackshiftOnDrop<'a, T, A: Allocator> {
        v: &'a mut Vec<T, A>,
        processed_len: usize,
        deleted_cnt: usize,
        original_len: usize,
    }

    impl<T, A: Allocator> Drop for BackshiftOnDrop<'_, T, A> {
        fn drop(&mut self) {
            if self.deleted_cnt > 0 {
                // SAFETY: Trailing unchecked items must be valid since we never touch them.
                unsafe {
                    ptr::copy(
                        self.v.as_ptr().add(self.processed_len),
                        self.v.as_mut_ptr().add(self.processed_len - self.deleted_cnt),
                        self.original_len - self.processed_len,
                    );
                }
            }
            // SAFETY: After filling holes, all items are in contiguous memory.
            unsafe {
                self.v.set_len(self.original_len - self.deleted_cnt);
            }
        }
    }

    let mut g = BackshiftOnDrop { v: self, processed_len: 0, deleted_cnt: 0, original_len };

    // `check_one` returns the current element if it was retained.
    #[inline(always)]
    fn check_one<F, T, A: Allocator>(
        f: &mut F,
        g: &mut BackshiftOnDrop<'_, T, A>,
    ) -> Option<*const T>
    where
        F: FnMut(&mut T) -> bool,
    {
        // SAFETY: Unchecked element must be valid.
        let cur = unsafe { &mut *g.v.as_mut_ptr().add(g.processed_len) };
        let retain = f(cur);
        // Advance early to avoid double drop if `drop_in_place` panics.
        g.processed_len += 1;
        if retain {
            Some(cur)
        } else {
            g.deleted_cnt += 1;
            // SAFETY: We never touch this element again after it is dropped.
            unsafe { ptr::drop_in_place(cur) };
            None
        }
    }

    // Stage 1: Nothing was deleted.
    while g.processed_len != original_len {
        if check_one(&mut f, &mut g).is_none() {
            break;
        }
    }

    // Stage 2: Some elements were deleted.
    while g.processed_len != original_len {
        if let Some(cur) = check_one(&mut f, &mut g) {
            // SAFETY: `deleted_cnt` > 0, so the hole slot must not overlap with the current element.
            // We use copy for move, and never touch this element again.
            unsafe {
                let hole_slot = g.v.as_mut_ptr().add(g.processed_len - 1 - g.deleted_cnt);
                ptr::copy_nonoverlapping(cur, hole_slot, 1);
            }
        }
    }

    // All items are processed. This can be optimized to `set_len` by LLVM.
    drop(g);
}
```
☀️ Try build successful - checks-actions
Queued 7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5 with parent 887999d, future comparison URL.
Finished benchmarking commit (7c69e77aa6984d2d2e1aa554041d3fdf92ee19f5): comparison url. Summary: This benchmark run did not return any relevant changes. If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. @bors rollup=never |
```rust
// SAFETY: We never touch this element again after dropped.
unsafe { ptr::drop_in_place(cur) };
// We already advanced the counter.
if DELETED {
```
Can we write this more clearly? It is quite complex, imho.
Maybe this would be more readable:
```rust
match (DELETED, f(cur)) {
    (true, true) => {
        // move current item to freed hole
    }
    (true, false) => {
        // increment cnts, drop in place,
        // move items
    }
    (false, true) => {
        // increase processed len, do not move because no items dropped yet
    }
    (false, false) => {
        // increase cnts, drop in place, return
    }
}
```
```rust
// SAFETY: We never touch this element again after dropped.
g.deleted_cnt += 1;
unsafe { ptr::drop_in_place(cur) };
```

```rust
// SAFETY: We never touch this element again after dropped.
// We already advanced the counter.
unsafe { ptr::drop_in_place(cur) };
```
This code block would happen on two branches and thus would have to be duplicated.
Well, I guess `processed_len` can be hoisted out of all the conditionals.
I don't really think that duplicated code is a big issue here because it is very short. For me, the control flow is not easy to follow, but maybe it's just me and the code is OK.
```diff
 // process_one return a bool indicates whether the processing element should be retained.
 #[inline(always)]
-fn process_one<F, T, A: Allocator, const DELETED: bool>(
+fn process_loop<F, T, A: Allocator, const DELETED: bool>(
```
I think dropping the `#[inline]` modifier should not be done before the New Pass Manager stabilizes, because this affects how early the function would be inlined. Also, `fn drop` can have this modifier too.
Another thing I see is that `fn drop` always makes a `memmove` call despite the fact that there are almost always 0 items to copy (it is only nonzero if a panic occurs). LLVM doesn't emit any check before the memmove call: godbolt.
Maybe the condition should look like `if self.deleted_cnt > 0 && self.original_len != self.processed_len {`
> I think, dropping #[inline] modifier should not be done before New Pass Manager stabilizes because this affects how early function would be inlined.
Is it relevant though? I think most of the performance comes from the loops themselves being optimized properly, not from inlining the loops into the larger function. They only do unchecked pointer arithmetic anyway, so it's not like they benefit from bounds check elimination.
```rust
if DELETED {
    continue;
} else {
    break;
```
`return` shows our intent more clearly.
r? @rust-lang/libs
Nice!
@bors r+
📌 Commit 67180ef has been approved by
⌛ Testing commit 67180ef with merge 04a1625e6667c7a3e5e1a0621077d0053f09206c...
💔 Test failed - checks-actions
Seems to be a spurious failure that occasionally also hits other PRs. @bors retry
☀️ Test successful - checks-actions
Finished benchmarking commit (a090c86): comparison url. Summary: This change led to large relevant mixed results 🤷 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression |
This seems likely to be noise as a result of the issue tracked in rust-lang/rustc-perf#1126 -- marking as triaged, not something we should address directly. @rustbot label +perf-regression-triaged |
This simply moves the loops into the inner function, which leads to better results.
Measured on x86_64-unknown-linux-gnu, Zen2
Fixes #91497