-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sched: Reverse runqueue order when CLZ is available #14722
Conversation
This commit reverses the runqueue_cache bit order when the architecture has a CLZ (count leading zeros) instruction. When the architecture supports CLZ, it is faster to determine the most significant set bit of a word than to determine the least significant bit set. Unfortunately when the instruction is not available, it is more efficient to determine the least significant bit set. Reversing the bit order shaves off another 4 cycles on the same54-xpro. From 147 to 143 ticks when testing with tests/bench_sched_nop. Architectures where no CLZ instruction is available are not affected.
88a2763
to
ab1d0b6
Compare
After reading the PR description, I expected pretty messy code. The helper functions however resulted in the distinction being nicely contained, so that the code actually looks quite good :-) |
On the Nucleo-F767ZI: PR:
|
Not sure what's going on here for you, I can't reproduce it on my nucleo-f746zg. On my nucleo-f746zg:
Master:
|
I can reproduce these results on a different nucleo-f767ZI though 😢 |
try cold if cold booting makes a difference. |
Doesn't make a difference on the nucleo-f746zg. I don't have local access to the nucleo-f767zi, I can't power cycle it. |
I get reproducible results on the Nucleo-F767ZI. If I use If this PR results in better performance for most boards, this shouldn't block this though. |
When I flash my nucleo-f746zg with an ELF file compiled for the nucleo-f767zi I get the same numbers posted earlier by @maribu for the nucleo-f767zi. |
The other way around also holds true. Flashing an nucleo-f767zi with a build for the nucleo-f746zg works perfectly and shows identical numbers as nucleo-f746zg hardware. I wonder if we're running into a software issue causing these odd results on the stm32f767zi |
Nucleo-f746zi:
Nucleo-f767zg:
|
A quick look at the binaries using |
I've put what I've found out so far in #14728. I don't consider the issue related to this PR and I would prefer not to have that block this PR here. |
Code looks OK. Seems like a nice optimization. Do we need benchmarks for some boards with / without CLZ? (ignoring the F7 for now) |
Ticks as printed by the tests. The samr21-xpro doesn't have the CLZ instruction and is indeed unaffected by this PR. Mutex ping-pong:
Sched nop:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK.
Small improvement. Code not too ugly.
Thanks! |
Contribution description
This PR reverses the runqueue_cache bit order when the architecture has a
CLZ (count leading zeros) instruction. When the architecture supports
CLZ, it is faster to determine the most significant set bit of a word
than to determine the least significant bit set. Unfortunately when the
instruction is not available, it is more efficient to determine the
least significant bit set.
Reversing the bit order shaves off another 4 cycles on the same54-xpro.
From 147 to 143 ticks when testing with tests/bench_sched_nop.
Architectures where no CLZ instruction is available are not affected.
Testing procedure
Background info
Previously the __builtin_ffs() is used to determine the least significant bit set. On platforms where CLZ is used this is implemented as
Reversing the bitorder allows direct use of the CLZ instruction removes the
v & -v
, shaving off these cycles.Issues/PRs references
None