Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sched: Reverse runqueue order when CLZ is available #14722

Merged
merged 1 commit into from
Sep 1, 2020

Conversation

bergzand
Copy link
Member

@bergzand bergzand commented Aug 6, 2020

Contribution description

This PR reverses the runqueue_cache bit order when the architecture has a
CLZ (count leading zeros) instruction. When the architecture supports
CLZ, it is faster to determine the most significant set bit of a word
than to determine the least significant bit set. Unfortunately when the
instruction is not available, it is more efficient to determine the
least significant bit set.

Reversing the bit order shaves off another 4 cycles on the same54-xpro.
From 147 to 143 ticks when testing with tests/bench_sched_nop.
Architectures where no CLZ instruction is available are not affected.

Testing procedure

  • Binaries without CLZ instruction are not changed.
  • scheduling still works on the affected architectures. (It breaks horribly and fast if this doesn't work)
  • As mentioned above, a slight increase in scheduling performance should be observed.

Background info

Previously the __builtin_ffs() is used to determine the least significant bit set. On platforms where CLZ is used this is implemented as

unsigned __builtin_ffs(uint32_t v)
{
    return 31 - CLZ(v & -v);
}

Reversing the bitorder allows direct use of the CLZ instruction removes the v & -v, shaving off these cycles.

Issues/PRs references

None

@bergzand bergzand added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: core Area: RIOT kernel. Handle PRs marked with this with care! CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR labels Aug 6, 2020
This commit reverses the runqueue_cache bit order when the architecture
has a CLZ (count leading zeros) instruction. When the architecture
supports CLZ, it is faster to determine the most significant set bit of
a word than to determine the least significant bit set. Unfortunately
when the instruction is not available, it is more efficient to determine
the least significant bit set.

Reversing the bit order shaves off another 4 cycles on the same54-xpro.
From 147 to 143 ticks when testing with tests/bench_sched_nop.
Architectures where no CLZ instruction is available are not affected.
@bergzand bergzand force-pushed the pr/sched/runqueue_clz branch from 88a2763 to ab1d0b6 Compare August 6, 2020 18:39
@maribu
Copy link
Member

maribu commented Aug 6, 2020

After reading the PR description, I expected pretty messy code. The helper functions however resulted in the distinction being nicely contained, so that the code actually looks quite good :-)

@maribu
Copy link
Member

maribu commented Aug 6, 2020

On the Nucleo-F767ZI:

PR:

  text	   data	    bss	    dec	    hex	filename
  13468	    132	   3948	  17548	   448c	/home/maribu/Repos/software/RIOT/tests/bench_thread_yield_pingpong/bin/nucleo-f767zi/tests_bench_thread_yield_pingpong.elf

main(): This is RIOT! (Version: test)
main starting
{ "result" : 304225, "ticks" : 710 }

master:

   text	   data	    bss	    dec	    hex	filename
  13468	    132	   3948	  17548	   448c	/home/maribu/Repos/software/RIOT/tests/bench_thread_yield_pingpong/bin/nucleo-f767zi/tests_bench_thread_yield_pingpong.elf

main(): This is RIOT! (Version: test)
main starting
{ "result" : 325301, "ticks" : 664 }

@bergzand
Copy link
Member Author

bergzand commented Aug 6, 2020

On the Nucleo-F767ZI:

Not sure what's going on here for you, I can't reproduce it on my nucleo-f746zg.

On my nucleo-f746zg:
PR:

2020-08-06 22:26:53,865 # main(): This is RIOT! (Version: 2020.10-devel-597-gab1d0b-pr/sched/runqueue_clz)
2020-08-06 22:26:53,866 # main starting
2020-08-06 22:26:54,869 # { "result" : 329053, "ticks" : 656 }
2020-08-06 22:26:57,785 # Exiting Pyterm

Master:

2020-08-06 22:27:15,295 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-06 22:27:15,297 # main starting
2020-08-06 22:27:16,300 # { "result" : 302097, "ticks" : 715 }

@bergzand
Copy link
Member Author

bergzand commented Aug 6, 2020

On the Nucleo-F767ZI:

I can reproduce these results on a different nucleo-f767ZI though 😢

@kaspar030
Copy link
Contributor

I can reproduce these results on a different nucleo-f767ZI though cry

try cold if cold booting makes a difference.

@bergzand
Copy link
Member Author

bergzand commented Aug 6, 2020

try cold if cold booting makes a difference.

Doesn't make a difference on the nucleo-f746zg. I don't have local access to the nucleo-f767zi, I can't power cycle it.

@maribu
Copy link
Member

maribu commented Aug 6, 2020

I get reproducible results on the Nucleo-F767ZI. If I use BUILD_IN_DOCKER=1, the ROM gets smaller (my local newlib is more recent and more bloaty), but the performance numbers are identical. Those numbers remain identical after power cycling. Warm boot and cold boot don't make a difference.

If this PR results in better performance for most boards, this shouldn't block this though.

@bergzand
Copy link
Member Author

bergzand commented Aug 6, 2020

When I flash my nucleo-f746zg with an ELF file compiled for the nucleo-f767zi I get the same numbers posted earlier by @maribu for the nucleo-f767zi.

@bergzand
Copy link
Member Author

bergzand commented Aug 6, 2020

The other way around also holds true. Flashing an nucleo-f767zi with a build for the nucleo-f746zg works perfectly and shows identical numbers as nucleo-f746zg hardware.

I wonder if we're running into a software issue causing these odd results on the stm32f767zi

@bergzand
Copy link
Member Author

bergzand commented Aug 6, 2020

tests/bitarithm_timings confirms these numbers:

Nucleo-f746zi:

2020-08-06 23:34:22,979 # START
2020-08-06 23:34:22,984 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-06 23:34:22,985 # Start.
2020-08-06 23:34:27,989 # + bitarithm_msb: 4529488 iterations per second
2020-08-06 23:34:32,994 # + bitarithm_lsb: 3793632 iterations per second
2020-08-06 23:34:37,998 # + bitarithm_bits_set: 2145251 iterations per second
2020-08-06 23:34:43,003 # + bitarithm_test_and_clear: 776978 iterations per second
2020-08-06 23:34:43,004 # Done.

Nucleo-f767zg:

2020-08-06 23:34:25,451 # START
2020-08-06 23:34:25,457 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-06 23:34:25,457 # Start.
2020-08-06 23:34:30,462 # + bitarithm_msb: 4023283 iterations per second
2020-08-06 23:34:35,466 # + bitarithm_lsb: 4601862 iterations per second
2020-08-06 23:34:40,471 # + bitarithm_bits_set: 395107 iterations per second
2020-08-06 23:34:45,476 # + bitarithm_test_and_clear: 1830507 iterations per second
2020-08-06 23:34:45,477 # Done.

@maribu
Copy link
Member

maribu commented Aug 7, 2020

A quick look at the binaries using elf_diff seems that only vectors_cpu differ for those boards, and the remaining diffs are due to different target addresses for branches/jumps/calls due to the different memory layout (due to the different size of vectors_cpu). All other symbols have the same size.

@bergzand
Copy link
Member Author

bergzand commented Aug 7, 2020

I've put what I've found out so far in #14728. I don't consider the issue related to this PR and I would prefer not to have that block this PR here.

@kaspar030
Copy link
Contributor

Code looks OK. Seems like a nice optimization. Do we need benchmarks for some boards with / without CLZ? (ignoring the F7 for now)

@bergzand bergzand added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation and removed Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation labels Sep 1, 2020
@bergzand
Copy link
Member Author

bergzand commented Sep 1, 2020

Ticks as printed by the tests. The samr21-xpro doesn't have the CLZ instruction and is indeed unaffected by this PR.

Mutex ping-pong:

Pre Post
samr21-xpro 735 735
nucleo-f103rb 656 641
same54-xpro 471 463

Sched nop:

Pre Post
samr21-xpro 241 241
nucleo-f103rb 209 198
same54-xpro 147 143

Copy link
Contributor

@kaspar030 kaspar030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK.

Small improvement. Code not too ugly.

@kaspar030 kaspar030 merged commit f68f19a into RIOT-OS:master Sep 1, 2020
@bergzand
Copy link
Member Author

bergzand commented Sep 1, 2020

Thanks!

@bergzand bergzand deleted the pr/sched/runqueue_clz branch September 1, 2020 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants