sched: Reverse runqueue order when CLZ is available #14722

bergzand · 2020-08-06T18:37:54Z

Contribution description

This PR reverses the runqueue_cache bit order when the architecture has a
CLZ (count leading zeros) instruction. When the architecture supports
CLZ, it is faster to determine the most significant set bit of a word
than to determine the least significant bit set. Unfortunately when the
instruction is not available, it is more efficient to determine the
least significant bit set.

Reversing the bit order shaves off another 4 cycles on the same54-xpro.
From 147 to 143 ticks when testing with tests/bench_sched_nop.
Architectures where no CLZ instruction is available are not affected.

Testing procedure

Binaries without CLZ instruction are not changed.
scheduling still works on the affected architectures. (It breaks horribly and fast if this doesn't work)
As mentioned above, a slight increase in scheduling performance should be observed.

Background info

Previously the __builtin_ffs() is used to determine the least significant bit set. On platforms where CLZ is used this is implemented as

unsigned __builtin_ffs(uint32_t v)
{
    return 31 - CLZ(v & -v);
}

Reversing the bitorder allows direct use of the CLZ instruction removes the v & -v, shaving off these cycles.

Issues/PRs references

None

This commit reverses the runqueue_cache bit order when the architecture has a CLZ (count leading zeros) instruction. When the architecture supports CLZ, it is faster to determine the most significant set bit of a word than to determine the least significant bit set. Unfortunately when the instruction is not available, it is more efficient to determine the least significant bit set. Reversing the bit order shaves off another 4 cycles on the same54-xpro. From 147 to 143 ticks when testing with tests/bench_sched_nop. Architectures where no CLZ instruction is available are not affected.

maribu · 2020-08-06T19:13:46Z

After reading the PR description, I expected pretty messy code. The helper functions however resulted in the distinction being nicely contained, so that the code actually looks quite good :-)

maribu · 2020-08-06T19:21:01Z

On the Nucleo-F767ZI:

PR:

  text	   data	    bss	    dec	    hex	filename
  13468	    132	   3948	  17548	   448c	/home/maribu/Repos/software/RIOT/tests/bench_thread_yield_pingpong/bin/nucleo-f767zi/tests_bench_thread_yield_pingpong.elf

main(): This is RIOT! (Version: test)
main starting
{ "result" : 304225, "ticks" : 710 }

master:

   text	   data	    bss	    dec	    hex	filename
  13468	    132	   3948	  17548	   448c	/home/maribu/Repos/software/RIOT/tests/bench_thread_yield_pingpong/bin/nucleo-f767zi/tests_bench_thread_yield_pingpong.elf

main(): This is RIOT! (Version: test)
main starting
{ "result" : 325301, "ticks" : 664 }

bergzand · 2020-08-06T20:29:13Z

On the Nucleo-F767ZI:

Not sure what's going on here for you, I can't reproduce it on my nucleo-f746zg.

On my nucleo-f746zg:
PR:

2020-08-06 22:26:53,865 # main(): This is RIOT! (Version: 2020.10-devel-597-gab1d0b-pr/sched/runqueue_clz)
2020-08-06 22:26:53,866 # main starting
2020-08-06 22:26:54,869 # { "result" : 329053, "ticks" : 656 }
2020-08-06 22:26:57,785 # Exiting Pyterm

Master:

2020-08-06 22:27:15,295 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-06 22:27:15,297 # main starting
2020-08-06 22:27:16,300 # { "result" : 302097, "ticks" : 715 }

bergzand · 2020-08-06T20:46:59Z

On the Nucleo-F767ZI:

I can reproduce these results on a different nucleo-f767ZI though 😢

kaspar030 · 2020-08-06T20:50:43Z

I can reproduce these results on a different nucleo-f767ZI though cry

try cold if cold booting makes a difference.

bergzand · 2020-08-06T20:53:36Z

try cold if cold booting makes a difference.

Doesn't make a difference on the nucleo-f746zg. I don't have local access to the nucleo-f767zi, I can't power cycle it.

maribu · 2020-08-06T20:59:37Z

I get reproducible results on the Nucleo-F767ZI. If I use BUILD_IN_DOCKER=1, the ROM gets smaller (my local newlib is more recent and more bloaty), but the performance numbers are identical. Those numbers remain identical after power cycling. Warm boot and cold boot don't make a difference.

If this PR results in better performance for most boards, this shouldn't block this though.

bergzand · 2020-08-06T21:01:48Z

When I flash my nucleo-f746zg with an ELF file compiled for the nucleo-f767zi I get the same numbers posted earlier by @maribu for the nucleo-f767zi.

bergzand · 2020-08-06T21:13:07Z

The other way around also holds true. Flashing an nucleo-f767zi with a build for the nucleo-f746zg works perfectly and shows identical numbers as nucleo-f746zg hardware.

I wonder if we're running into a software issue causing these odd results on the stm32f767zi

bergzand · 2020-08-06T21:35:11Z

tests/bitarithm_timings confirms these numbers:

Nucleo-f746zi:

2020-08-06 23:34:22,979 # START
2020-08-06 23:34:22,984 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-06 23:34:22,985 # Start.
2020-08-06 23:34:27,989 # + bitarithm_msb: 4529488 iterations per second
2020-08-06 23:34:32,994 # + bitarithm_lsb: 3793632 iterations per second
2020-08-06 23:34:37,998 # + bitarithm_bits_set: 2145251 iterations per second
2020-08-06 23:34:43,003 # + bitarithm_test_and_clear: 776978 iterations per second
2020-08-06 23:34:43,004 # Done.

Nucleo-f767zg:

2020-08-06 23:34:25,451 # START
2020-08-06 23:34:25,457 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-06 23:34:25,457 # Start.
2020-08-06 23:34:30,462 # + bitarithm_msb: 4023283 iterations per second
2020-08-06 23:34:35,466 # + bitarithm_lsb: 4601862 iterations per second
2020-08-06 23:34:40,471 # + bitarithm_bits_set: 395107 iterations per second
2020-08-06 23:34:45,476 # + bitarithm_test_and_clear: 1830507 iterations per second
2020-08-06 23:34:45,477 # Done.

maribu · 2020-08-07T07:56:20Z

A quick look at the binaries using elf_diff seems that only vectors_cpu differ for those boards, and the remaining diffs are due to different target addresses for branches/jumps/calls due to the different memory layout (due to the different size of vectors_cpu). All other symbols have the same size.

bergzand · 2020-08-07T13:39:17Z

I've put what I've found out so far in #14728. I don't consider the issue related to this PR and I would prefer not to have that block this PR here.

kaspar030 · 2020-09-01T13:50:41Z

Code looks OK. Seems like a nice optimization. Do we need benchmarks for some boards with / without CLZ? (ignoring the F7 for now)

bergzand · 2020-09-01T14:02:04Z

Ticks as printed by the tests. The samr21-xpro doesn't have the CLZ instruction and is indeed unaffected by this PR.

Mutex ping-pong:

	Pre	Post
samr21-xpro	735	735
nucleo-f103rb	656	641
same54-xpro	471	463

Sched nop:

	Pre	Post
samr21-xpro	241	241
nucleo-f103rb	209	198
same54-xpro	147	143

kaspar030

ACK.

Small improvement. Code not too ugly.

bergzand · 2020-09-01T21:46:32Z

Thanks!

bergzand added Type: enhancement Area: core CI: ready for build CI: run tests labels Aug 6, 2020

bergzand requested review from benpicco, maribu, kaspar030 and fjmolinas August 6, 2020 18:37

bergzand force-pushed the pr/sched/runqueue_clz branch from 88a2763 to ab1d0b6 Compare August 6, 2020 18:39

bergzand added Type: enhancement and removed Type: enhancement labels Sep 1, 2020

kaspar030 approved these changes Sep 1, 2020

View reviewed changes

kaspar030 merged commit f68f19a into RIOT-OS:master Sep 1, 2020

bergzand deleted the pr/sched/runqueue_clz branch September 1, 2020 21:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sched: Reverse runqueue order when CLZ is available #14722

sched: Reverse runqueue order when CLZ is available #14722

bergzand commented Aug 6, 2020 •

edited

Loading

maribu commented Aug 6, 2020

maribu commented Aug 6, 2020

bergzand commented Aug 6, 2020 •

edited

Loading

bergzand commented Aug 6, 2020

kaspar030 commented Aug 6, 2020

bergzand commented Aug 6, 2020

maribu commented Aug 6, 2020

bergzand commented Aug 6, 2020

bergzand commented Aug 6, 2020

bergzand commented Aug 6, 2020

maribu commented Aug 7, 2020

bergzand commented Aug 7, 2020

kaspar030 commented Sep 1, 2020

bergzand commented Sep 1, 2020

kaspar030 left a comment

bergzand commented Sep 1, 2020

sched: Reverse runqueue order when CLZ is available #14722

sched: Reverse runqueue order when CLZ is available #14722

Conversation

bergzand commented Aug 6, 2020 • edited Loading

Contribution description

Testing procedure

Background info

Issues/PRs references

maribu commented Aug 6, 2020

maribu commented Aug 6, 2020

bergzand commented Aug 6, 2020 • edited Loading

bergzand commented Aug 6, 2020

kaspar030 commented Aug 6, 2020

bergzand commented Aug 6, 2020

maribu commented Aug 6, 2020

bergzand commented Aug 6, 2020

bergzand commented Aug 6, 2020

bergzand commented Aug 6, 2020

maribu commented Aug 7, 2020

bergzand commented Aug 7, 2020

kaspar030 commented Sep 1, 2020

bergzand commented Sep 1, 2020

kaspar030 left a comment

Choose a reason for hiding this comment

bergzand commented Sep 1, 2020

bergzand commented Aug 6, 2020 •

edited

Loading

bergzand commented Aug 6, 2020 •

edited

Loading