stm32f7: Large performance difference between stm32f746 and stm32f767 #14728

Open
bergzand opened this issue Aug 7, 2020 · 10 comments
Labels
Area: cpu Area: CPU/MCU ports Platform: ARM Platform: This PR/issue effects ARM-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)


bergzand commented Aug 7, 2020

Description

The stm32f746 and stm32f767 are almost identical; they differ only in peripherals and cache size, so one would expect at least roughly identical benchmark results between the two cores. However, running tests/bitarithm_timings shows widely different results.
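
For context, the benchmark loops are of roughly this shape (a simplified sketch of the structure, not the actual tests/bitarithm_timings source):

```c
/* Simplified sketch of the benchmark structure: call one bitarithm helper
 * in a tight loop for a fixed time window and count iterations
 * (not the actual tests/bitarithm_timings code). */
#include <stdint.h>
#include "bitarithm.h"
#include "xtimer.h"

static uint32_t bench_msb(uint32_t duration_us)
{
    uint32_t iterations = 0;
    uint32_t start = xtimer_now_usec();

    while (xtimer_now_usec() - start < duration_us) {
        (void)bitarithm_msb(iterations | 1);   /* avoid the v == 0 case */
        iterations++;
    }
    return iterations;
}
```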

nucleo-f746zg
2020-08-07 15:20:21,461 # START
2020-08-07 15:20:21,467 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-07 15:20:21,468 # Start.
2020-08-07 15:20:26,473 # + bitarithm_msb: 4529488 iterations per second
2020-08-07 15:20:31,476 # + bitarithm_lsb: 3793632 iterations per second
2020-08-07 15:20:36,481 # + bitarithm_bits_set: 2145251 iterations per second
2020-08-07 15:20:41,486 # + bitarithm_test_and_clear: 776978 iterations per second
2020-08-07 15:20:41,487 # Done.
nucleo-f767zi
2020-08-07 15:18:58,026 # Help: Press s to start test, r to print it is ready
2020-08-07 15:18:58,027 # START
2020-08-07 15:18:58,032 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-07 15:18:58,033 # Start.
2020-08-07 15:19:03,037 # + bitarithm_msb: 4023283 iterations per second
2020-08-07 15:19:08,041 # + bitarithm_lsb: 4601862 iterations per second
2020-08-07 15:19:13,047 # + bitarithm_bits_set: 395107 iterations per second
2020-08-07 15:19:18,051 # + bitarithm_test_and_clear: 1830507 iterations per second
2020-08-07 15:19:18,052 # Done.

Notably, bitarithm_msb is faster on one and bitarithm_lsb is faster on the other. This is odd considering that they have the same instruction set, and inspecting the binaries also shows identical code for these functions.

After confirming this, I flashed the firmware built for the nucleo-f767zi on the nucleo-f746zg; the two boards and cores are identical enough to make this work, and the reverse is also possible without any issues for this test application. In short, flashing firmware built for board A on board B shows the same performance as flashing it on board A, and vice versa.

The compiled binaries for these two boards are almost identical. The only difference is the number of interrupts: 98 IRQ lines vs 110 IRQ lines. This shifts all function addresses a bit, so to easily compare the contents of the two firmware ELF files with each other, I removed the 12 extra IRQ lines on the stm32f767. With that, the two ELF files are almost identical; only two words differed. With only this difference remaining, I flashed the new binary on both boards, and voilà: matching measurements between the two boards. By changing the number of allocated IRQ handlers, all functions are shifted by a certain amount, causing different measurements. Increasing the number of handlers beyond what is useful also changes (and not necessarily increases) the performance.
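
To illustrate the mechanism (names and layout are illustrative, not RIOT's actual startup code): the vector table sits at the start of flash with one 4-byte entry per IRQ line, so adding or removing lines shifts every function that follows, and thereby its position relative to cache lines.

```c
/* Illustrative sketch, not RIOT's actual startup code: one 4-byte vector
 * entry per IRQ line, placed before .text, so the IRQ count shifts all
 * following function addresses by 4 bytes per line. */
#include <stdint.h>

#define NUM_IRQ_LINES 110U  /* 98 on the stm32f746, 110 on the stm32f767 */

typedef void (*isr_t)(void);

__attribute__((section(".vectors"), used))
static const isr_t vector_table[16U + NUM_IRQ_LINES] = {
    0, /* initial SP, reset handler, system exceptions, device ISRs ... */
};
```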

TL;DR (spoilers)

TL;DR: Modifying the number of allocated IRQ handlers changes the measured performance of tests/bitarithm_timings

Steps to reproduce the issue

With tests/bitarithm_timings:

  • Test firmware for the nucleo-f746zg on the nucleo-f746zg and the nucleo-f767zi.
  • Test firmware for the nucleo-f767zi on the nucleo-f746zg and the nucleo-f767zi.

This should reproduce my numbers above.

Expected results

Identical measurements between the two MCUs

Actual results

Different results between the two MCUs

@bergzand bergzand added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Platform: ARM Platform: This PR/issue effects ARM-based platforms Area: cpu Area: CPU/MCU ports labels Aug 7, 2020
@fjmolinas

> By changing the number of allocated IRQ handlers, all functions are shifted by a certain amount, causing different measurements. Increasing the number of handlers beyond what is useful also changes (and not necessarily increases) the performance.

Does this suggest that if we limit the ISR vectors to just the ones used by the application, performance improves?


bergzand commented Aug 7, 2020

> Does this suggest that if we limit the ISR vectors to just the ones used by the application, performance improves?

Nope:

| IRQ lines allocated | msb | lsb | bits_set | test_and_clear |
| --- | --- | --- | --- | --- |
| 98 | 4529488 | 3793632 | 2145251 | 776978 |
| 110 | 4023283 | 4601862 | 395107 | 1830507 |
| 256 | 4987011 | 3768809 | 2260300 | |


benpicco commented Aug 7, 2020

If we disable the cache, does the difference disappear?


kaspar030 commented Aug 7, 2020

So by allocating a specific amount of ISRs, we move all memory that's behind by 4 B per ISR, right? A cache line on the f7 is 32 B. If we graph the results with n ISRs, n+1, ..., n+8, n+9, ... and the results repeat every 8 ISRs, this would at least partially hint at a cache-line alignment issue.
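
In numbers (a tiny illustrative sketch, not code from the issue): with 4-byte vector entries and 32-byte cache lines, the offset of the code following the vector table within a cache line is periodic in the IRQ count with period 32 / 4 = 8.

```c
/* Sketch: offset of the first instruction after the vector table within a
 * 32-byte cache line, as a function of the IRQ count (4 bytes per entry).
 * The result repeats every 8 IRQ lines. */
static unsigned text_offset_in_cache_line(unsigned irq_lines)
{
    unsigned vector_table_bytes = (16U + irq_lines) * 4U;
    return vector_table_bytes % 32U;
}
```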


bergzand commented Aug 7, 2020

| IRQ lines allocated | msb | lsb | bits_set | test_and_clear |
| --- | --- | --- | --- | --- |
| 98 | 4529488 | 3793632 | 2145251 | 776978 |
| 99 | 4459353 | 3459459 | 2108601 | 764601 |
| 100 | 4663964 | 3522934 | 388272 | 1815125 |
| 101 | 4689280 | 4494147 | 388448 | 2037735 |
| 102 | 4023283 | 4601862 | 395107 | 1830507 |
| 103 | 3781180 | 4979827 | 1974857 | 829174 |
| 104 | 4987011 | 3768809 | 2260300 | 822856 |
| 105 | 4402547 | 3768809 | 2116352 | 789762 |
| 106 | 4529488 | 3793632 | 2145251 | 776978 |
| 107 | 4459353 | 3459459 | 2108601 | 764601 |
| 108 | 4663964 | 3522934 | 388272 | 1815125 |
| 109 | 4689280 | 4494147 | 388448 | 2037734 |
| 110 | 4023283 | 4601862 | 395107 | 1830507 |
| 256 | 4987011 | 3768809 | 2260300 | 822856 |
| 257 | 4402547 | 3768809 | 2116348 | 789762 |

My measurements for 256 IRQ lines were incorrect somehow. Fixed them here and above.

@kaspar030

Does the ART go through the L1 cache? Moving the ISR vectors moves around both code and data.
Given that we have pretty tight loops for the benchmarks, if the ART is not affected, this could be a stack effect, where some calculation on the stack jumps more or less between cache lines depending on where the SP ends up when entering the benchmark loop.

Could you try decreasing the main stack size by 4, 8, ... bytes, to see if that changes the results?


kaspar030 commented Aug 7, 2020

Ah, no, changes to data size shouldn't affect where bss memory ends up in memory.

@kaspar030

Maybe take a look at nm output and see if the performance peaks correlate with symbols being 32-byte aligned?


bergzand commented Aug 7, 2020

> Does the ART go through the L1 cache?

The current linker scripts link the code at 0x0800 0000, which uses the AXIM interface; I think this goes through the L1 cache of the core. Linking at 0x0020 0000 maps the code to the flash memory through the ITCM interface, which goes through the ART cache.
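
For reference, a minimal sketch of the two aliases involved (based on the documented stm32f7 memory map: the same flash is visible at 0x0800 0000 via AXIM and at 0x0020 0000 via ITCM):

```c
/* Sketch: the same flash contents are visible at two bus addresses on the
 * stm32f7. 0x08000000 (AXIM alias) goes through the Cortex-M7 L1 cache,
 * 0x00200000 (ITCM alias) goes through the ART accelerator. */
#include <stdint.h>

#define FLASH_AXIM_BASE 0x08000000UL
#define FLASH_ITCM_BASE 0x00200000UL

static inline uintptr_t flash_axim_to_itcm(uintptr_t axim_addr)
{
    return axim_addr - FLASH_AXIM_BASE + FLASH_ITCM_BASE;
}
```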

Initial measurements with 110 IRQ lines with firmware linked to use the ART cache:

| IRQ lines allocated | msb | lsb | bits_set | test_and_clear |
| --- | --- | --- | --- | --- |
| 110 | 6182467 | 7215030 | 668860 | 2204080 |


bergzand commented Aug 8, 2020

What I have so far on the stm32f7's is:

  • Aligning functions to 32-byte boundaries helps in boosting and stabilizing the performance of the benchmarks: it removes the performance difference between the f7 subfamilies that comes from the shift in IRQ lines (a minimal alignment sketch follows this list). However, due to a bug in GCC, it doesn't play nicely with -Os.
  • Hacking the linker script to change the VMA of the .text section to start at 0x0020 0000 reroutes the instruction fetches through the TCM interface of the cortex-m7 core, which on the stm32f7 bypasses the L1 cache and uses the ART accelerator instead. This also boosts performance by a huge amount.
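
A minimal sketch of the alignment idea from the first bullet: forcing 32-byte alignment on a single function via an attribute instead of the global -falign-functions=32 flag (function name and body are made up for illustration, this is not the benchmark code):

```c
#include <stdint.h>

/* Sketch: per-function 32-byte alignment, matching the Cortex-M7
 * cache-line size (illustrative function, not the actual benchmark). */
__attribute__((aligned(32)))
static uint32_t popcount_loop(uint32_t v)
{
    uint32_t count = 0;
    while (v) {
        v &= v - 1U;    /* clear the lowest set bit */
        count++;
    }
    return count;
}
```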

With both these adjustments done, I get the following performance benchmarks on the nucleo-f746zg:

2020-08-08 15:26:35,234 # START
2020-08-08 15:26:35,239 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-08 15:26:35,240 # Start.
2020-08-08 15:26:40,244 # + bitarithm_msb: 13500000 iterations per second
2020-08-08 15:26:45,248 # + bitarithm_lsb: 13447468 iterations per second
2020-08-08 15:26:50,251 # + bitarithm_bits_set: 4235292 iterations per second
2020-08-08 15:26:55,256 # + bitarithm_test_and_clear: 3999998 iterations per second
2020-08-08 15:26:55,257 # Done.

For these measurements I had to compile with -O2 and manually disable some flash-costly optimizations, otherwise -falign-functions doesn't have any effect.

with bench_mutex_pingpong:
master:

{ "result" : 208042, "ticks" : 1038 }

O2-aligned-ART:

{ "result" : 562134, "ticks" : 384 }

Only the ITCM bus is routed through the ART. On the stm32f4 boards the I-reads are always routed through the ART, as the cortex-m4 doesn't have a separate TCM interface.
(Attached screenshot: Screenshot_20200808_154840)

I think it should be possible to add this to the linker scripts, but it might not be easy. The .text section would have to keep its LMA base at 0x0800 0000, while only the code segments get a VMA base of 0x0020 0000. The rodata must keep its VMA base at 0x0800 0000: the DMA doesn't have access to the TCM address space, so moving the rodata to 0x0020 0000 as well would break the DMA controllers when they copy const data from the flash.
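
As a sanity check on that DMA restriction, something like the following (an assumed debug helper, not a RIOT API) could catch source pointers that end up in the ITCM alias:

```c
/* Sketch of an assumed debug helper (not a RIOT API): reject DMA source
 * pointers that fall into the ITCM flash alias, which the stm32f7 DMA
 * masters cannot access. The alias starts at 0x00200000; its size equals
 * the flash size of the particular part. */
#include <assert.h>
#include <stdint.h>

#define FLASH_ITCM_BASE 0x00200000UL
#define FLASH_AXIM_BASE 0x08000000UL

static inline void dma_assert_src_reachable(const void *src)
{
    uintptr_t addr = (uintptr_t)src;
    assert(!(addr >= FLASH_ITCM_BASE && addr < FLASH_AXIM_BASE));
}
```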

@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 22, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021