stm32f7: Large performance difference between stm32f746 and stm32f767 #14728
Comments
Does this suggest that performance improves if we limit the ISR vector table to just the entries used by the application?
Nope:
If we disable the cache, does the difference disappear?
So by allocating a specific number of ISRs, we shift all memory behind the vector table by 4 bytes per ISR, right? A cache line on the F7 is 32 bytes. If we graph the results for n ISRs, n+1, ..., n+8, n+9, ... and the results repeat every 8 ISRs, this would at least partially hint at a cache line alignment issue.
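To make the "repeats every 8 ISRs" argument concrete, here is a minimal sketch of the arithmetic. It assumes the usual Cortex-M layout of 16 system entries (initial SP plus 15 exception slots) ahead of the IRQ entries, and it assumes the code starts directly after the vector table with no extra linker padding. Both are assumptions, not something measured on these boards, and the function names are made up for illustration.

```python
CACHE_LINE = 32      # bytes per cache line on the F7, as stated in the comment above
ENTRY_SIZE = 4       # bytes per vector table entry
SYSTEM_ENTRIES = 16  # assumption: initial SP + 15 Cortex-M system exception slots


def code_offset(n_irqs: int) -> int:
    """Offset of the first byte after the vector table.

    Assumption: code is placed directly after the table without padding.
    """
    return (SYSTEM_ENTRIES + n_irqs) * ENTRY_SIZE


def cache_line_phase(n_irqs: int) -> int:
    """Position of that offset within a 32-byte cache line."""
    return code_offset(n_irqs) % CACHE_LINE


if __name__ == "__main__":
    # Print the phase for every table size between the two boards' IRQ counts.
    for n in range(98, 111):
        print(f"{n:3d} IRQs: offset {code_offset(n):4d}, phase {cache_line_phase(n):2d}")
```

Under these assumptions the phase is the same for n and n+8 IRQ lines, so a pure cache line alignment effect should show up as a period of 8 in the benchmark results. The real linker may insert alignment padding, so the actual symbol addresses should be checked before drawing conclusions.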
My measurements for 256 IRQ lines were incorrect somehow. Fixed them here and above.
Does the ART go through the L1 cache? Moving the ISR vectors moves around both code and data. Could you try decreasing the main stack size by 4, 8, ... bytes, to see if that changes the results?
Ah, no, changes to data size shouldn't affect where bss ends up in memory.
Maybe take a look at the nm output and see if the performance peaks correlate with symbols being 32-byte aligned?
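A minimal sketch of such a check, assuming the default three-column nm output (address, type, name). The sample addresses below are invented purely for illustration and are not taken from the actual firmware; only the two function names come from the benchmark.

```python
import re

CACHE_LINE = 32  # bytes per cache line on the F7

# Matches default nm lines such as "08000200 T bitarithm_msb".
# Undefined symbols have no address column and are skipped.
NM_LINE = re.compile(r"^([0-9a-fA-F]+)\s+(\S)\s+(\S+)$")


def parse_nm(text):
    """Return (name, address) pairs for every defined symbol in nm output."""
    symbols = []
    for line in text.splitlines():
        match = NM_LINE.match(line.strip())
        if match:
            # Clear bit 0 in case the tool reports the Thumb bit in the address.
            address = int(match.group(1), 16) & ~1
            symbols.append((match.group(3), address))
    return symbols


def alignment_report(text, names):
    """Map each requested symbol to its offset within a cache line (0 = aligned)."""
    wanted = set(names)
    return {name: addr % CACHE_LINE for name, addr in parse_nm(text) if name in wanted}


# Hypothetical nm output; the addresses are made up.
SAMPLE = """\
08000200 T bitarithm_msb
08000234 T bitarithm_lsb
         U some_undefined_symbol
"""

if __name__ == "__main__":
    for name, offset in alignment_report(SAMPLE, ["bitarithm_msb", "bitarithm_lsb"]).items():
        print(f"{name}: offset {offset} within cache line")
```

Running this against the nm output of each firmware variant and comparing the offsets with the benchmark numbers would show whether the fast cases line up with an offset of 0.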
The current linker scripts link the code at
Initial measurements with 110 IRQ lines, with firmware linked to use the ART cache:
What I have so far on the stm32f7's is:
With both these adjustments done, I get the following performance benchmarks on the nucleo-f746zg:
For these measurements I had to compile with -O2 and manually disable some flash-costly optimizations, otherwise -falign-functions doesn't have any effect.
With bench_mutex_pingpong:
O2-aligned-ART:
Only the ITCM bus is routed through the ART. On the stm32f4 boards the I-reads are always routed through the ART; the Cortex-M4 doesn't have a separate TCM interface.
I think it should be possible to add this to the linker scripts, but it might not be easy. The text would have to keep the LMA base at
Description
The stm32f746 and stm32f767 are almost identical, differing only in peripherals and cache size, so one would expect roughly the same benchmark results from the two cores. However, tests/bitarithm_timings shows widely different results.
nucleo-f746zg
nucleo-f767zi
Notably, bitarithm_msb is faster on one board and bitarithm_lsb is faster on the other. This is odd, considering that they have the same instruction set and that inspecting the binaries shows identical code for these functions.
Now after confirming this, I flashed the firmware built for the nucleo-f767zi on the nucleo-f746zg. The two boards and cores are similar enough to make this work. The other way around is also possible without any issues for this test application. In short, flashing firmware for board A on board B shows the same performance as flashing firmware for board A on board A. This also holds the other way around.
The compiled binaries for these two boards are almost identical. The only difference is the number of interrupts: 98 IRQ lines vs 110 IRQ lines. This shifts all function addresses a bit, so to easily compare the contents of the two firmware ELF files with each other, I removed the 12 extra IRQ lines on the stm32f767. With this, the two ELF files are almost identical; only two words differ. With only this difference remaining, I flashed the new binary on the boards, and voilà, matching measurements between the two boards. Changing the number of allocated IRQ handlers shifts all functions by a certain amount, causing different measurements. Increasing the number of handlers beyond what is useful also changes (and does not necessarily increase) the performance.
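For reference, a word-by-word comparison like the one described above can be sketched as follows. This is not the exact procedure I used; it assumes both firmwares have been converted to raw binary images of equal length (for example with objcopy -O binary), and the function name is made up for illustration.

```python
import struct


def diff_words(image_a: bytes, image_b: bytes):
    """Return (byte_offset, word_a, word_b) for each differing 32-bit word.

    Words are read as little-endian. Trailing bytes beyond the last full
    word are not compared.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images differ in length; word offsets would not line up")
    diffs = []
    for offset in range(0, len(image_a) - len(image_a) % 4, 4):
        (word_a,) = struct.unpack_from("<I", image_a, offset)
        (word_b,) = struct.unpack_from("<I", image_b, offset)
        if word_a != word_b:
            diffs.append((offset, word_a, word_b))
    return diffs


if __name__ == "__main__":
    # Tiny synthetic images standing in for the two firmware binaries.
    a = struct.pack("<4I", 1, 2, 3, 4)
    b = struct.pack("<4I", 1, 0xDEAD, 3, 0xBEEF)
    for offset, word_a, word_b in diff_words(a, b):
        print(f"offset 0x{offset:04x}: 0x{word_a:08x} != 0x{word_b:08x}")
```

Refusing images of different length is deliberate: once the vector tables differ in size, every later word is shifted and a plain word diff stops being meaningful, which is why the IRQ counts had to be made equal first.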
TL;DR (spoilers)
TL;DR: Modifying the number of allocated IRQ handlers changes the measured performance of tests/bitarithm_timings.
Steps to reproduce the issue
With tests/bitarithm_timings:
This should reproduce my numbers above.
Expected results
Identical measurements between the two MCUs
Actual results
Different results between the two MCUs