-
Notifications
You must be signed in to change notification settings - Fork 13.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UAVCAN performance audit #6829
UAVCAN performance audit #6829
Conversation
…e that are always zero. This allows the UAVCAN stack to always transfer only the minimum number of output values, avoiding redundant zeroes and the associated increase in bus load and CPU time
Looking at preliminary profiling results (profiling is still in progress) I noticed that the system performs a lot of 64-bit floating point arithmetic computations. I understand that in some cases it is intentional and perfectly valid, since the resolution of 32-bit float is really limited. However, certain cases look like a smoking gun to me. Have a look at this instance: Observe that a call stack rooted in the class Type norm() const {
const Vector &a(*this);
return Type(sqrt(a.dot(a)));
} It looks like the author of this code intended to use One might argue that if such an error took place, the compiler would trigger the carefully enabled warning There are dozens of similar suspicious instances where 64-bit soft FP routines are invoked (not only sqrt but also add, asin, mul, etc.) from where they probably shouldn't be. Have a look at this preliminary interactive SVG to get an idea: pmpn-flamegraph.tar.gz (tomorrow I will upload a higher resolution file - sampling is still in progress and it is slow). Paging @davids5. |
@pavel-kirienko has done some really good sleuthing herein. Can we look at avoiding the conversions and emulated FP overhead? |
Good stuff. Happy to make changes to matrix. Can you submit a PR if you get it pinned down? |
@pavel-kirienko I've changed the startup on our side so the CAN mixer only ever receives the multirotor component. So we should not have any overhead there. I'll check with @eyeam3 how many values the bus transports in our case. |
We have an octocopter setup, so the bus transports eight values. |
@pavel-kirienko Could you please fix the routines in matrix and send a PR? |
@LorenzMeier Already doing so. |
Here's the high resolution output: pmpn-flamegraph.tar.gz Here's some of the most common endpoint functions:
In this list, all routines prefixed with double underscore perform 64-bit floating point arithmetic. |
Way too much memset going on there. We might need to revisit our defensive programming strategies. |
Does anybody else find the 4% of Can DMA or at least an IRQ-driven logic be an option? |
That's a design decision originally done for a single sensor and lower-rate reads. The reason this is not an immediate concern from a controls perspective is that if you're not running at 100% load there is nothing you can do before that SPI update - you are waiting for a total system state update and unless you have new accel and gyro data there is no actionable information. No sensor data, no controller output and if there is no controller output to execute on there is nothing high-priority to do. For the same reason UART has RX DMA so that this prioritisation does not lead to communication drops. However, work is under way to longer-term replace this with SPI DMA. That will allow to run more CPU load on house-keeping tasks, but it will not improve the flight performance on current setups - because the actionable information does not become available before. The only reason SPI DMA can add value on future-generation systems is that you can go to higher oversampling of sensor data for new high-rate access and gyros. Please focus on the CPU load related aspects. These are the ones that delay executing actionable information. |
For the record here, the best performance can be achieved with |
Fixing the Matrix library brought CPU usage down by 10%. Before:
After:
|
I am going to profile this again. |
N.B.: This is the second instance when a serious issue was discovered by way of profiling. The first happened 2 years ago, when we removed profiling instrumentation calls that were used for stack checking, which dropped CPU usage down by 40%: #1660. I suspect the project could really benefit from adopting the practice of regular profiling. |
The conclusion that stack checking was a performance issue is incorrect. We removed an early stage safety net because the system was stable - it was a design intent to trade CPU load for safety. We might very well re-add it on the F7. In the case of Matrix lib it's a whole different story as it should not have promoted to double. I agree on the benefit of regular profiling. Could you create a page in the PX4 dev guide for it? |
And thanks for the optimization. We should bring Hans CI back online and compare task CPU loads. Any difference in regular load should reject or flag the PR. |
It was an issue in the sense that the actual performance hit (40%) was way above the expected penalty (20%). Profiling helped to discover that.
Yes, but I will have to push back on it now a bit. |
@davids5 The compiler seems to be underutilizing the FPU due to incorrect build configuration. I noticed that the fixed firmware still spends about 2% of the run time in unfiltered_phi_ = std::sqrt(square((Uq - Iq * Rs) / w) + square(Ud / w + Iq * Lq));
8024a70: edd4 5a08 vldr s11, [r4, #32]
8024a74: ed97 7a00 vldr s14, [r7]
8024a78: ed94 6a0a vldr s12, [r4, #40] ; 0x28
8024a7c: ee85 0a27 vdiv.f32 s0, s10, s15
8024a80: eea6 7ae5 vfms.f32 s14, s13, s11
8024a84: eea6 0a86 vfma.f32 s0, s13, s12
8024a88: eec7 6a27 vdiv.f32 s13, s14, s15
8024a8c: ee20 0a00 vmul.f32 s0, s0, s0
8024a90: eea6 0aa6 vfma.f32 s0, s13, s13
8024a94: eef1 7ac0 vsqrt.f32 s15, s0 ; <-- Computing square root here
8024a98: eef4 7a67 vcmp.f32 s15, s15
8024a9c: eef1 fa10 vmrs APSR_nzcv, fpscr
8024aa0: f040 80b8 bne.w 8024c14 <_ZN3foc8motor_id15FluxLinkageTask15onNextPWMPeriodERKN5Eigen6MatrixIfLi2ELi1ELi0ELi2ELi1EEERKNS3_IfLi3ELi1ELi0ELi3ELi1EEEf+0x414>
8024aa4: f504 5304 add.w r3, r4, #8448 ; 0x2100
8024aa8: edc3 7a01 vstr s15, [r3, #4] I have disassembled the FMUv4 binary and found out that the compiler emits exactly zero 080fc84c <__ieee754_sqrtf>:
80fc84c: ee10 3a10 vmov r3, s0
80fc850: f023 4200 bic.w r2, r3, #2147483648 ; 0x80000000
80fc854: f1b2 4fff cmp.w r2, #2139095040 ; 0x7f800000
80fc858: b470 push {r4, r5, r6}
80fc85a: d230 bcs.n 80fc8be <__ieee754_sqrtf+0x72>
80fc85c: b36a cbz r2, 80fc8ba <__ieee754_sqrtf+0x6e>
80fc85e: 2b00 cmp r3, #0
80fc860: db3d blt.n 80fc8de <__ieee754_sqrtf+0x92>
80fc862: f5b2 0f00 cmp.w r2, #8388608 ; 0x800000
80fc866: ea4f 50e3 mov.w r0, r3, asr #23
80fc86a: d32c bcc.n 80fc8c6 <__ieee754_sqrtf+0x7a>
80fc86c: f1a0 027f sub.w r2, r0, #127 ; 0x7f
80fc870: f3c3 0316 ubfx r3, r3, #0, #23
80fc874: 07d1 lsls r1, r2, #31
80fc876: f443 0300 orr.w r3, r3, #8388608 ; 0x800000
80fc87a: bf48 it mi
80fc87c: 005b lslmi r3, r3, #1
80fc87e: 2400 movs r4, #0
80fc880: 1056 asrs r6, r2, #1
80fc882: 005b lsls r3, r3, #1
80fc884: 4625 mov r5, r4
80fc886: 2119 movs r1, #25
80fc888: f04f 7280 mov.w r2, #16777216 ; 0x1000000
80fc88c: 18a8 adds r0, r5, r2
80fc88e: 4298 cmp r0, r3
80fc890: dc02 bgt.n 80fc898 <__ieee754_sqrtf+0x4c>
80fc892: 1a1b subs r3, r3, r0
80fc894: 1885 adds r5, r0, r2
80fc896: 4414 add r4, r2
80fc898: 3901 subs r1, #1
80fc89a: ea4f 0343 mov.w r3, r3, lsl #1
80fc89e: ea4f 0252 mov.w r2, r2, lsr #1
80fc8a2: d1f3 bne.n 80fc88c <__ieee754_sqrtf+0x40>
80fc8a4: b113 cbz r3, 80fc8ac <__ieee754_sqrtf+0x60>
80fc8a6: f004 0301 and.w r3, r4, #1
80fc8aa: 441c add r4, r3
80fc8ac: 1064 asrs r4, r4, #1
80fc8ae: f104 547c add.w r4, r4, #1056964608 ; 0x3f000000
80fc8b2: eb04 53c6 add.w r3, r4, r6, lsl #23
80fc8b6: ee00 3a10 vmov s0, r3
80fc8ba: bc70 pop {r4, r5, r6}
80fc8bc: 4770 bx lr
80fc8be: eea0 0a00 vfma.f32 s0, s0, s0
80fc8c2: bc70 pop {r4, r5, r6}
80fc8c4: 4770 bx lr
80fc8c6: f413 0200 ands.w r2, r3, #8388608 ; 0x800000
80fc8ca: d001 beq.n 80fc8d0 <__ieee754_sqrtf+0x84>
80fc8cc: e00c b.n 80fc8e8 <__ieee754_sqrtf+0x9c>
80fc8ce: 460a mov r2, r1
80fc8d0: 005b lsls r3, r3, #1
80fc8d2: 021c lsls r4, r3, #8
80fc8d4: f102 0101 add.w r1, r2, #1
80fc8d8: d5f9 bpl.n 80fc8ce <__ieee754_sqrtf+0x82>
80fc8da: 1a80 subs r0, r0, r2
80fc8dc: e7c6 b.n 80fc86c <__ieee754_sqrtf+0x20>
80fc8de: ee70 7a40 vsub.f32 s15, s0, s0
80fc8e2: ee87 0aa7 vdiv.f32 s0, s15, s15
80fc8e6: e7e8 b.n 80fc8ba <__ieee754_sqrtf+0x6e>
80fc8e8: f04f 32ff mov.w r2, #4294967295
80fc8ec: e7f5 b.n 80fc8da <__ieee754_sqrtf+0x8e>
80fc8ee: bf00 nop Should I move this into a separate issue? |
I will sum up my findings here, apart from what has been already said above. Poll is very slowAbout half of the run time of the UAVCAN driver is spent in the The poll system call must be optimized, I will report this as a separate issue. Performance counters are slowMany performance counters in the system do not worth the performance impact they introduce. I removed them from the UAVCAN driver, which lowered the CPU usage by 0.7%. I would advice to either review their implementation, or just remove them from all high frequency modules. My estimate is that this change alone would drop the CPU usage by up to 5%. MAVLinkAbout 1% of the total run time is spent in this check in if (dt > (interval - (_mavlink->get_main_loop_delay() / 10) * 4)) { The performance impact can be seen here on the right, it is significant: I wouldn't venture to fix it myself, so I propose to register this as a separate issue. FPU is misbehavingSee the previous post. I propose to move this issue into a separate ticket, too. UAVCAN stackThe UAVCAN stack itself does not have any detectable performance issues. |
New report; note that it was collected before perfcounters were removed from the UAVCAN module: pmpn-flamegraph.tar.gz |
Just talked with @davids5 about this, @pavel-kirienko Can you update matrix in this PR so we can check it passes and we can make sure the build size looks good. |
@jgoppert it is already updated here |
@pavel-kirienko we could look at enabling this clang-tidy check in the builds. |
@dagar That would be useful. |
I don't see a significant change in build size, so will merge the changes to matrix. build size of master
build size of this branch:
|
@jgoppert - Thank you - at least it went in the right directions. |
This is ready for merge. |
Apparently not yet. Please have a look at the submode commit. See Revert "UAVCAN performance audit" #6846 |
* UAVCAN ESC output: removing ESC output channels from published message that are always zero. This allows the UAVCAN stack to always transfer only the minimum number of output values, avoiding redundant zeroes and the associated increase in bus load and CPU time * Added a separate mixer file for CAN quadrotor * Sampling profiler improvements * PMSP: Output more endpoints * Matrix update * libc usage workaround * Removed UAVCAN perfcounters * Matrix submodule update
This probably happened because I merged the PR on matrix and deleted the old branch and he just needs to update the matrix submodule again. |
Yes, it's fixed already. |
I will be logging my efforts to improve the UAVCAN driver, especially in the sense of CPU load, in this pull request. Please do not merge it yet.
Considerable time has been spent on my fruitless attempts to squeeze the driver into the FMUv2 build by switching some optimization flags and removing some not-so-critical components. Unfortunately, during my experiments the system tended to either overflow ROM, fail in a confusing way during boot, or just refuse to perform basic functions such as connecting to QGC. At some point I gave up (with much regret) and replaced FMUv2 with FMUv4.
According to the reports from @LorenzMeier, a system consisting of Zubax GNSS + a Sapog ESC + FMUv1 creates a 12% CPU load in the UAVCAN driver. I am observing a slightly different number of 7.6% with my FMUv4 setup, perhaps the difference can be explained by a higher clock rate of the newer board.
Looking at the CAN bus traffic I noticed that the ESC command messages carry six values, whereas my setup requires only four. It turned out that the default CAN quadrotor mixer is configured to generate 6 outputs rather than 4, the extra 2 serving some obscure purpose I couldn't understand. In order to validate this, I defined a new mixer and made the CAN quadrotor vehicle config refer to it. Expectedly, this change reduced the number of values per command message from 6 to 4, which allowed the UAVCAN stack to send only one frame per message. This change has dropped the CPU usage from 7.6% to 5.7%.
@LorenzMeier @AndreasAntener I strongly advice to review your VTOL mixer files and make sure that no unnecessary outputs are being generated, they slow things down.
I also added a check in the UAVCAN ESC output bridge that removes the trailing zero setpoint values from the output. This should help to somewhat improve performance even when buggy mixing is afoot.
My next step is to let the system run for several hours under a sampling profiler in order to find bottlenecks. Has this ever been done before?