Hard-fault in NuttX? #11703
Comments
Here are the steps I did so far:
Not sure yet what to make of this. |
Oh, there is more below, thanks to @thomasgubler:
The program counter is pc:0x000000f0, which is out in the weeds (this may be a clue*). Then there are the stack and stack pointer, and it looks like you have the right idea on how to debug it.
I would argue that the poll struct has been corrupted. The last time I saw something like this, the cause was a struct offset differing between C and C++ because of the anonymous union. But that was a hard error (it happened on every boot). |
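To illustrate why a PC of 0x000000f0 reads as "in the weeds", here is a minimal sketch of a plausibility check on a faulting PC. The flash bounds and the helper name `pc_is_plausible` are assumptions for illustration (a typical STM32F4 internal flash window); the real bounds should come from the board's linker script.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed memory map, for illustration only: internal flash as commonly
 * mapped on STM32F4 parts. Take the real values from the linker script. */
#define FLASH_START 0x08000000u
#define FLASH_END   0x08200000u /* exclusive; assumes 2 MiB of flash */

/* Hypothetical helper: true if pc could point at application code
 * inside the assumed flash window. */
static bool pc_is_plausible(uint32_t pc)
{
    return pc >= FLASH_START && pc < FLASH_END;
}
```

Under these assumed bounds, `pc_is_plausible(0x000000f0u)` returns false, which is consistent with a corrupted return address rather than a fault at a real instruction.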
We tried to reproduce in SITL with valgrind, but nothing shows up. |
In this flight why was |
@nicovanduijn was there any other testing with this build? Any other testing of this exact revision on other hardware? If the vehicle survived could you try to reproduce the failure? |
@dagar the |
@nicovanduijn out of curiosity could you share a good complete log from the same vehicle flying, same binary, same config? |
@LorenzMeier you asked to check if the poll structure was initialized correctly and I've checked that. It looks correct. |
I would say we're here: but I don't know what that would mean. |
@davids5 Could you please comment on this? Thanks! |
@LorenzMeier I believe this was a memory overwrite issue. See comments above. |
@davids5 if it was a memory overwrite issue, do we have a clue which module would have caused it? |
There are two classes of things to look at: a wild pointer bug or a stack crash. For the latter, rebuild the code to print the stack allocation, or use the debugger and the NuttX macros to dump the TCBs and see where memory lies. The former requires duplicating the bug and putting a HW breakpoint on the suspected target of the wild write. |
@nicovanduijn - https://github.com/PX4/Firmware/tree/master_2ebb9d_v4_stackcheck will load (but it will not complete boot, nor get to a prompt). The value is that, when tested with the HW and the same params, we can see if something has a greedy stack as configured. Please load it on the platform, then report back what the console shows. If it is a hardfault we can debug it from there. If not, we can try to pare down the config to be able to do more real-world testing. |
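For the stack-crash class, one common way to see how close a task comes to its limit is stack painting: fill the stack with a known pattern before the task runs, then count how much of the pattern survives. The sketch below is a generic illustration, not the NuttX or PX4 implementation; the pattern value and both function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill pattern chosen for illustration; any value unlikely to occur in
 * real stack data works. */
#define STACK_PAINT 0xdeadbeefu

/* Fill the whole stack region with the pattern before the task starts. */
static void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_PAINT;
    }
}

/* Count words that still hold the pattern, scanning up from the lowest
 * address. On a descending stack these are the words never written, so a
 * small result means the task is close to overflowing. */
static size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t unused = 0;
    while (unused < words && base[unused] == STACK_PAINT) {
        unused++;
    }
    return unused;
}
```

A result of zero means the lowest word of the stack was overwritten, so the task has most likely run past its allocation into whatever memory lies below it.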
Ok, using the stackcheck build with that branch repeatedly boot-loops with the following printed on the terminal. (I'm not sure exactly where it started, this stuff scrolls by pretty fast)
Can you increase the pmw stack by ~200 bytes to make sure we're not seeing this as the result of stackcheck needing more resources? We still of course need to dig into this particular log to see if we find an offender. |
I believe this branch already has that courtesy of David, 540f890 |
No it doesn't. That increases stack for one module, you need to increase it for the pmw driver. |
@jkflying - This is a smoking gun: it is the PMW3901 driver that needs the stack size boosted |
@jkflying It can be an iterative and tedious process to debug a stack check build. There is a margin that is overly conservative; it is based on stacking and having a separate interrupt stack. But it nevertheless is highly diagnostic. |
Ok, after the stack bump from David and also setting the stack size on the cm8j driver (which wasn't set), I get full boots. |
This is probably a part that was missed when porting from the sf0x driver (have a look at its CMakeLists.txt): I indeed used that driver as reference for setting up the cm8jl65 one. |
@jkflying Could we please wrap up the testing on Monday, get any stack changes upstream and close this? Thanks! |
@LorenzMeier - current status: @jkflying and I did a debug session and evaluated the task memory layout. Nothing obvious stood out to first order, but it could have been the pmw3901 clobbering the mpu9250. He stated that when the system was left running over the weekend it was not uncommon to have hard faults. Using a build with the stack sizes increased, we left the system running with GDB and a breakpoint set on up_hardfault. Tuesday we will review. @dagar recommended we review all the SD card's hard_fault logs as well. I have asked that @jkflying email them to me on his return Tuesday. |
@cmic0 Just an FYI: there is a default of 1024. So it is not wrong per se, but it may not be adequate. @dagar we should look at what the minimum should now be with the added call layering in NuttX of xxx() -> nx_xxx(). |
With the increased stack sizes I didn't experience a hardfault over the 3 days I left it running, with otherwise exactly the same commit that gave us the hardfault before. Before (from looking at timestamps of the log files) it was faulting once every 8 - 12 hours, so I'm pretty confident this is solved now, even though we can't really prove a negative this way. Here are the earlier faults, it's possible they were on the same build but no guarantees (that's why we didn't attach them to the earlier report when we found them): |
I didn't dig in too deeply, but the hardfault location isn't consistent. This reasonably aligns with the current theory/solution.
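As a rough check on the three-day soak test reported above, assume (this model is not from the thread) that faults on the old build arrive as a Poisson process with a mean interval $\tau$ between 8 and 12 hours. The chance of seeing no fault in $T = 72$ hours purely by luck would then be:

$$
P(\text{no fault in } T) = e^{-T/\tau}, \qquad
e^{-72/12} = e^{-6} \approx 2.5 \times 10^{-3}, \qquad
e^{-72/8} = e^{-9} \approx 1.2 \times 10^{-4}
$$

Under that assumption a clean 72-hour run would be quite unlikely if nothing had changed, although this is not proof that the fix addresses the root cause.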
I loaded the build at 2ebb9d2 and had a poke around. I was originally assuming the poll struct got clobbered, but from what I see it looks like the restore context is what got clobbered. That still reasonably aligns with the current theory/solution. The acid test would be to back out the patch set, run it on that same HW, and prove that the fault happens without the fix and not with it. |
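To make the "clobbered restore context" idea concrete, here is a minimal sketch of the basic frame that Cortex-M hardware pushes on exception entry, with one sanity check on it. It covers only the hardware-stacked part (no FPU extension and none of the registers an RTOS saves in software), and the struct and function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Basic frame stacked by Cortex-M hardware on exception entry,
 * lowest address first. */
struct hw_exception_frame {
    uint32_t r0, r1, r2, r3;
    uint32_t r12;
    uint32_t lr;
    uint32_t pc;   /* return address restored on exception return */
    uint32_t xpsr; /* program status register */
};

/* Bit 24 of xPSR is the Thumb state bit; Cortex-M executes only Thumb code. */
#define XPSR_THUMB_BIT (1u << 24)

/* Hypothetical check: a frame whose xPSR has the Thumb bit clear cannot be
 * a valid return context and points to corruption of the saved state. */
static bool frame_has_thumb_state(const struct hw_exception_frame *f)
{
    return (f->xpsr & XPSR_THUMB_BIT) != 0;
}
```

If a neighbouring stack overflows into a saved frame like this, the restored pc or xpsr can be arbitrary data, which would fit both the nonsensical PC and the inconsistent fault locations.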
Describe the bug
We flew in position control testing some of the collision prevention capabilities (together with the avoidance on the companion), when suddenly the drone fell out of the sky.
Log Files and Screenshots
Firmware: 2ebb9d
Logfile
.elf file
Drone :
Quad X racer, with a pixracer
My hard-fault debugging skills are severely limited. The best I could tell is that it seemed to come from fs_poll.c, but I wouldn't trust myself about that. Maybe @bkueng or @julianoes are more qualified to debug this?