Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hard-fault in NuttX? #11703

Closed
nicovanduijn opened this issue Mar 22, 2019 · 32 comments
Closed

Hard-fault in NuttX? #11703

nicovanduijn opened this issue Mar 22, 2019 · 32 comments
Assignees

Comments

@nicovanduijn
Copy link
Contributor

nicovanduijn commented Mar 22, 2019

Describe the bug
We flew in position control testing some of the collision prevention capabilities (together with the avoidance on the companion), when suddenly the drone fell out of the sky.

Log Files and Screenshots
Firmware: 2ebb9d
Logfile
.elf file

Drone :
Quad X racer, with a pixracer

My hard-fault debugging skills are severely limited, best I could tell is that it seemed to come from fs_poll.c, but I wouldn't trust myself about that. Maybe @bkueng or @julianoes are more qualified to debug this?

@julianoes
Copy link
Contributor

Here are the steps I did so far:

  1. Open the elf file in gdb:
arm-none-eabi-gdb Downloads/px4_fmu-v4_default.elf
  1. Look for addresses starting with 0x08:
(gdb) info line *0x08004f57
Line 214 of "armv7-m/gnu/up_exception.S" starts at address 0x8004f56 <exception_common+46> and ends at 0x8004f58 <exception_common+48>.

(gdb) info line *0x080057f5
Line 94 of "armv7-m/up_doirq.c" starts at address 0x80057f4 <up_doirq+24> and ends at 0x80057f6 <up_doirq+26>.

(gdb) info line *0x080072c7
Line 224 of "irq/irq_dispatch.c" starts at address 0x80072c6 <irq_dispatch+22> and ends at 0x80072ce <irq_dispatch+30>.

(gdb) info line *0x08005775  
Line 148 of "armv7-m/up_hardfault.c" starts at address 0x800576c <up_hardfault+128> and ends at 0x80057a4 <up_svcall>.

(gdb) info line *0x0800760f
Line 204 of "semaphore/sem_wait.c" starts at address 0x800760e <nxsem_wait+82> and ends at 0x8007614 <nxsem_wait+88>.

(gdb) info line *0x08005a3b
Line 388 of "armv7-m/up_assert.c" starts at address 0x8005a3a <up_assert+442> and ends at 0x8005a3e <up_assert+446>.

(gdb) info line *0x0811edbf
Line 257 of "../../src/drivers/boards/common/board_crashdump.c" starts at address 0x811edbe <board_crashdump+138> and ends at 0x811edc4 <board_crashdump+144>.

(gdb) info line *0x08005be9
Line 155 of "machine/arm/armv7-m/gnu/arch_memcpy.S" starts at address 0x8005be8 <memcpy+8> and ends at 0x8005bea <memcpy+10>.

(gdb) info line *0x08005e79
Line 112 of "syslog/lib_syslog.c" starts at address 0x8005e78 <syslog+16> and ends at 0x8005e82 <cfgetspeed>.

(gdb) info line *0x08005e5d
Line 84 of "syslog/lib_syslog.c" starts at address 0x8005e5c <vsyslog+24> and ends at 0x8005e68 <syslog>.

(gdb) info line *0x0800ff49
Line 71 of "stdio/lib_libnoflush.c" starts at address 0x800ff48 <lib_noflush> and ends at 0x800ff4c <lib_mode2oflags>.

(gdb) info line *0x0800ff49
Line 71 of "stdio/lib_libnoflush.c" starts at address 0x800ff48 <lib_noflush> and ends at 0x800ff4c <lib_mode2oflags>.

(gdb) info line *0x08006b2d
Line 61 of "syslog/syslog_emergstream.c" starts at address 0x8006b2c <emergstream_putc> and ends at 0x8006b32 <emergstream_putc+6>.

(gdb) info line *0x08006ae7
Line 160 of "syslog/vsyslog.c" starts at address 0x8006ae6 <nx_vsyslog+26> and ends at 0x8006aea <nx_vsyslog+30>.

(gdb) info line *0x08005851
Line 132 of "armv7-m/up_assert.c" starts at address 0x8005850 <up_taskdump> and ends at 0x8005856 <up_taskdump+6>.

(gdb) info line *0x0800ff49
Line 71 of "stdio/lib_libnoflush.c" starts at address 0x800ff48 <lib_noflush> and ends at 0x800ff4c <lib_mode2oflags>.

(gdb) info line *0x0800ff29
Line 53 of "stdio/lib_nulloutstream.c" starts at address 0x800ff28 <nulloutstream_putc> and ends at 0x800ff2e <nulloutstream_putc+6>.

Line 248 of "/home/nico/src/Firmware/build/px4_fmu-v4_default/NuttX/nuttx/include/arch/armv7-m/irq.h" starts at address 0x8004668 <up_restoreusartint+16>
   and ends at 0x800466e <up_restoreusartint+22>.

(gdb) info line *0x0800fbfb
Line 1208 of "stdio/lib_libvsprintf.c" starts at address 0x800fbfa <lib_vsprintf+134> and ends at 0x800fc00 <lib_vsprintf+140>.

(gdb) info line *0x08006b2d
Line 61 of "syslog/syslog_emergstream.c" starts at address 0x8006b2c <emergstream_putc> and ends at 0x8006b32 <emergstream_putc+6>.

(gdb) info line *0x08006b39
Line 75 of "syslog/syslog_emergstream.c" starts at address 0x8006b38 <emergstream_putc+12> and ends at 0x8006b3c <emergstream_putc+16>.

(gdb) info line *0x08004dc9
Line 2927 of "chip/stm32_serial.c" starts at address 0x8004dc8 <up_putc+40> and ends at 0x8004dd4 <stm32_pwr_enablebkp>.

(gdb) info line *0x08004669
Line 248 of "/home/nico/src/Firmware/build/px4_fmu-v4_default/NuttX/nuttx/include/arch/armv7-m/irq.h" starts at address 0x8004668 <up_restoreusartint+16>
   and ends at 0x800466e <up_restoreusartint+22>.

(gdb) info line *0x08006b39
Line 75 of "syslog/syslog_emergstream.c" starts at address 0x8006b38 <emergstream_putc+12> and ends at 0x8006b3c <emergstream_putc+16>.

(gdb) info line *0x08004dc9
Line 2927 of "chip/stm32_serial.c" starts at address 0x8004dc8 <up_putc+40> and ends at 0x8004dd4 <stm32_pwr_enablebkp>.

(gdb) info line *0x08004669
Line 248 of "/home/nico/src/Firmware/build/px4_fmu-v4_default/NuttX/nuttx/include/arch/armv7-m/irq.h" starts at address 0x8004668 <up_restoreusartint+16>
   and ends at 0x800466e <up_restoreusartint+22>.

(gdb) info line *0x08134b01
Line 289 of "../../src/lib/perf/perf_counter.c" starts at address 0x8134afc <perf_end+28> and ends at 0x8134b0c <perf_end+44>.

(gdb) info line *0x080416d3
Line 173 of "../../src/drivers/imu/mpu9250/mpu9250_spi.cpp" starts at address 0x80416d2 <MPU9250_SPI::read(unsigned int, void*, unsigned int)+76>
   and ends at 0x80416d8 <MPU9250_SPI::read(unsigned int, void*, unsigned int)+82>.

(gdb) info line *0x08005be9
Line 155 of "machine/arm/armv7-m/gnu/arch_memcpy.S" starts at address 0x8005be8 <memcpy+8> and ends at 0x8005bea <memcpy+10>.

(gdb) info line *0x08120a75
Line 189 of "../../src/lib/drivers/device/nuttx/SPI.cpp" starts at address 0x8120a74 <device::SPI::_transfer(unsigned char*, unsigned char*, unsigned int)+88>
   and ends at 0x8120a7a <device::SPI::transfer(unsigned char*, unsigned char*, unsigned int)>.

(gdb) info line *0x0811d6d9
Line 127 of "../../boards/px4/fmu-v4/src/spi.c" starts at address 0x811d6d8 <stm32_spi1select> and ends at 0x811d6e4 <stm32_spi1select+12>.

(gdb) info line *0x0811cdc5
Line 1615 of "chip/stm32_spi.c" starts at address 0x811cdc4 <spi_exchange+36> and ends at 0x811cdc8 <spi_exchange+40>.

Not sure yet what to make of this.

@julianoes julianoes added the bug label Mar 22, 2019
@julianoes
Copy link
Contributor

Oh, there is more below, thanks to @thomasgubler:

(gdb) info line *0x0817d8a4
No line number information available for address 0x817d8a4 <__orb_actuator_controls_0>

(gdb) info line *0x08153e78
No line number information available for address 0x8153e78 <_ZTV26MulticopterAttitudeControl+40>

(gdb) info line *0x08153e58
No line number information available for address 0x8153e58 <_ZTV26MulticopterAttitudeControl+8>

(gdb) info line *0x080aebe9
Line 168 of "../../src/platforms/px4_module.h" starts at address 0x80aebe8 <ModuleBase<MulticopterAttitudeControl>::run_trampoline(int, char**)>
   and ends at 0x80aebea <ModuleBase<MulticopterAttitudeControl>::run_trampoline(int, char**)+2>.

(gdb) info line *0x080aec07
Line 170 of "../../src/platforms/px4_module.h" starts at address 0x80aec06 <ModuleBase<MulticopterAttitudeControl>::run_trampoline(int, char**)+30>
   and ends at 0x80aec08 <ModuleBase<MulticopterAttitudeControl>::run_trampoline(int, char**)+32>.

(gdb) info line *0x08125b3b  
Line 508 of "vfs/fs_poll.c" starts at address 0x8125b3a <poll+66> and ends at 0x8125b3e <poll+70>

(gdb) info line *0x08126343
Line 158 of "semaphore/sem_tickwait.c" starts at address 0x8126342 <nxsem_tickwait+86> and ends at 0x8126346 <nxsem_tickwait+90>.

@thomasgubler
Copy link
Contributor

@dagar @davids5 FYI

@davids5
Copy link
Member

davids5 commented Mar 23, 2019

the Program counter is pc:0x000000f0 - out in the weeds (this is maybe a clue*)
The Link register (return address) is lr:0x0800760f - this MAY be valid - what is there in the code?

Then there is the stack and stack pointer and it looks like you have the right idea on how to debug it.

User sp memory region, stack pointer lies within stack
0x20019c09 0x00000000
0x20019c08 0x07178e98
0x20019c07 0x00000101
0x20019c06 0x01010000
0x20019c05 0x00010100
0x20019c04 0x00010001
0x20019c03 0x00000000
0x20019c02 0x070c13b7
0x20019c01 0x00000001
0x20019c00 0x03010001
0x20019bff 0x03000000
0x20019bfe 0x00000000
0x20019bfd 0x00000000
0x20019bfc 0x00000000
0x20019bfb 0x00000000
0x20019bfa 0x00000000
0x20019bf9 0x00000000
0x20019bf8 0x00000000
0x20019bf7 0x00000000
0x20019bf6 0x3f0245ae
0x20019bf5 0x00000000
0x20019bf4 0x3f274e9d
0x20019bf3 0x00000000
0x20019bf2 0x07174070
0x20019bf1 0xbe200f8f
0x20019bf0 0x00000000
0x20019bef 0x00000000
0x20019bee 0xbba98722
0x20019bed 0x3e56806d
0x20019bec 0xbee7e3fc
0x20019beb 0x00000000
0x20019bea 0x07178f48
0x20019be9 0x00000000
0x20019be8 0x00000000
0x20019be7 0x00000001
0x20019be6 0xbe200f8f
0x20019be5 0x00000000
0x20019be4 0x00000000
0x20019be3 0x3f723b93
0x20019be2 0x3b56f43e
0x20019be1 0x3b97814e
0x20019be0 0xbea59dd7
0x20019bdf 0x00000000
0x20019bde 0xc01ee65f
0x20019bdd 0xbc322062
0x20019bdc 0x3b52c598
0x20019bdb 0x00000000
0x20019bda 0x07177beb
0x20019bd9 0x20016204
0x20019bd8 0xba1b81be
0x20019bd7 0xb70f750b
0x20019bd6 0xb7730e72
0x20019bd5 0x3f7ffffe
0x20019bd4 0xbf71ede6
0x20019bd3 0xbd29cae4
0x20019bd2 0xbc0c9525
0x20019bd1 0x3ea5ff56
0x20019bd0 0xbd1feb96
0x20019bcf 0x3e84e086
0x20019bce 0xbcbadb26
0x20019bcd 0x00000000
0x20019bcc 0x07177e37
0x20019bcb 0x3f800000
0x20019bca 0x0817d8a4
0x20019bc9 0x20013980
0x20019bc8 0x1000fc40
0x20019bc7 0x200133e0
0x20019bc6 0x1000e480
0x20019bc5 0x20013750
0x20019bc4 0x00000000
0x20019bc3 0x00000002
0x20019bc2 0x00000011
0x20019bc1 0x00000010
0x20019bc0 0x0000000f
0x20019bbf 0x0000000e
0x20019bbe 0xffffffff
0x20019bbd 0x0000000d
0x20019bbc 0x0000000c
0x20019bbb 0x0000000b
0x20019bba 0x0000000a
0x20019bb9 0x00000009
0x20019bb8 0x00000008
0x20019bb7 0x00000007
0x20019bb6 0x00000006
0x20019bb5 0x00000005
0x20019bb4 0x00000004
0x20019bb3 0x00000003
0x20019bb2 0x00000000
0x20019bb1 0x00000000
0x20019bb0 0x08153e78
0x20019baf 0x20003400
0x20019bae 0x08153e58
0x20019bad 0x800006b0
0x20019bac 0x00000560
0x20019bab 0x00000000
0x20019baa 0xdeadbeef
0x20019ba9 0xdeadbe00
0x20019ba8 0x74726174
0x20019ba7 0x73006c6f
0x20019ba6 0x72746e6f
0x20019ba5 0x635f7474
0x20019ba4 0x615f636d
0x20019ba3 0x00000000
0x20019ba2 0x20019c97
0x20019ba1 0x20019c88
0x20019ba0 0xdeadbeef
0x20019b9f 0x00000000
0x20019b9e 0x00000000
0x20019b9d 0x08008169
0x20019b9c 0x00000000
0x20019b9b 0x200134d0
0x20019b9a 0x080aebe9
0x20019b99 0x080aec07
0x20019b98 0x00000000
0x20019b97 0x00000000
0x20019b96 0x00000000
0x20019b95 0x00000000
0x20019b94 0x00000000
0x20019b93 0x00000000
0x20019b92 0x00000000
0x20019b91 0x20005168
0x20019b90 0x00000000
0x20019b8f 0x00000000
0x20019b8e 0x00000000
0x20019b8d 0x00000000
0x20019b8c 0x00000000
0x20019b8b 0x00000000
0x20019b8a 0x00000000
0x20019b89 0x00000000
0x20019b88 0x00000000
0x20019b87 0x20019174
0x20019b86 0x00000101
0x20019b85 0x20019b94
0x20019b84 0x00000000
0x20019b83 0x0000000c
0x20019b82 0x3f800000
0x20019b81 0x00000000
0x20019b80 0x00000000
0x20019b7f 0x00000000
0x20019b7e 0x3f800000
0x20019b7d 0x00000000
0x20019b7c 0x00000000
0x20019b7b 0x00000000
0x20019b7a 0x3f800000
0x20019b79 0x00000000
0x20019b78 0x004c4753
0x20019b77 0x080aea6b
0x20019b76 0x00000000
0x20019b75 0x080af5c7
0x20019b74 0x20019ce8
0x20019b73 0x000000c3
0x20019b72 0x00000000
0x20019b71 0x07178e7d
0x20019b70 0x00000000
0x20019b6f 0x07178e7d
0x20019b6e 0x00000001
0x20019b6d 0x20019cb0
0x20019b6c 0x0000000e
0x20019b6b 0x08010000
0x20019b6a 0x00000000
0x20019b69 0x00000000
0x20019b68 0x00000000
0x20019b67 0x20010000
0x20019b66 0x0000260b
0x20019b65 0x00000001
0x20019b64 0x07178e7d
0x20019b63 0x08125b3b
0x20019b62 0x00000064
0x20019b61 0x20019c04
0x20019b60 0x20019b94
0x20019b5f 0x00000001
0x20019b5e 0x00000000
0x20019b5d 0x20019c04
0x20019b5c 0x00000064
0x20019b5b 0x0001d113
0x20019b5a 0x00000149
0x20019b59 0x08126343
0x20019b58 0x00000064<-- User sp

I would argue that the poll struct has been corrupted.

The last time I saw something this is was the offset being off from C to C++ because of the anonymous union. But that was a hard error (happened every boot).

@bkueng
Copy link
Member

bkueng commented Mar 25, 2019

We tried to reproduce in SITL with valgrind, but nothing shows up.

@dagar
Copy link
Member

dagar commented Mar 25, 2019

In this flight why was SDLOG_PROFILE set to 1? The mc_att_control module polls the gyro specified by sensor_selection, but it's not in the log (it is by default though).

@dagar
Copy link
Member

dagar commented Mar 25, 2019

@nicovanduijn was there any other testing with this build? Any other testing of this exact revision on other hardware? If the vehicle survived could you try to reproduce the failure?

@nicovanduijn
Copy link
Contributor Author

nicovanduijn commented Mar 26, 2019

@dagar the SDLOG_PROFILE we set that way because we are overwhelming our serial link with log streaming, which led to ~60% message drops.
We've flow this exact configuration successfully many times (I'd say around 10 missions) without any issues. So I guess it's more of a tail-end occurrence and won't be easy to reproduce

@dagar
Copy link
Member

dagar commented Mar 26, 2019

@nicovanduijn out of curiosity could you share a good complete log from the same vehicle flying, same binary, same config?

@julianoes
Copy link
Contributor

@LorenzMeier you asked to check if the poll structure was initialized correctly and I've checked that. It looks correct.

@nicovanduijn
Copy link
Contributor Author

@dagar Yes, same day we flew three times:
Flight 1
Flight 2
Flight 3

@julianoes
Copy link
Contributor

@davids5

The Link register (return address) is lr:0x0800760f - this MAY be valid - what is there in the code?

I would say we're here:
https://github.com/PX4/NuttX/blob/px4_firmware_nuttx-7.28%2B/fs/vfs/fs_poll.c#L505

but I don't know what that would mean.

@LorenzMeier
Copy link
Member

@davids5 Could you please comment on this? Thanks!

@davids5
Copy link
Member

davids5 commented Apr 1, 2019

@LorenzMeier I believe this was a memory overwrite issue. See comments above.

@julianoes
Copy link
Contributor

julianoes commented Apr 2, 2019

@davids5 if it was a memory overwrite issue, do we have a clue which module would have caused it?

@davids5
Copy link
Member

davids5 commented Apr 2, 2019

There are 2 classes of things to look at. Wild pointer bug or stack crash. For the latter rebuild the code to print the stack allocation or use the debugger and nuttx macros to dump the tcb's and see where memory lies. The former requires duplicating the bug. Putting a HW breakpoint on the suspected target of the wild write.

@davids5 davids5 self-assigned this Apr 3, 2019
@davids5
Copy link
Member

davids5 commented Apr 4, 2019

@nicovanduijn - https://github.com/PX4/Firmware/tree/master_2ebb9d_v4_stackcheck will load (not boot complete, nor get to a prompt) The value is that when tested with the HW and same params, we can see if something has a greedy stack as configured. Please load it on the platform. Then report back what the console shows. If it is a hardfault we can debug it from there. If not we can try to pare down the the config to be able to do more real world test.

@jkflying
Copy link
Contributor

jkflying commented Apr 5, 2019

Ok, using the stackcheck build with that branch repeatedly boot-loops with the following printed on the terminal.

.elf file

(I'm not sure exactly where it started, this stuff scrolls by pretty fast)

WARN  [rgbled] no RGB led on bus #1
WARN  [rgbled_ncp5623c] no RGB led on bus #1
nsh: rgbled_pwm: command not found
WARN  [blinkm] I2C init failed
WARN  [blinkm] init failed
ERROR [param] Parameter SYS_USE_IO not found
Board sensors: /etc/init.d/rc.board_sensors
reset done, 50 ms
HMC5883_I2C on I2C bus 1 at 0x1e (bus: 100 KHz, max: 400 KHz)
WARN  [lis3mdl] no device on bus 2
INFO  [ist8310] no device on bus 1
BMP280_I2C on I2C bus 1 at 0x76 (bus: 100 KHz, max: 100 KHz)
WARN  [bmp280] id of your baro is not: 0x58
WARN  [bmp280] bus option number is 2
ERROR [bmp280] driver start failed
ERROR [bmm150] Internal I2C not available
HMC5883_SPI on SPI bus 1 at 5 (11000 KHz)
WARN  [hmc5883] no device on bus 1 (type: 3)
WARN  [lis3mdl] no device on bus 2
LIS3MDL_SPI on SPI bus 1 at 7 (11000 KHz)
INFO  [lis3mdl] Poll rate set to max (80hz)
MPU6000 on SPI bus 1 at 11 (1000 KHz)
INFO  [mpu9250] Bus probed: 3
MPU9250 on SPI bus 1 at 4 (1000 KHz)
INFO  [mpu9250] accel cutoff set to 30.00 Hz
INFO  [mpu9250] gyro cutoff set to 100.00 Hz
PMW3901 up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 65 task: pmw3901
up_registerdump: R0: 08009115 00000004 e000ed04 80000000 200033e8 08009199 10005130 08008edd
up_registerdump: R8: 08007ed3 10009ba4 100099e0 1000b034 fffffff8 10009aa8 08009125 08005620
up_registerdump: xPSR: 81000000 BASEPRI: 00000080 CONTROL: 00000000
up_registerdump: EXC_RETURN: ffffffe9
up_dumpstate: sp:     20003278
up_dumpstate: IRQ stack:
up_dumpstate:   base: 200032f0
up_dumpstate:   size: 000002e8
up_dumpstate:   used: 00000254
up_stackdump: 20003260: 20003278 10009e28 0000048c 10005130 20003278 10009e28 0000048c 10005130
up_stackdump: 20003280: 200032f0 080066b5 1000528c 0818ab90 00000041 00000002 100099d4 00000000
up_stackdump: 200032a0: 1000b034 0800681d 00000041 1000528c 00000002 080089e5 080089c9 08008995
up_stackdump: 200032c0: 200032f0 08006411 0800567b 0800643d 00000080 100099d4 08009199 10005130
up_stackdump: 200032e0: 08008edd 08007ed3 10009ba4 0800567b 100099d4 00000001 00000000 00000000
up_dumpstate: sp:     10009aa8
up_dumpstate: User stack:
up_dumpstate:   base: 10009e28
up_dumpstate:   size: 0000048c
up_dumpstate:   used: 00000458
up_stackdump: 10009aa0: 00000010 deadbeef 10005130 080091d3 200033e8 08009199 08009207 10005130
up_stackdump: 10009ac0: 200033e8 080091d3 080091ed 08008e77 200033e8 08009207 00000000 10005130
up_stackdump: 10009ae0: 000000f0 08008e77 200033e8 08008449 0000007b 080082ed 08007ed3 08008edd
up_stackdump: 10009b00: ffffffff 200033e8 08008329 08008449 0000003e 00000030 200033e8 08008329
up_stackdump: 10009b20: 1000b035 080082bd 00000027 200033e8 10004ed0 080082ed 080082a5 0800808b
up_stackdump: 10009b40: 00000027 080082bd 00000000 08008121 1000b025 0800808b 00000000 1000b025
up_stackdump: 10009b60: 1000ac90 08008121 000000f0 1000b025 00000000 1000ac90 10004ed0 00000000
up_stackdump: 10009b80: 1000b020 0816f0db 1000b040 081d0998 1000b020 1000b040 081db9b0 08007ed3
up_stackdump: 10009ba0: 1000b020 00000000 000001b6 00000000 1000b040 080299b7 00000000 0816f0db
up_stackdump: 10009bc0: 00000000 00000000 08028cff 080299b7 0000004b 00000001 10004e20 10004e30
up_stackdump: 10009be0: 00000000 0000004b 00000001 0000004b 6a626f2f 6275732f 74737973 695f6d65
up_stackdump: 10009c00: 306f666e 08028d00 10009c68 08019717 00000000 10009c6d 00000000 1000ac90
up_stackdump: 10009c20: fffffffe 08019833 10009c68 08008671 100045b0 08028d9f 081db9b0 00000000
up_stackdump: 10009c40: 0000004b 100045b0 00000000 08028cff ffffffff 00000000 081db9b0 08028e05
up_stackdump: 10009c60: 00000001 08028d9f 6a626f2f 6275732f 74737973 695f6d65 306f666e 00000000
up_stackdump: 10009c80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 10009ca0: 00000000 00000000 00000000 00000000 081db9b0 0816affd 100045b0 00000001
up_stackdump: 10009cc0: 10009d40 08028e05 0000004b 0800adad 0806908d 08008d7f 00000000 0816afd1
up_stackdump: 10009ce0: 0816b035 100045b0 081db9b0 10009d40 00000001 0816affd 0000004b 00000001
up_stackdump: 10009d00: 10009d40 0816b015 08069039 081db9b0 10009d40 10008dd4 000f4243 0816b035
up_stackdump: 10009d20: 00000000 10009e60 00000001 0806953f 00002000 08069039 00000001 08009031
up_stackdump: 10009d40: 0030fd03 00000000 00000040 00000000 00010101 00000000 10009e60 0816f54f
up_stackdump: 10009d60: 000f4243 0806953f 0816f529 080194f3 10008dd4 08069579 00002000 0816f54f
up_stackdump: 10009d80: 0816f529 10008dd4 0801954b 00002000 000f4243 00000002 081999f5 080194f3
up_stackdump: 10009da0: 08023ead 00000000 080692cd 00002000 000f4243 0801954b 00000003 10008dd4
up_stackdump: 10009dc0: 000f4243 20004440 00000000 00000000 00000004 080692cd 00000000 10009e40
up_stackdump: 10009de0: 0800a0c1 00000004 10009e2c 08069407 10009dfc 00000002 00000001 10009e40
up_stackdump: 10009e00: 10008d30 10005130 00000000 00000000 00000000 00000000 00000000 0800a0c1
up_stackdump: 10009e20: 00000000 00000000 deadbeef 10009e38 10009e40 00000000 33776d70 00313039
up_taskdump: Idle Task: PID=0 Stack Used=0 of 0
up_taskdump: hpwork: PID=1 Stack Used=920 of 1780
up_taskdump: lpwork: PID=2 Stack Used=872 of 1780
up_taskdump: init: PID=3 Stack Used=2176 of 2604
up_taskdump: dataman: PID=17 Stack Used=840 of 1180
up_taskdump: pmw3901: PID=123 Stack Used=1112 of 1164
[boot] Fault Log info File No 4 Length 3177 flags:0x01 state:0
[boot] Fault Logged on 2019-04-05-08:47:06 - Valid
[boot] There is a hard fault logged. Hold down the SPACE BAR, while booting to review!
[boot] 8266_GPIO2 - Low Initialize SPI port 4
sercon: Registering CDC/ACM serial driver
sercon: Successfully registered the CDC/ACM serial driver
HW arch: PX4_FMU_V4
FW git-hash: 540f890faa11bfc004860e59fec57b7e4a037ee4
FW version: 1.9.0 80 (17367168)
FW git-branch: master_2ebb9d_v4_stackcheck
OS: NuttX
OS version: Release 7.28.0 (119275775)
OS git-hash: 1f2f6d8cbe7c85f30fd75ec53c20566e91b14f21
Build datetime: Apr  5 2019 10:45:28
Build uri: localhost
Toolchain: GNU GCC, 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
PX4GUID: 0001000000003737343033355117001c0034
MCU: STM32F42x, rev. 3
[hardfault_log] Fault Log info File No 4 Length 3177 flags:0x01 state:0
[hardfault_log] Fault Logged on 2019-04-05-08:47:06 - Valid
INFO  [tune_control] Publishing standard tune 2
[hardfault_log] Saving Fault Log file /fs/microsd/fault_2019_04_05_08_47_06.log
[hardfault_log] Fault Log is Armed
[hardfault_log] Done saving Fault Log file
[hardfault_log] No ULog to append to
INFO  [param] selected parameter default file /fs/mtd_params
Board defaults: /etc/init.d/rc.board_defaults
INFO  [dataman] Unknown restart, data manager file '/fs/microsd/dataman' size is 362560 bytes
WARN  [rgbled] no RGB led on bus #1
WARN  [rgbled_ncp5623c] no RGB led on bus #1
nsh: rgbled_pwm: command not found
WARN  [blinkm] I2C init failed
WARN  [blinkm] init failed
ERROR [param] Parameter SYS_USE_IO not found
Board sensors: /etc/init.d/rc.board_sensors
reset done, 50 ms
HMC5883_I2C on I2C bus 1 at 0x1e (bus: 100 KHz, max: 400 KHz)

@LorenzMeier
Copy link
Member

Can you increase the pmw stack by ~200 bytes to make sure we're not seeing this as the result of stackcheck needing more resources? We still of course need to dig into this particular log to see if we find an offender.

@jkflying
Copy link
Contributor

jkflying commented Apr 5, 2019

I believe this branch already has that courtesy of David, 540f890

@LorenzMeier
Copy link
Member

No it doesn't. That increases stack for one module, you need to increase it for the pmw driver.

@davids5
Copy link
Member

davids5 commented Apr 5, 2019

@jkflying - This is a smoking gun: it is the PMW3901 driver that needs the stack size boosted

@davids5
Copy link
Member

davids5 commented Apr 5, 2019

@jkflying see f27c56e
Please retest.

@davids5
Copy link
Member

davids5 commented Apr 5, 2019

@jkflying It can be an iterative and tedious process to debug stack check build. There is a margin, that is overly conservative. This is based on stacking and having a separate interrupt stack. But it nevertheless is highly diagnostic.

@jkflying
Copy link
Contributor

jkflying commented Apr 5, 2019

Ok, after the stack bump from David and also setting the stack size on the cm8j driver (which wasn't set), I get full boots.

@cmic0
Copy link
Contributor

cmic0 commented Apr 5, 2019

Ok, after the stack bump from David and also setting the stack size on the cm8j driver (which wasn't set), I get full boots.

this is probably a missing part ported from the sf0x driver (have a look at its CMakeLists.txt ): I indeed used that driver as reference for setting up the cm8jl65 one.
Maybe should be added even there? (Although not properly related to this hard-fault )

@LorenzMeier
Copy link
Member

@jkflying Could we please wrap up the testing on Monday, get any stack changes upstream and close this? Thanks!

@davids5
Copy link
Member

davids5 commented Apr 6, 2019

@LorenzMeier - current status:

@jkflying and I did a debug session and evaluated the task memory layout. Nothing firstoreder obvious stood out but it could have been the pmw3901 clobbering the mpu9250. He stated that when the system was left running over the week end it was not uncommon to have hard faults. Using a build with the stack sizes increased we left the system running with GDB and a break set on up_hardfault. Tuesday we will review. @dagar recommended we review all the SD cards hard_fault logs as well. I have asked that @jkflying email them to me on his return Tuesday.

@davids5
Copy link
Member

davids5 commented Apr 6, 2019

@cmic0 Just an FYI there is a default of 1024. So it is not wrong per se, but my not be adequate @dagar we should look at the what the minimum should now be with the added call layering in Nuttx of xxx() -> nx_xxx()

@jkflying
Copy link
Contributor

jkflying commented Apr 9, 2019

With the increased stack sizes I didn't experience a hardfault over the 3 days I left it running, with otherwise exactly the same commit that gave us the hardfault before. Before (from looking at timestamps of the log files) it was faulting once every 8 - 12 hours, so I'm pretty confident this is solved now, even though we can't really prove a negative this way.

Here are the earlier faults, it's possible they were on the same build but no guarantees (that's why we didn't attach them to the earlier report when we found them):
hard-faults.zip

@dagar
Copy link
Member

dagar commented Apr 9, 2019

Here are the earlier faults, it's possible they were on the same build but no guarantees (that's why we didn't attach them to the earlier report when we found them):

I didn't dig in too deeply, but the hardfault location isn't consistent. This reasonably aligns with the current theory/solution.

  • navigator
  • ekf2
  • fmu
  • mc_att_control
  • mc_pos_control

@davids5
Copy link
Member

davids5 commented Apr 10, 2019

I loaded the build at 2ebb9d2 and had a poke around.

CrashCapture

I was originally assuming the poll struct got clobbered. But from what I see it looks like the restore context is what got clobbered. and still does reasonably align with the current theory/solution.

The Acid test would be to back out the patch set and run it on that same HW and prove it will happen without the fix and not with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants