Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add post mortem debugging #20492

Merged

Conversation

X-Ryl669
Copy link
Contributor

@X-Ryl669 X-Ryl669 commented Dec 17, 2020

Requirements

This is a generic feature for Marlin, in link with issue #20477.
Without such feature, development or simply bug fixing is a real PITA or completely impossible.

Description

This adds a generic mechanism to hook into CPU exceptions (like hard fault, memory fault, bus fault, usage fault, etc...) and when any happen, it captures the CPU state (stack, registers, callstack) and either break under a debugger if any is attached and/or dump those on the main serial output. Care is taken not to trigger another fault when it happens.
Hooking the vector table is done at runtime (not in flash) since many platform don't support this overloading.
Update: libmaple does not like this, so it's done at compile time now.
In the future, it could even be "installed" at runtime upon changing the debug level via M111.

Please notice that this feature partially existed in LPC1768 and DUE but with many code duplication (platform are very different). I've made them more generic and support for STM32F1 is added (my board is a STM32F1).

Work is still required for other CPU type, but the work will be minimal for for ARMv7 CPU (most of the generic code is in place, only the serial device configuration must be written).
I don't know if it's possible to write a driver for AVR, but it's probably possible for ESP32 (I once written a GDBstub for this CPU, it's not too hard).

For debugging with a ST-Link V2 probe, you can check my tutorial here

Benefits

You'll get this in the serial console in case of a bug:

start
PowerUp
Marlin bugfix-2.0.x

echo: Last Updated: 2020-12-16 | Author: X-Ryl669
echo:Compiled: Dec 17 2020
echo: Free Memory: 8119  PlannerBufferBytes: 1280
echo:SD card ok


## Software Fault detected ##
Cause: Hard
R0   : 0x00000006
R1   : 0x20002400
R2   : 0x0000007B
R3   : 0x20002549
R12  : 0x00000000
LR   : 0x0800EE9B
PC   : 0x0800F368
PSR  : 0x01000000
CFSR : 0x00000400
HFSR : 0x40000000
DFSR : 0x0000000B
AFSR : 0x00000000
MMAR : 0xE000ED34
BFAR : 0xE000ED38
ExcLR: 0xFFFFFFF9
ExcSP: 0x2000FF90
Backtrace:#1 : unknown@0x080073F8+32624 PC:0x0800F368

You'll get this in GDB if you are attached while a bug happens:

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0800eb3a in CommonHandler_C (sp=0x2000ff90, lr=4294967289, cause=1) at Marlin/src/HAL/shared/cpu_exception/exception_arm.cpp:78
78        HALT_IF_DEBUGGING();
(gdb) bt
#0  0x0800eb3a in CommonHandler_C (sp=0x2000ff90, lr=4294967289, cause=1) at Marlin/src/HAL/shared/cpu_exception/exception_arm.cpp:78
#1  <signal handler called>
#2  0x0800f368 in setup () at Marlin/src/MarlinCore.cpp:1345
#3  0x0800ac56 in main () at ~/.platformio/packages/framework-arduinoststm32-maple/STM32F1/cores/maple/main.cpp:38
(gdb) up 2
#2  0x0800f368 in setup () at Marlin/src/MarlinCore.cpp:1345
1345   *(int*)0 = 123;

For post mortem debugging, you can use addr2line to get the source line for each PC reported.

Configurations

I've added a new option in Configuration_adv.h that's called POST_MORTEM_DEBUGGING. Feel free to enable it to have the function enabled in your next build!

Related Issues

Upon complete implementation for all CPU, it'll fix #20477.
Right now, it works for STM32F1 (the CPU I have), and should work for DUE and LPC1768 CPU.

@thinkyhead thinkyhead added PR: New Feature T: Development Makefiles, PlatformIO, Python scripts, etc. labels Dec 18, 2020
@X-Ryl669
Copy link
Contributor Author

X-Ryl669 commented Dec 18, 2020

@sjasonsmith Jason, if you have a STM32 board with a USB serial port, can you try enabling it and skipping the MinSerial specialization ?
That is, don't call install_min_serial() function in HAL_init. This should use the default serial implementation (falling back to SERIAL_CHAR macro) which in turns should use the USB's serial port.
In order to trigger an error, add *(int*)0 = 123; in MarlinCore.cpp's setup() function after hook installation.
Then start your machine with the serial port attached with a console/terminal that's able to auto-resume on the serial port disconnection/connection and reboot your machine (this is required as it'll happen very fast and you might miss it if connected too late). Under linux, minicom is known to work, on Windows, use TeraTerm.

After the usual bootup procedure, you should get a dump on the serial port.

If what I read from the STM32F103 technical reference manual is true (and the libmaple implementation seems to confirm it's non blocking), this might work as is, since it's using a dedicated memory buffer (so it's less likely to be corrupted by the main software) and hopefully, the interrupt handler might not depend on something stopped while the exception is being handled. That would really brighten my day if it works.

@X-Ryl669
Copy link
Contributor Author

I think I might be able to leave interrupts running on STM32 so it should work for USB serial but I can't test it. @sjasonsmith or @thinkyhead, do you have a board with USB serial to test ?

@X-Ryl669
Copy link
Contributor Author

I've updated the PR with a new feature.
Now, upon the first entry of the exception handler, I'm fixing the CPU status to "no fault" and hooked the return value with a dumper function. This dumper function is run not as an interrupt handler but as a normal function, so all interrupts are alive. It will dump using the usual SERIAL_CHAR macro (bypassing the setup 'direct HW UART' code) so it should work with USB port too.

If we re-enter the exception handler (or if the fault does not allow the former), then it'll not bypass the HW UART code and disable interrupt.

In that case, it's the best that can be done IMHO.
Please test and report, thanks!

@X-Ryl669
Copy link
Contributor Author

Any news ? This works perfectly fine on my board, and if someone with a USB serial port can confirm it does on his board, we could merge it ?

@sjasonsmith
Copy link
Contributor

I have boards, but also other priorities. If it depends upon interrupts it seems like a bad idea. What if the crash happens inside an already running interrupt? What if it happens inside the USB interrupt?

@X-Ryl669
Copy link
Contributor Author

X-Ryl669 commented Dec 21, 2020

It does not depends on interrupt. Former code disabled some interrupts, new code does not on its default path.
It's the CPU that disable interrupts while servicing the crash handler, there is nothing you can do against it. No interrupt can pre-empt a crash handler.

Concerning crash inside an interrupt handler, you'll have to refer to the CPU's technical reference manual. For ARM board, it's quite simple, the crash handler has higher priority than any interrupt. So if it happens during an interrupt, you'll get into the crash handler and it'll not use the "user space" code since it'll detect it's coming from an async handler already and instead fallback on the "disabled interrupt" code (so in that case, no USB output).

It's not possible to get a crash report on USB if the crash happens in USB's interrupt handler (obviously).

I think it's the best we can do, it's simply not possible to debug an interrupt crashing code using interrupts. You must use a debugger for this (and it's not what this code is).

But: In general interrupt code is quite small, so it's unlikely to crash. USB interrupt code is almost buffer free on STM32F so I don't think it'll be corrupted. When you'll have some free time, please test, I'm quite confident it'll work on USB directly.

@X-Ryl669
Copy link
Contributor Author

FYI the code does this in the crash handler:

  const bool faulted_from_exception = ((frame->xpsr & 0xFF) != 0);   // True if coming from interrupt
  if (!faulted_from_exception) { 
[...]
    frame->pc = (uint32_t)resume_from_fault; // Patch where to return to
[...]  
    MinSerial::force_using_default_output = true; // Force using default SERIAL_CHAR output
    return; // The CPU will resume in our handler, re-enabling interrupts meanwhile
  }
  else resume_from_fault(); // Call the crash dump code withing the handler, that is with interrupts disabled

@rhapsodyv
Copy link
Member

I don't think we will be able to make this a feature usable for a "standard user". Then, if this is intended to developers, it's surely very useful to send data to serial, but we probably will have a breakpoint set on the crash handler.

So I would not bother spending to much time trying to make it work with usb serial or any type of customisation.

The most important part, for me, is just the crash handler.

@sjasonsmith
Copy link
Contributor

When you'll have some free time, please test, I'm quite confident it'll work on USB directly.

Can you prove instructions to easily trigger the crash handler, to use for testing? Perhaps you could add a D-Code to gcode_d.cpp which will trigger a crash for testing purposes. Similar to D100 I added for testing the watchdog timer.

@sjasonsmith
Copy link
Contributor

we probably will have a breakpoint set on the crash handler

I do not usually have a debugger attached. There is some overhead in connecting a debugger, so I only do it when necessary.

@X-Ryl669
Copy link
Contributor Author

When you'll have some free time, please test, I'm quite confident it'll work on USB directly.

Can you prove instructions to easily trigger the crash handler, to use for testing? Perhaps you could add a D-Code to gcode_d.cpp which will trigger a crash for testing purposes. Similar to D100 I added for testing the watchdog timer.

A simple *(int*)0 = 123; triggers a crash each time. You can divide by 0 also, or perform an unaligned access char buf[5]; *(uint32_t*)&buf[1] = 3;, or an invalid instruction void (*func)() = (void (*)()) 0xE0000001; func();
In my testing code, I'm adding the first example where I want to get a crash. For testing, in the end of the MarlinCore's setup is perfect (just after the hook is installed).

I'll add the D code handler meanwhile.

@X-Ryl669
Copy link
Contributor Author

X-Ryl669 commented Dec 21, 2020

we probably will have a breakpoint set on the crash handler

I do not usually have a debugger attached. There is some overhead in connecting a debugger, so I only do it when necessary.

Same for me. In fact, I think many boards are sold with locked bootloader. You can't connect them with OpenOCD (or with many shortcoming like not being able to read memory from the debugger since it's in protected region). So this is also very useful in that case.

@rhapsodyv
Copy link
Member

I keep my option: we can just change the serial if we need, and try again. Of course we can make it as complete as possible, but if we need to keep it working only with hardware serial, I don't see any problem on that, as the target user for that feature may be able to connect to any serial on the board, not only using the usb plug one.

@X-Ryl669
Copy link
Contributor Author

Ok, I've added debug G-Code to trigger the faults above D451 D452 D453 D454. I haven't tested them (too late for me tonight), it should work. I've also fixed a variable rename issue breaking build.

@X-Ryl669
Copy link
Contributor Author

X-Ryl669 commented Dec 22, 2020

Ok, following Victor's advice, there is now a single debug gcode for this: D451 (in reference to Fahrenheit 451 book) taking the crash type from a T parameter. I've tested all of them and they work (at least, to the extend of what this could mean)

Documentation for this new code:

Debug G-Code D451

Format: D451 [Tx]
Trigger a crash on the system to test the crash handling code. After emitting this code, the system should answer by dumping a stack trace and rebooting. In case a debugger is attached, the debugger will be waken up on the crashing code. This requires enabling POST_MORTEM_DEBUGGING in Configuration_adv.h

By default the crash type for T is 1.
Accepted value for crash types are:

Type Meaning Crash resistance
1 Illegal memory access All CPU are expected to crash for this type
2 Divide by zero Some CPU will accept dividing by zero without crashing, like ARM
3 Unaligned access to memory Some CPU will accept accessing memory at unaligned address
4 Illegal instruction All CPU are expected to crash for this type

@X-Ryl669 X-Ryl669 force-pushed the AddPostMortemDebugging branch from 607cb73 to a1815bc Compare December 22, 2020 18:19
@sjasonsmith
Copy link
Contributor

Please don't force-push to your PR once it is being looked at by other people. It breaks continuity for anyone else looking at or touching the code.

@sjasonsmith
Copy link
Contributor

Can you explain this warning which is introduced by this feature?

Linking .pio\build\STM32F103RET6_creality\firmware-20201222-103339.elf
c:/users/jason/.platformio/packages/[email protected]/bin/../lib/gcc/arm-none-eabi/7.2.1/../../../../arm-none-eabi/bin/ld.exe: warning: changing start of section .bss by 104 bytes
c:/users/jason/.platformio/packages/[email protected]/bin/../lib/gcc/arm-none-eabi/7.2.1/../../../../arm-none-eabi/bin/ld.exe: warning: changing start of section .bss by 104 bytes
c:/users/jason/.platformio/packages/[email protected]/bin/../lib/gcc/arm-none-eabi/7.2.1/../../../../arm-none-eabi/bin/ld.exe: warning: changing start of section .bss by 104 bytes
c:/users/jason/.platformio/packages/[email protected]/bin/../lib/gcc/arm-none-eabi/7.2.1/../../../../arm-none-eabi/bin/ld.exe: warning: changing start of section .bss by 104 bytes
c:/users/jason/.platformio/packages/[email protected]/bin/../lib/gcc/arm-none-eabi/7.2.1/../../../../arm-none-eabi/bin/ld.exe: warning: changing start of section .bss by 104 bytes
Checking size .pio\build\STM32F103RET6_creality\firmware-20201222-103339.elf

@sjasonsmith
Copy link
Contributor

sjasonsmith commented Dec 22, 2020

Enabling this feature puts my Creality 4.2.7 board (STM32F1 HAL) into a reboot loop. I see both splash screens then it reboots.

Configuration.zip

@sjasonsmith
Copy link
Contributor

Won't compile at all for my Rumba32 (STM32 HAL). MinSerial seems to be missing.
Configuration.zip

c:/users/jason/.platformio/packages/[email protected]/bin/../lib/gcc/arm-none-eabi/9.2.1/../../../../arm-none-eabi/bin/ld.exe: .pio/build/rumba32/src/src/HAL/STM32/HAL.cpp.o: in function `HAL_init()':
C:\Users\jason\Documents\Marlin/Marlin\src\HAL\STM32/HAL.cpp:90: undefined reference to `install_min_serial()'

@sjasonsmith sjasonsmith added Needs: Work More work is needed S: Don't Merge Work in progress or under discussion. labels Dec 22, 2020
@X-Ryl669
Copy link
Contributor Author

Sorry, I wanted to squash my commit to reuse it on one of my branches to apply on top of bugfix.

I can't explain the linker warning (yet). I don't have the same values are yours. I'll try to figure it out. Doesn't seem to break anything on my side.

What does the debugger say on your Creality ? I got a reboot loop at first because some code was crashing later on on my Sapphire Plus, and it was gone unnoticed until the handler were installed.
You should enable MARLIN_DEV_MODE (in order to trigger the crash via the serial link) and disable USE_WATCHDOG (so it's struck and not rebooting and you can attach a debugger).

I'll have a look to the Rumba build. I thought the CI was checking this.

@sjasonsmith
Copy link
Contributor

sjasonsmith commented Dec 22, 2020

What does the debugger say on your Creality ?

It just prints this repeatedly. No output from your code at all.

Recv: BL2start
Recv: PowerUp
Recv: Marlin bugfix-2.0.x
Recv: 
Recv: echo: Last Updated: 2020-12-22 | Author: (Dust, Ender-3)
Recv: echo:Compiled: Dec 22 2020
Recv: echo: Free Memory: 49079  PlannerBufferBytes: 1344

@ejtagle
Copy link
Contributor

ejtagle commented Jan 10, 2021

I would not be surprised if fixing that macro also fixed the unwind code when there is no unwind table (of course, unwind tables will always be better)

@ejtagle
Copy link
Contributor

ejtagle commented Jan 10, 2021

I'm not very fond of the database of flash address in the unwind code. If one adds a platform, it's quick to forget to declare it. What about this instead:

static bool validate_addr (uint32_t read_address)
{
    bool works = true;

    // Clear bus fault indicator first (write 1 to clear)
    HWREG (NVIC_FAULT_STAT) |= NVIC_FAULT_STAT_BFARV;
    // Ignore bus fault interrupt
    HWREG (NVIC_CFG_CTRL) |= NVIC_CFG_CTRL_BFHFNMIGN;
    // Disable interrupts
    cli();
    // Probe address
    *(volatile uint32_t*)read_address;
    // Check if a fault happened 
    if ((HWREG (NVIC_FAULT_STAT) & NVIC_FAULT_STAT_BFARV) != 0)
        works = false;
    // Enable interrupts again
    sli();
    // Revert the no-fault interrupt flag
    HWREG (NVIC_CFG_CTRL) &= ~NVIC_CFG_CTRL_BFHFNMIGN;

    return works;
}

Pros: No need to declare any platform, if the address is readable, it says so.
Cons: Would return true if testing one of the core register or devices memory area. But since this should not happen in real code, I'm not sure it is a problem.

Reenabling interrupts is not good. You should store the interrupt enable status and restore it... ;)

@X-Ryl669
Copy link
Contributor Author

I don't know if backtrace is used anywhere else than the fault handler. In that case, we absolutely don't care about re-enabling interrupt unconditionally. The fault handler works in dual mode, it first tries to dump with interrupt enabled (by returning the handler) so there is no need to check for it (they are on), and if it's re-entering the fault handler, it's being run with the highest priority anyway and enabling or disabling interrupt is useless in that case since no interrupt can interrupt the current fault handler. We never exit the fault handler, but instead reboot the system.

I'll add the check so the code is generic and can be used anywhere.

@X-Ryl669
Copy link
Contributor Author

I've just tested it, it does not work without the unwind tables.
I don't know why, and honestly, I think it's too much work to debug the virtual machine code here.

So, it's as simple as possible, once you enable the POSTMORTEM_DEBUGGING flag, the unwind table are built in the binary. It's a small cost for a useful feature, so I think it's ok this way.

I've added the check for the interrupt status so now, if the interrupts are disabled, while entering the validate_addr function, they are not enabled upon leaving. It's completely ARM generic code, so it does not depend on libmaple or STSTM32 framework, it should work on any ARM CPU.

@ejtagle
Copy link
Contributor

ejtagle commented Jan 11, 2021

I've just tested it, it does not work without the unwind tables.
I don't know why, and honestly, I think it's too much work to debug the virtual machine code here.

So, it's as simple as possible, once you enable the POSTMORTEM_DEBUGGING flag, the unwind table are built in the binary. It's a small cost for a useful feature, so I think it's ok this way.

I've added the check for the interrupt status so now, if the interrupts are disabled, while entering the validate_addr function, they are not enabled upon leaving. It's completely ARM generic code, so it does not depend on libmaple or STSTM32 framework, it should work on any ARM CPU.

I do agree. Its not the VM fault. GCC inlining functions makes impossible to get useful traces without the unwind tables.

@X-Ryl669
Copy link
Contributor Author

X-Ryl669 commented Jan 11, 2021

Can someone remove the "needs work" flag (I think it's done now) ?
I've tested that it builds fine wherever appropriate and there is nothing more I can add or remove.

@ejtagle
Copy link
Contributor

ejtagle commented Jan 12, 2021

Can someone remove the "needs work" flag (I think it's done now) ?
I've tested that it builds fine wherever appropriate and there is nothing more I can add or remove.

Tag @thinkyhead ;) .... I also agree this is quite an improvement on what i did, that was only tested on Arduino DUE, so i see no valid reason to refuse merging

@X-Ryl669
Copy link
Contributor Author

@sjasonsmith Can you remove the "Need work" and "Don't merge" flags please ?

I think I've resolved all concerns you had, and tested on all platform (I've run the make tests-all-local and I correctly get failure where the platform wasn't supported, as expected, and it built correctly where it was supported).

This function is very very useful while developing (at least on my STM32F1 system) since instead of a crashed printer with a dumb backtrace "__error/ usart_disable" in GDB, I'm getting real point of crash in debugger. It's a real time saver.

@sjasonsmith sjasonsmith removed Needs: Work More work is needed S: Don't Merge Work in progress or under discussion. labels Jan 18, 2021
@sjasonsmith
Copy link
Contributor

I removed the labels as @X-Ryl669 requested, but have not performed any new testing on my end.

I suggest holding this and merging it just after a 2.0.8 is tagged, assuming that happens soon.

@thinkyhead thinkyhead force-pushed the AddPostMortemDebugging branch from ccee25d to e91d121 Compare February 17, 2021 23:36
thinkyhead added a commit to MarlinFirmware/Configurations that referenced this pull request Feb 17, 2021
@thinkyhead thinkyhead merged commit 8d28853 into MarlinFirmware:bugfix-2.0.x Feb 21, 2021
Copy link

@Narin4488 Narin4488 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPBor0528

@X-Ryl669
Copy link
Contributor Author

@Narin4488 What ?

kylix-rd pushed a commit to kylix-rd/Marlin that referenced this pull request Feb 22, 2021
kylix-rd pushed a commit to kylix-rd/Marlin that referenced this pull request Feb 22, 2021
zillarob pushed a commit to zillarob/Marlin that referenced this pull request Feb 25, 2021
vyacheslav-shubin pushed a commit to vyacheslav-shubin/Marlin that referenced this pull request Mar 10, 2021
vyacheslav-shubin pushed a commit to vyacheslav-shubin/Marlin that referenced this pull request Mar 10, 2021
W4tel-BiDi pushed a commit to W4tel-BiDi/Marlin that referenced this pull request Apr 5, 2021
chrisjenda pushed a commit to chrisjenda/Marlin that referenced this pull request Apr 7, 2021
tonokip pushed a commit to ultimachine/Marlin that referenced this pull request Apr 29, 2021
@quiret
Copy link
Contributor

quiret commented Sep 19, 2022

Please make a tutorial for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: New Feature T: Development Makefiles, PlatformIO, Python scripts, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants