-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add post mortem debugging #20492
Add post mortem debugging #20492
Conversation
@sjasonsmith Jason, if you have a STM32 board with a USB serial port, can you try enabling it and skipping the MinSerial specialization ? After the usual bootup procedure, you should get a dump on the serial port. If what I read from the STM32F103 technical reference manual is true (and the libmaple implementation seems to confirm it's non blocking), this might work as is, since it's using a dedicated memory buffer (so it's less likely to be corrupted by the main software) and hopefully, the interrupt handler might not depend on something stopped while the exception is being handled. That would really brighten my day if it works. |
I think I might be able to leave interrupts running on STM32 so it should work for USB serial but I can't test it. @sjasonsmith or @thinkyhead, do you have a board with USB serial to test ? |
I've updated the PR with a new feature. If we re-enter the exception handler (or if the fault does not allow the former), then it'll not bypass the HW UART code and disable interrupt. In that case, it's the best that can be done IMHO. |
Any news ? This works perfectly fine on my board, and if someone with a USB serial port can confirm it does on his board, we could merge it ? |
I have boards, but also other priorities. If it depends upon interrupts it seems like a bad idea. What if the crash happens inside an already running interrupt? What if it happens inside the USB interrupt? |
It does not depends on interrupt. Former code disabled some interrupts, new code does not on its default path. Concerning crash inside an interrupt handler, you'll have to refer to the CPU's technical reference manual. For ARM board, it's quite simple, the crash handler has higher priority than any interrupt. So if it happens during an interrupt, you'll get into the crash handler and it'll not use the "user space" code since it'll detect it's coming from an async handler already and instead fallback on the "disabled interrupt" code (so in that case, no USB output). It's not possible to get a crash report on USB if the crash happens in USB's interrupt handler (obviously). I think it's the best we can do, it's simply not possible to debug an interrupt crashing code using interrupts. You must use a debugger for this (and it's not what this code is). But: In general interrupt code is quite small, so it's unlikely to crash. USB interrupt code is almost buffer free on STM32F so I don't think it'll be corrupted. When you'll have some free time, please test, I'm quite confident it'll work on USB directly. |
FYI the code does this in the crash handler:
|
I don't think we will be able to make this a feature usable for a "standard user". Then, if this is intended to developers, it's surely very useful to send data to serial, but we probably will have a breakpoint set on the crash handler. So I would not bother spending to much time trying to make it work with usb serial or any type of customisation. The most important part, for me, is just the crash handler. |
Can you prove instructions to easily trigger the crash handler, to use for testing? Perhaps you could add a D-Code to gcode_d.cpp which will trigger a crash for testing purposes. Similar to D100 I added for testing the watchdog timer. |
I do not usually have a debugger attached. There is some overhead in connecting a debugger, so I only do it when necessary. |
A simple I'll add the D code handler meanwhile. |
Same for me. In fact, I think many boards are sold with locked bootloader. You can't connect them with OpenOCD (or with many shortcoming like not being able to read memory from the debugger since it's in protected region). So this is also very useful in that case. |
I keep my option: we can just change the serial if we need, and try again. Of course we can make it as complete as possible, but if we need to keep it working only with hardware serial, I don't see any problem on that, as the target user for that feature may be able to connect to any serial on the board, not only using the usb plug one. |
Ok, I've added debug G-Code to trigger the faults above D451 D452 D453 D454. I haven't tested them (too late for me tonight), it should work. I've also fixed a variable rename issue breaking build. |
Ok, following Victor's advice, there is now a single debug gcode for this: D451 (in reference to Fahrenheit 451 book) taking the crash type from a Documentation for this new code: Debug G-Code D451Format: By default the crash type for
|
607cb73
to
a1815bc
Compare
Please don't force-push to your PR once it is being looked at by other people. It breaks continuity for anyone else looking at or touching the code. |
Can you explain this warning which is introduced by this feature?
|
Enabling this feature puts my Creality 4.2.7 board (STM32F1 HAL) into a reboot loop. I see both splash screens then it reboots. |
Won't compile at all for my Rumba32 (STM32 HAL). MinSerial seems to be missing.
|
Sorry, I wanted to squash my commit to reuse it on one of my branches to apply on top of bugfix. I can't explain the linker warning (yet). I don't have the same values are yours. I'll try to figure it out. Doesn't seem to break anything on my side. What does the debugger say on your Creality ? I got a reboot loop at first because some code was crashing later on on my Sapphire Plus, and it was gone unnoticed until the handler were installed. I'll have a look to the Rumba build. I thought the CI was checking this. |
It just prints this repeatedly. No output from your code at all.
|
I would not be surprised if fixing that macro also fixed the unwind code when there is no unwind table (of course, unwind tables will always be better) |
Reenabling interrupts is not good. You should store the interrupt enable status and restore it... ;) |
I don't know if backtrace is used anywhere else than the fault handler. In that case, we absolutely don't care about re-enabling interrupt unconditionally. The fault handler works in dual mode, it first tries to dump with interrupt enabled (by returning the handler) so there is no need to check for it (they are on), and if it's re-entering the fault handler, it's being run with the highest priority anyway and enabling or disabling interrupt is useless in that case since no interrupt can interrupt the current fault handler. We never exit the fault handler, but instead reboot the system. I'll add the check so the code is generic and can be used anywhere. |
I've just tested it, it does not work without the unwind tables. So, it's as simple as possible, once you enable the POSTMORTEM_DEBUGGING flag, the unwind table are built in the binary. It's a small cost for a useful feature, so I think it's ok this way. I've added the check for the interrupt status so now, if the interrupts are disabled, while entering the validate_addr function, they are not enabled upon leaving. It's completely ARM generic code, so it does not depend on libmaple or STSTM32 framework, it should work on any ARM CPU. |
I do agree. Its not the VM fault. GCC inlining functions makes impossible to get useful traces without the unwind tables. |
Can someone remove the "needs work" flag (I think it's done now) ? |
Tag @thinkyhead ;) .... I also agree this is quite an improvement on what i did, that was only tested on Arduino DUE, so i see no valid reason to refuse merging |
@sjasonsmith Can you remove the "Need work" and "Don't merge" flags please ? I think I've resolved all concerns you had, and tested on all platform (I've run the This function is very very useful while developing (at least on my STM32F1 system) since instead of a crashed printer with a dumb backtrace "__error/ usart_disable" in GDB, I'm getting real point of crash in debugger. It's a real time saver. |
I removed the labels as @X-Ryl669 requested, but have not performed any new testing on my end. I suggest holding this and merging it just after a 2.0.8 is tagged, assuming that happens soon. |
ccee25d
to
e91d121
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CPBor0528
@Narin4488 What ? |
Please make a tutorial for this. |
Requirements
This is a generic feature for Marlin, in link with issue #20477.
Without such feature, development or simply bug fixing is a real PITA or completely impossible.
Description
This adds a generic mechanism to hook into CPU exceptions (like hard fault, memory fault, bus fault, usage fault, etc...) and when any happen, it captures the CPU state (stack, registers, callstack) and either break under a debugger if any is attached and/or dump those on the main serial output. Care is taken not to trigger another fault when it happens.
Hooking the vector table is done at runtime (not in flash) since many platform don't support this overloading.Update: libmaple does not like this, so it's done at compile time now.
In the future, it could even be "installed" at runtime upon changing the debug level via M111.
Please notice that this feature partially existed in LPC1768 and DUE but with many code duplication (platform are very different). I've made them more generic and support for STM32F1 is added (my board is a STM32F1).
Work is still required for other CPU type, but the work will be minimal for for ARMv7 CPU (most of the generic code is in place, only the serial device configuration must be written).
I don't know if it's possible to write a driver for AVR, but it's probably possible for ESP32 (I once written a GDBstub for this CPU, it's not too hard).
For debugging with a ST-Link V2 probe, you can check my tutorial here
Benefits
You'll get this in the serial console in case of a bug:
You'll get this in GDB if you are attached while a bug happens:
For post mortem debugging, you can use
addr2line
to get the source line for each PC reported.Configurations
I've added a new option in
Configuration_adv.h
that's calledPOST_MORTEM_DEBUGGING
. Feel free to enable it to have the function enabled in your next build!Related Issues
Upon complete implementation for all CPU, it'll fix #20477.
Right now, it works for STM32F1 (the CPU I have), and should work for DUE and LPC1768 CPU.