Segfault in faulthandler signal handler #116008
Comments
Thinking about it some more, I suspect that this is a race condition where a thread is executing Python code while the faulthandler signal handler code is running. The main thread handling the signal is looking at the stack frames of a thread while that thread is actively executing. The stack frames are being changed as the traceback executes. |
I've just been investigating a similar issue myself - here's a short reproducer:
Whilst the above is running, spam it with SIGUSR1 calls, eg:
and it will instantly segfault (have tried Python 3.10.12, 3.11.8 and 3.12.2). It seems to me that faulthandler is flawed by design - it doesn't hold the GIL, so state may get changed (eg, stack frames or thread states freed) whilst it is producing the traceback. |
A workaround is to avoid using
Then the GIL will be held when the traceback is output, and it no longer crashes. |
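The replacement the comment above suggests is not shown here. One way to get the traceback written while the GIL is held, assuming a Python-level handler installed with `signal.signal` is acceptable, is sketched below. The helper name `install_gil_safe_dump` is hypothetical; `signal.signal` and `faulthandler.dump_traceback` are the standard library calls.

```python
import faulthandler
import signal
import sys


def install_gil_safe_dump(signum=signal.SIGUSR1, file=sys.stderr):
    """Dump all thread tracebacks from a Python-level signal handler.

    Hypothetical helper, not part of faulthandler. Python-level handlers
    run in the main thread between bytecode instructions, so the GIL is
    held while dump_traceback() walks the other threads' frames.
    """
    def _handler(received_signum, frame):
        # `file` must expose fileno(); faulthandler writes to the raw fd.
        faulthandler.dump_traceback(file=file, all_threads=True)

    # Return the previous handler so the caller can restore it later.
    return signal.signal(signum, _handler)
```

Calling `install_gil_safe_dump()` once from the main thread at startup takes the place of `faulthandler.register(signal.SIGUSR1)`. The trade-off is that a Python-level handler only runs once the main thread gets back to executing bytecode, so nothing is dumped while the main thread is stuck inside a long-running C call.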
The crash I am seeing does not involve rapid starting and stopping of threads. In my case, I have a relatively small number of threads that are busy making sequences of nested calls, meaning that the call stack is changing quite rapidly. |
I'm guessing that the issue is caused by a thread exiting whilst the traceback is being generated - my silly reproducer obviously does that a lot, but your traceback segfaults at the
I was able to crash Python 3.10 with a single additional thread running a while loop, but 3.11+ have so far required thread creation/destruction thrash to trigger a segfault (with the faulthandler+SIGUSR1 combo).
I get very similar backtraces to you in gdb (using the above reproducer on Python 3.11.8):
First crash:
Second crash:
|
The threads in my system are definitely not exiting. They are all long-lived. |
Here's a reproducer that only needs one long-lived thread:

    import faulthandler
    import signal
    faulthandler.register(signal.SIGUSR1)

    import threading
    import time

    def rec(x):
        a = 1
        b = 2
        c = 3
        if x>0: return rec(x-1)

    def loop():
        while True:
            rec(123)

    threading.Thread(target=loop).start()
    time.sleep(1000)

Again, run this whilst spamming it with USR1s from another terminal with:
It crashes instantly on Python 3.11.8 and 3.12.2 (Linux, x86_64 on Ryzen 6800U). It needs those three assignments in the loop, and sufficiently deep recursion, to trigger the bug (maybe this is pushing the stack frame over some size which changes the allocation/free behaviour, or bypassing some optimisations). #110052 is possibly the same issue - although as you note (and the above reproducer indicates) the issue in this case seems to be races involving stack frame manipulation rather than thread creation/destruction. |
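The signal-spamming command itself is not shown above. As a rough stand-in (not the command used in the comment), a small Python sender can do the same job with `os.kill`. The helper name `spam_signal` and its defaults are made up for illustration.

```python
import os
import signal
import time


def spam_signal(pid, signum=signal.SIGUSR1, count=1000, interval=0.001):
    """Send `signum` to process `pid` up to `count` times.

    Hypothetical helper for exercising a reproducer. Returns the number
    of signals actually sent.
    """
    sent = 0
    for _ in range(count):
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            # The target no longer exists, for example after a segfault.
            break
        sent += 1
        time.sleep(interval)
    return sent
```

Run it from a second interpreter with the reproducer's PID as the first argument. The target must already have a SIGUSR1 handler installed (for example via `faulthandler.register`), because the default action for SIGUSR1 terminates the process.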
Correct, it's not thread-safe by design. faulthandler is only a "best effort" debugging tool that tries to provide some information about a crash. Sometimes it works, sometimes it crashes because Python internals are no longer consistent. Maybe for non-fatal signals such as SIGUSR1 it could try to acquire and release the GIL, but I'm not sure that's doable. |
In that case, I think there should be a prominent warning in the documentation for faulthandler.register, that it must not be used in multi-threaded code.
That would seem like a sensible approach if it is possible. |
Crash report
What happened?
We use faulthandler.register to dump stacks on SIGUSR1.
Sending the signal to a running multi-threaded process, it segfaulted. It successfully dumped several thread stacks, but segfaulted part way through:
The last line shows that it wrote "File", then segfaulted. (That was caught by a parent watchdog process that restarted the process. The timestamp and rest of the message is from the watchdog process.)
Looking at the core file:
co_filename is clearly invalid here.
Most of the time dumping the stack works fine, but doing it repeatedly I can reproduce this when the process is busy. A second time, it failed in a slightly different place, accessing co_name:
CPython versions tested on:
3.11
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.11.4 (main, Jun 19 2023, 17:32:00) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4.3.0.1)]