Segfault in faulthandler signal handler #116008

Open

dgrisby opened this issue Feb 27, 2024 · 9 comments
Labels
type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@dgrisby
Contributor

dgrisby commented Feb 27, 2024

Crash report

What happened?

We use faulthandler.register to dump stacks on SIGUSR1.

faulthandler.register(signal.SIGUSR1, chain=True)
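
For reference, a minimal self-contained sketch of this registration (the final os.kill call is only there to illustrate triggering a dump; in our service the signal is sent from outside the process):

import faulthandler
import os
import signal

# Dump all thread stacks on SIGUSR1, chaining to any handler that was
# previously installed for that signal.
faulthandler.register(signal.SIGUSR1, chain=True)

# Illustration only: signal ourselves once to trigger a dump.
os.kill(os.getpid(), signal.SIGUSR1)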

When we sent the signal to a running multi-threaded process, it segfaulted. It successfully dumped several thread stacks, then crashed part way through:

Thread 0x00007f4fbd4a3640 (most recent call first):
  File "/usr/tideway/lib/python3.11/threading.py", line 320 in wait
  File "/usr/tideway/lib/python3.11/threading.py", line 622 in wait
  File "/usr/tideway/python/reasoning/consolidation/sender.py", line 1151 in _threadedSendingOfData
  File "/usr/tideway/lib/python3.11/threading.py", line 975 in run
  File "/usr/tideway/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/tideway/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f4fafca0640 (most recent call first):
  File "/usr/tideway/python/stubs/search_idl.py", line 720 in search
  File "/usr/tideway/python/reasoning/ecaengine/tpl/tpl_search.py", line 87 in _doSearch
  File "/usr/tideway/python/reasoning/ecaengine/tpl/tpl_search.py", line 103 in search
  File "/usr/tideway/var/code/tpl_py/SupportDetail/SoftwareInstance/__init__.py", line 233 in execute
  File "/usr/tideway/python/reasoning/ecaengine/engine/rule.py", line 141 in doExecute
  File "/usr/tideway/python/reasoning/ecaengine/engine/__init__.py", line 715 in processEvent
  File "/usr/tideway/python/reasoning/ecaengine/transactional_server.py", line 258 in processEvent
  File "/usr/tideway/pylib/00_hack/common/threadpool.py", line 93 in run
  File "/usr/tideway/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/tideway/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f4fb04a1640 (most recent call first):
  File "/usr/tideway/lib/python3.11/threading.py", line 338 in notify_all
  File "/usr/tideway/lib/python3.11/concurrent/futures/_base.py", line 550 in set_result
  File Tue Feb 27 14:17:40 2024 : Watchdog : Service failed after 1674.6 seconds (signal 11; exit 0). Restarting (1 restart so far).

The last line shows that it wrote "File" and then segfaulted. (The crash was caught by a parent watchdog process that restarted the service; the timestamp and the rest of that line are from the watchdog process.)

Looking at the core file:

Core was generated by `python3 -m reasoning.main --daemon start'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4fcd393e70 in dump_frame (frame=0x7f4f9533bac0, fd=2) at Python/traceback.c:1190
1190	        && PyUnicode_Check(code->co_filename))

(gdb) bt
#0  0x00007f4fcd393e70 in dump_frame (frame=0x7f4f9533bac0, fd=2) at Python/traceback.c:1190
#1  dump_traceback (fd=fd@entry=2, tstate=tstate@entry=0xbd23860, write_header=write_header@entry=0) at Python/traceback.c:1242
#2  0x00007f4fcd3940b7 in _Py_DumpTracebackThreads (fd=fd@entry=2, interp=<optimized out>, interp@entry=0x0, current_tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>) at Python/traceback.c:1344
#3  0x00007f4fcd39a380 in faulthandler_dump_traceback (fd=2, all_threads=1, interp=<optimized out>) at ./Modules/faulthandler.c:252
#4  0x00007f4fcd39a539 in faulthandler_user (signum=10) at ./Modules/faulthandler.c:862
#5  <signal handler called>
#6  0x00007f4fcce9c3aa in __futex_abstimed_wait_common () from /lib64/libc.so.6
#7  0x00007f4fccea6bae in __new_sem_wait_slow64 () from /lib64/libc.so.6
#8  0x00007f4fcd466270 in PyThread_acquire_lock_timed (lock=lock@entry=0x50d6510, microseconds=microseconds@entry=5000000, intr_flag=intr_flag@entry=1) at Python/thread_pthread.h:483
#9  0x00007f4fcd4f5c72 in acquire_timed (lock=0x50d6510, timeout=5000000000) at ./Modules/_threadmodule.c:98
#10 0x00007f4fcd4f5abe in lock_PyThread_acquire_lock (self=0x7f4fb1dabdc0, args=args@entry=0x7f4fa0b4fdc0, kwds=kwds@entry=0x0) at ./Modules/_threadmodule.c:179
#11 0x00007f4fcd3ea0dc in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=0x7f4fcd9394b0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/descrobject.c:364
#12 0x00007f4fcd3e59a8 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7f4fcd18af20, tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>)
    at ./Include/internal/pycore_call.h:92
#13 PyObject_Vectorcall (callable=0x7f4fcd18af20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:299
#14 0x00007f4fcd4438ca in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4774
#15 0x00007f4fcd442557 in _PyEval_EvalFrame (throwflag=0, frame=0x7f4fcd9391b8, tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#16 _PyEval_Vector (tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>, func=<optimized out>, locals=0x7f4fcd1f30c0, args=0x0, argcount=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:6439
#17 0x00007f4fcd4bf2f4 in PyEval_EvalCode (co=co@entry=0x7f4fccd230a0, globals=globals@entry=0x7f4fcd1f30c0, locals=locals@entry=0x7f4fcd1f30c0) at Python/ceval.c:1154
#18 0x00007f4fcd4be3b4 in builtin_exec_impl (module=<optimized out>, closure=<optimized out>, locals=0x7f4fcd1f30c0, globals=0x7f4fcd1f30c0, source=0x7f4fccd230a0) at Python/bltinmodule.c:1077
#19 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at Python/clinic/bltinmodule.c.h:465
#20 0x00007f4fcd40c4c6 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7f4fcd188f90, args=0x7f4fcd939180, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/methodobject.h:52
#21 0x00007f4fcd3e59a8 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7f4fcd188f90, tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>)
    at ./Include/internal/pycore_call.h:92
#22 PyObject_Vectorcall (callable=0x7f4fcd188f90, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:299
#23 0x00007f4fcd4438ca in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4774
#24 0x00007f4fcd442557 in _PyEval_EvalFrame (throwflag=0, frame=0x7f4fcd939020, tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#25 _PyEval_Vector (tstate=0x7f4fcd75d2d8 <_PyRuntime+166328>, func=<optimized out>, locals=0x0, args=0x7f4fccda6c18, argcount=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:6439
#26 0x00007f4fcd4dc089 in pymain_run_module (modname=<optimized out>, set_argv0=<optimized out>) at Modules/main.c:300
#27 0x00007f4fcd4dbc8d in pymain_run_python (exitcode=0x7ffda6d1e874) at Modules/main.c:595
#28 Py_RunMain () at Modules/main.c:680
#29 0x00007f4fcd4dba39 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:734
#30 0x00007f4fcce3feb0 in __libc_start_call_main () from /lib64/libc.so.6
#31 0x00007f4fcce3ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#32 0x0000000000401075 in _start ()
(gdb) p code
$1 = (PyCodeObject *) 0x7f4fcd63ebe0 <empty_keys_struct>
(gdb) p *code
$2 = {ob_base = {ob_base = {ob_refcnt = 118616, ob_type = 0x100010000}, ob_size = 0}, co_consts = 0x0, co_names = 0xffffffffffffffff, co_exceptiontable = 0x0, co_flags = 0, co_warmup = 0, _co_linearray_entry_size = 0, 
  co_argcount = 0, co_posonlyargcount = 0, co_kwonlyargcount = 11, co_stacksize = 0, co_firstlineno = -849062080, co_nlocalsplus = 32591, co_nlocals = 0, co_nplaincellvars = 0, co_ncellvars = -850270115, 
  co_nfreevars = 32591, co_localsplusnames = 0x18, co_localspluskinds = 0x0, co_filename = 0x7f4fcd408c90 <dictview_dealloc>, co_name = 0x0, co_qualname = 0x0, co_linetable = 0x0, co_weakreflist = 0x0, 
  _co_code = 0x7f4fcd499e60 <dictview_repr>, _co_linearray = 0x0, _co_firsttraceable = -849084352, co_extra = 0x0, co_code_adaptive = " "}
(gdb) p *code->co_filename
$3 = {ob_refcnt = -177530513340544939, ob_type = 0x8418b48f8578b48}
(gdb) p *code->co_filename->ob_type
Cannot access memory at address 0x8418b48f8578b48

co_filename is clearly invalid here.

Most of the time, dumping the stacks works fine, but by triggering it repeatedly I can reproduce the crash when the process is busy. A second time, it failed in a slightly different place, while accessing co_name:

(gdb) bt
#0  0x00007f4ab7993f05 in dump_frame (frame=0x7f4ab3ab43a8, fd=2) at Python/traceback.c:1210
#1  dump_traceback (fd=fd@entry=2, tstate=tstate@entry=0xae9d1f0, write_header=write_header@entry=0) at Python/traceback.c:1242
#2  0x00007f4ab79940b7 in _Py_DumpTracebackThreads (fd=fd@entry=2, interp=<optimized out>, interp@entry=0x0, current_tstate=0x7f4ab7d5d2d8 <_PyRuntime+166328>) at Python/traceback.c:1344
#3  0x00007f4ab799a380 in faulthandler_dump_traceback (fd=2, all_threads=1, interp=<optimized out>) at ./Modules/faulthandler.c:252
#4  0x00007f4ab799a539 in faulthandler_user (signum=10) at ./Modules/faulthandler.c:862
#5  <signal handler called>
#6  0x00007f4ab749c3aa in __futex_abstimed_wait_common () from /lib64/libc.so.6
...


(gdb) p *code
$2 = {ob_base = {ob_base = {ob_refcnt = 336273, ob_type = 0x7f4ab7c42940 <_PyNone_Type>}, ob_size = 0}, co_consts = 0x0, co_names = 0xa2, co_exceptiontable = 0x7f4ab7c45740 <PyType_Type>, co_flags = 0, co_warmup = 0, 
  _co_linearray_entry_size = 0, co_argcount = -1213067338, co_posonlyargcount = 32586, co_kwonlyargcount = 16, co_stacksize = 0, co_firstlineno = 0, co_nlocalsplus = 0, co_nlocals = -1215091588, co_nplaincellvars = 32586, 
  co_ncellvars = 0, co_nfreevars = 0, co_localsplusnames = 0x0, co_localspluskinds = 0x0, co_filename = 0x0, co_name = 0x7f4ab7a0eb60 <none_repr>, co_qualname = 0x7f4ab7c42c40 <none_as_number>, co_linetable = 0x0, 
  co_weakreflist = 0x0, _co_code = 0x7f4ab7a61a20 <_Py_HashPointer>, _co_linearray = 0x0, _co_firsttraceable = -1214131376, co_extra = 0x7f4ab7a0dfe0 <PyObject_GenericGetAttr>, co_code_adaptive = "P"}

(gdb) p code->co_name
$4 = (PyObject *) 0x7f4ab7a0eb60 <none_repr>
(gdb) p *code->co_name
$5 = {ob_refcnt = -1657306674253427384, ob_type = 0x401f0fffee6194}

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.11.4 (main, Jun 19 2023, 17:32:00) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4.3.0.1)]

@dgrisby dgrisby added the type-crash A hard crash of the interpreter, possibly with a core dump label Feb 27, 2024
@dgrisby
Contributor Author

dgrisby commented Feb 28, 2024

Thinking about it some more, I suspect that this is a race condition where a thread is executing Python code while the faulthandler signal handler is running. The main thread, handling the signal, is walking the stack frames of another thread while that thread is actively executing, so the frames are changing underneath it as the traceback is written.

@jonny5532

jonny5532 commented Mar 3, 2024

I've just been investigating a similar issue myself - here's a short reproducer:

import faulthandler
import signal
faulthandler.enable()
faulthandler.register(signal.SIGUSR1)

import threading
while True:
    threading.Thread(target=lambda: None).start()

Whilst the above is running, spam it with SIGUSR1 calls, eg:

while [ -d . ]; do killall -USR1 python3; done

and it will instantly segfault (I have tried Python 3.10.12, 3.11.8, and 3.12.2).
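
For a variant that does not need a second terminal, the signal can in principle be spammed from a daemon thread in the same process (a sketch along the same lines; signal delivery timing differs, so it may not crash quite as readily):

import faulthandler
import os
import signal
import threading

faulthandler.enable()
faulthandler.register(signal.SIGUSR1)

def spam():
    # Keep signalling ourselves so the faulthandler handler runs while
    # threads are being created and destroyed.
    while True:
        os.kill(os.getpid(), signal.SIGUSR1)

threading.Thread(target=spam, daemon=True).start()

while True:
    threading.Thread(target=lambda: None).start()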

It seems to me that faulthandler is flawed by design - it doesn't hold the GIL, so state may get changed (eg, stack frames or thread states freed) whilst it is producing the traceback.

@jonny5532

A workaround is to avoid using faulthandler.register(...) and instead catch the signal yourself and call faulthandler.dump_traceback, eg:

import signal
import faulthandler

def handler(signum, frame):
    faulthandler.dump_traceback(all_threads=True)
signal.signal(signal.SIGUSR1, handler)

import threading
while True:
    threading.Thread(target=lambda: None).start()

Then the GIL will be held when the traceback is output, and it no longer crashes.
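
If the chain=True behaviour from the original report is also wanted, one possible extension of this workaround (a sketch, assuming any previous handler was installed at the Python level) is to capture the old handler and call it after dumping:

import faulthandler
import signal

# Remember whatever Python-level handler was installed before.
_previous = signal.getsignal(signal.SIGUSR1)

def handler(signum, frame):
    faulthandler.dump_traceback(all_threads=True)
    # Chain to the previous handler if it was a Python callable;
    # SIG_DFL and SIG_IGN are not callable and are skipped.
    if callable(_previous):
        _previous(signum, frame)

signal.signal(signal.SIGUSR1, handler)

As with the workaround above, the Python-level handler runs with the GIL held, so the frames being dumped cannot change underneath it.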

@dgrisby
Contributor Author

dgrisby commented Mar 4, 2024

The crash I am seeing does not involve rapid starting and stopping of threads. In my case, I have a relatively small number of threads that are busy making sequences of nested calls, meaning that the call stack is changing quite rapidly.

@jonny5532

I'm guessing that the issue is caused by a thread exiting whilst the traceback is being generated. My silly reproducer obviously does that a lot, but your traceback segfaults at the set_result call in concurrent.futures, which might have been shortly followed by the thread exiting (eg, if the executor goes out of scope and gets collected). Does that happen in your code?

I was able to crash Python 3.10 with a single additional thread running a while loop, but 3.11+ have so far required thread creation/destruction thrash to trigger a segfault (with the faulthandler+SIGUSR1 combo).

I get very similar backtraces to you in gdb (using the above reproducer on Python 3.11.8):

First crash:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
dump_traceback (fd=2, tstate=<optimized out>, write_header=<optimized out>) at Python/traceback.c:1340
1340            frame = frame->previous;
(gdb) bt
#0  dump_traceback (fd=2, tstate=<optimized out>, write_header=<optimized out>) at Python/traceback.c:1340
#1  0x00007ffff7abc753 in _Py_DumpTracebackThreads (fd=fd@entry=2, interp=<optimized out>, interp@entry=0x0, 
    current_tstate=0x7ffff7d8c278 <_PyRuntime+166328>) at Python/traceback.c:1446
#2  0x00007ffff7acebf5 in faulthandler_dump_traceback (interp=<optimized out>, all_threads=<optimized out>, 
    fd=<optimized out>) at ./Modules/faulthandler.c:251
#3  faulthandler_dump_traceback (interp=<optimized out>, all_threads=<optimized out>, fd=<optimized out>)
    at ./Modules/faulthandler.c:230
#4  faulthandler_user (signum=10) at ./Modules/faulthandler.c:860
#5  <signal handler called>
#6  0x00007ffff7491115 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, 
    expected=0, futex_word=0x555555584290) at ./nptl/futex-internal.c:57
...snipped...

(gdb) p frame
$1 = (_PyInterpreterFrame *) 0x7ffff7fa8090

(gdb) p frame->previous
Cannot access memory at address 0x7ffff7fa80c0

Second crash:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
PyCode_Addr2Line (co=0x0, addrq=-184) at Objects/codeobject.c:769
769             return co->co_firstlineno;
(gdb) bt
#0  PyCode_Addr2Line (co=0x0, addrq=-184) at Objects/codeobject.c:769
#1  0x00007ffff7a7c8dd in _PyInterpreterFrame_GetLine (frame=frame@entry=0x7ffff7fa8190) at Python/frame.c:169
#2  0x00007ffff7abc4e7 in dump_frame (frame=0x7ffff7fa8190, fd=2) at Python/traceback.c:1274
#3  dump_traceback (fd=2, tstate=<optimized out>, write_header=<optimized out>) at Python/traceback.c:1339
#4  0x00007ffff7abc753 in _Py_DumpTracebackThreads (fd=fd@entry=2, interp=<optimized out>, interp@entry=0x0, 
    current_tstate=0x7ffff7d8c278 <_PyRuntime+166328>) at Python/traceback.c:1446
#5  0x00007ffff7acebf5 in faulthandler_dump_traceback (interp=<optimized out>, all_threads=<optimized out>, 
    fd=<optimized out>) at ./Modules/faulthandler.c:251
#6  faulthandler_dump_traceback (interp=<optimized out>, all_threads=<optimized out>, fd=<optimized out>)
    at ./Modules/faulthandler.c:230
#7  faulthandler_user (signum=10) at ./Modules/faulthandler.c:860
#8  <signal handler called>
#9  0x00007ffff7491115 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, 
    expected=0, futex_word=0x555555565820) at ./nptl/futex-internal.c:57
...snipped...

(gdb) p co
$1 = (PyCodeObject *) 0x0

@dgrisby
Contributor Author

dgrisby commented Mar 5, 2024

The threads in my system are definitely not exiting. They are all long-lived.

@jonny5532

jonny5532 commented Mar 6, 2024

Here's a reproducer that only needs one long-lived thread:

import faulthandler
import signal
faulthandler.register(signal.SIGUSR1)

import threading
import time

def rec(x):
    a = 1
    b = 2
    c = 3
    if x>0: return rec(x-1)

def loop():
    while True:
        rec(123)

threading.Thread(target=loop).start()

time.sleep(1000)

Again, run this whilst spamming it with USR1s from another terminal with:

while [ -d . ]; do killall -USR1 python3; done

It crashes instantly on Python 3.11.8 and 3.12.2 (Linux, x86_64 on Ryzen 6800U).

It needs those three assignments in the loop, and sufficiently deep recursion, to trigger the bug (maybe this pushes the stack frame over some size threshold that changes the allocation/free behaviour, or bypasses some optimisations).

#110052 is possibly the same issue - although as you note (and the above reproducer indicates) the issue in this case seems to be races involving stack frame manipulation rather than thread creation/destruction.

@vstinner
Member

vstinner commented Apr 3, 2024

It seems to me that faulthandler is flawed by design - it doesn't hold the GIL, so state may get changed (eg, stack frames or thread states freed) whilst it is producing the traceback.

Correct, it's not thread-safe by design. The faulthandler signal handler is only a "best effort" debugging tool that tries to provide some information about a crash. Sometimes it works, sometimes it crashes, since Python internals are no longer consistent.

Maybe for non-fatal signals such as SIGUSR1, it can try to acquire/release the GIL, but I'm not sure if it's doable.

@dgrisby
Contributor Author

dgrisby commented Apr 4, 2024

Correct, it's not thread-safe by design. The faulthandler signal handler is only a "best effort" debugging tool that tries to provide some information about a crash. Sometimes it works, sometimes it crashes, since Python internals are no longer consistent.

In that case, I think there should be a prominent warning in the documentation for faulthandler.register that it must not be used in multi-threaded code.

Maybe for non-fatal signals such as SIGUSR1, it can try to acquire/release the GIL, but I'm not sure if it's doable.

That would seem like a sensible approach if it is possible.
