Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/trace: crash during traceAdvance when collecting call stack for cgo-calling goroutine #69085

Closed
nsrip-dd opened this issue Aug 27, 2024 · 10 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@nsrip-dd
Copy link
Contributor

After upgrading to Go 1.23, we started seeing crashes like the following from the execution tracer:

SIGSEGV: segmentation violation
PC=0x47ed94 m=9 sigcode=128 addr=0x0

goroutine 0 gp=0xc005a641c0 m=9 mp=0xc000cfe308 [idle]:
runtime.fpTracebackPCs(...)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/tracestack.go:258
runtime.traceStack(0xc0062ba700?, 0x200000004?, 0x27)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/tracestack.go:113 +0x2d4 fp=0x76ad22c6a498 sp=0x76ad22c6a030 pc=0x47ed94
runtime.traceAdvance.func1.1()
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace.go:398 +0xbd fp=0x76ad22c6a4f0 sp=0x76ad22c6a498 pc=0x475a3d
runtime.systemstack(0x800000)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/asm_amd64.s:514 +0x4a fp=0x76ad22c6a500 sp=0x76ad22c6a4f0 pc=0x49320a

goroutine 132 gp=0xc005a7f880 m=9 mp=0xc000cfe308 [trace goroutine status]:
runtime.systemstack_switch()
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/asm_amd64.s:479 +0x8 fp=0xc003f1a988 sp=0xc003f1a978 pc=0x4931a8
runtime.traceAdvance.func1(0x0?)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace.go:374 +0x107 fp=0xc003f1aa38 sp=0xc003f1a988 pc=0x475867
runtime.forEachGRace(0xc003f1ab70)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/proc.go:709 +0x49 fp=0xc003f1aa68 sp=0xc003f1aa38 pc=0x456f09
runtime.traceAdvance(0x1)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace.go:360 +0xa5 fp=0xc003f1abb8 sp=0xc003f1aa68 pc=0x490f25
runtime.StopTrace(...)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace.go:318
runtime/trace.Stop()
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace/trace.go:148 +0x85 fp=0xc003f1abf0 sp=0xc003f1abb8 pc=0x9dfca5
gopkg.in/DataDog/dd-trace-go.v1/profiler.init.func4(0xc000ebde00)
        /go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:201 +0x1a5 fp=0xc003f1ac90 sp=0xc003f1abf0 pc=0x1d011a5
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc000ebde00, 0x7)
        /go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profile.go:345 +0x296 fp=0xc003f1ae78 sp=0xc003f1ac90 pc=0x1d04e76
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.func2(0x7)
        /go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:344 +0xe5 fp=0xc003f1afc8 sp=0xc003f1ae78 pc=0x1d08725
gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.gowrap2()
        /go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:353 +0x24 fp=0xc003f1afe0 sp=0xc003f1afc8 pc=0x1d08604
runtime.goexit({})
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc003f1afe8 sp=0xc003f1afe0 pc=0x4951c1
created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect in goroutine 121
        /go/pkg/mod/gopkg.in/!data!dog/[email protected]/profiler/profiler.go:339 +0x737

[ ... ]

goroutine 2679 gp=0xc0062ba700 m=7 mp=0xc004370708 [syscall (scan)]:
runtime.cgocall(0x2f63160, 0xc006605f58)
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/cgocall.go:167 +0x4b fp=0xc006605f30 sp=0xc006605ef8 pc=0x485e8b
github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_run_network()
        _cgo_gotypes.go:382 +0x47 fp=0xc006605f58 sp=0xc006605f30 pc=0x17b0ae7
github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork.func1()
        /go/pkg/mod/github.com/!data!dog/foundationdb/bindings/[email protected]/src/fdb/fdb.go:209 +0x17 fp=0xc006605fe0 sp=0xc006605f58 pc=0x17b8f57
runtime.goexit({})
        /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc006605fe8 sp=0xc006605fe0 pc=0x4951c1
created by github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork in goroutine 1
        /go/pkg/mod/github.com/!data!dog/foundationdb/bindings/[email protected]/src/fdb/fdb.go:208 +0x4b

[ ... ]

The full stack contains company code I'd prefer not to share broadly. I've dropped most of it, but I left the goroutine for which it appears traceAdvance is collecting a call stack.

I believe this is a problem with the syscallbp value introduced by https://go.dev/cl/580255. The traceStack function in the above traceback is using that value for unwinding here. From a core dump, we can see that gp.syscallsp and gp.sched.sp differ significantly from gp.syscallbp and gp.sched.bp for the goroutine:

(dlv) l
Goroutine 132 frame 10 at /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace.go:398 (PC: 0x475a3d)
Command failed: open /root/.gimme/versions/go1.23.0.linux.amd64/src/runtime/trace.go: no such file or directory
(dlv) p/x gp.sched
Command failed: command not available
(dlv) p %x gp.sched
runtime.gobuf {sp: c006605ee8, pc: 4931a8, g: c0062ba700, ctxt: unsafe.Pointer(0x0), ret: 0, lr: 0, bp: c002d67720}
(dlv) p %x gp.syscallbp
c002d67720
(dlv) p %x gp.syscallsp
c006605ef8

I suspect the issue here is that copystack needs to adjust gp.syscallbp, like it does for gp.sched.bp.

cc @mknyszek

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 27, 2024
@mknyszek
Copy link
Contributor

mknyszek commented Aug 27, 2024

Thanks for reporting!

I'm a bit surprised though. Stack movement during a syscall (and cgo call) is forbidden because pointers to stack-allocated values might be passed to the syscall. But then why is syscallbp stale? Hm...

@mknyszek mknyszek self-assigned this Aug 27, 2024
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 27, 2024
@mknyszek mknyszek added this to the Backlog milestone Aug 27, 2024
@mknyszek
Copy link
Contributor

Actually, why is this taking the fpTracebackPCs path at all? There's cgo on that stack.

@nsrip-dd
Copy link
Contributor Author

I'm a bit surprised though. Stack movement during a syscall (and cgo call) is forbidden because pointers to stack-allocated values might be passed to the syscall. But then why is syscallbp stale? Hm...

Indeed. I had initially suspected and then dismissed syscallbp problems for that reason. But if the C code ever calls back into Go, then cgocallbackg clears syscallsp before re-entering Go code so the GC would be free to move the stack. I'm not too familiar with the FoundationDB client library here, so I'm not positive that the C function in question ever calls Go code. FWIW I have a toy program where I can sometimes observe a stack shrink when a C function calls back into Go: https://go.dev/play/p/NvmZbCU4qGU

Actually, why is this taking the fpTracebackPCs path at all? There's cgo on that stack.

Do you know if the right M is checked in traceStack for this scenario? I think it would be gp.lockedm here but I'm not familiar with how that works.

@mknyszek
Copy link
Contributor

That detail about being in cgo on stack but the tracer not seeing it might be highly relevant. There is a window in between entersyscall and mp.ncgo++ in which such a state could be observed. Now the question is, could the stack be moved in this window? I'm currently thinking that maybe entersyscall is taking one of its systemstack off-ramps.

But the thing is, in entersyscall, we've got throwsplit set. So it is super unlikely that the syscalling goroutine is calling into copystack itself. But then also isShrinkStackSafe must always be false at this point because syscallsp is definitely non-zero.

One thing that's a little fishy is syscallbp isn't updated in the cgocallback case, but also that should be fine because the m is marked as isextra, so we should also be taking the traceStack cgo slow path.

Still just thinking out loud at this point...

@mknyszek
Copy link
Contributor

But if the C code ever calls back into Go, then cgocallbackg clears syscallsp before re-entering Go code so the GC would be free to move the stack. I'm not too familiar with the FoundationDB client library here, so I'm not positive that the C function in question ever calls Go code.

Ah! Good catch. Yeah, I think this is plausibly what's happening, but then (and I mean this rhetorically) why is this code valid at all?

@mknyszek
Copy link
Contributor

mknyszek commented Aug 27, 2024

Oh my goodness, I see it. savedsp is an unsafe.Pointer so stack movement knows to update it. savedbp is not. OK, fix incoming.

@mknyszek
Copy link
Contributor

@gopherbot Please open a backport issue for Go 1.23.

This bug is causing random crashes when using the tracer that have no workaround. It is also a regression that only exists in Go 1.23. The fix is also small and simple, but admittedly subtle.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #69087 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/608775 mentions this issue: runtime: store bp on cgocallback as unsafe.Pointer

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/608835 mentions this issue: [release-branch.go1.23] runtime: store bp on cgocallback as unsafe.Pointer

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 27, 2024
@dmitshur dmitshur modified the milestones: Backlog, Go1.24 Aug 27, 2024
gopherbot pushed a commit that referenced this issue Aug 28, 2024
…inter

As of CL 580255, the runtime tracks the frame pointer (or base pointer,
bp) when entering syscalls, so that we can use fpTracebackPCs on
goroutines that are sitting in syscalls. That CL mostly got things
right, but missed one very subtle detail.

When calling from Go->C->Go, the goroutine stack performing the calls
when returning to Go is free to move around in memory due to growth,
shrinking, etc. But upon returning back to C, it needs to restore
gp.syscall*, including gp.syscallsp and gp.syscallbp. The way syscallsp
currently gets updated is automagically: it's stored as an
unsafe.Pointer on the stack so that it shows up in a stack map. If the
stack ever moves, it'll get updated correctly. But gp.syscallbp isn't
saved to the stack as an unsafe.Pointer, but rather as a uintptr, so it
never gets updated! As a result, in rare circumstances, fpTracebackPCs
can correctly try to use gp.syscallbp as the starting point for the
traceback, but the value is stale.

This change fixes the problem by just storing gp.syscallbp to the stack
on cgocallback as an unsafe.Pointer, like gp.syscallsp. It also adds a
comment documenting this subtlety; the lack of explanation for the
unsafe.Pointer type on syscallsp meant this detail was missed -- let's
not miss it again in the future.

Now, we have a fix, what about a test? Unfortunately, testing this is
going to be incredibly annoying because the circumstances under which
gp.syscallbp are actually used for traceback are non-deterministic and
hard to arrange, especially from within testprogcgo where we don't have
export_test.go and can't reach into the runtime.

So, instead, add a gp.syscallbp check to reentersyscall and
entersyscallblock that mirrors the gp.syscallbp consistency check. This
probably causes some miniscule slowdown to the syscall path, but it'll
catch the issue without having to actually perform a traceback.

For #69085.
Fixes #69087.

Change-Id: Iaf771758f1666024b854f5fbe2b2c63cbe35b201
Reviewed-on: https://go-review.googlesource.com/c/go/+/608775
Reviewed-by: Nick Ripley <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
(cherry picked from commit 54fe0fd)
Reviewed-on: https://go-review.googlesource.com/c/go/+/608835
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

4 participants