runtime: fatal error: checkdead: runnable g #40368
I tried running a couple of other configurations.
The issue reproduces with |
Can you reproduce it with the environment variable |
Dealing with gdb is a pain, but if I'm reading the stack correctly, I believe this G exited the syscall, failed to acquire a P, added itself to the global runq, and then called stopm. The mput -> checkdead call from stopm is the one throwing.
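To make that call path easier to follow, here is a heavily condensed sketch of the sequence described above. The function names mirror runtime/proc.go, but the types and bodies below are simplified stand-ins for illustration, not the real implementation:

```go
// Simplified stand-ins; the real runtime types and logic are far more involved.
package sketch

type g struct{}
type m struct{}

var sched struct {
	runq   []*g // global run queue
	nmidle int  // number of idle Ms
}

// exitsyscall0: the G returning from a syscall could not reacquire a P,
// so it is made runnable on the global run queue and this M stops itself.
func exitsyscall0(gp *g) {
	globrunqput(gp)
	stopm()
}

func globrunqput(gp *g) { sched.runq = append(sched.runq, gp) }

// stopm parks the current M on the idle list via mput, which is where
// checkdead is invoked.
func stopm() { mput(&m{}) }

func mput(mp *m) {
	sched.nmidle++
	checkdead()
}

// checkdead: if no M is counted as running but a runnable G exists, the
// runtime considers the state impossible and throws. (The runq check here
// is a stand-in for the real check, which scans goroutine statuses.)
func checkdead() {
	if len(sched.runq) > 0 /* && no M counted as running */ {
		panic("checkdead: runnable g")
	}
}
```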
Also of interest is that sysmon is blocked waiting on sched.lock:
At first glance this isn't surprising since |
I'm unable to reproduce it with |
I think I found a sequence of events that might lead to this error.
Is that plausible? |
@slon The code in |
For posterity, I've more explicitly walked the stack to confirm the call path in #40368 (comment) is correct.
@prattmic I think you've found the problem. In short, the problem is that |
One fix might be for |
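Condensing what the fix CL quoted at the end of this thread spells out, the window looks roughly like this (scheduler names as in runtime/proc.go, steps heavily abbreviated):

```
M1: in startm()                          M2: returning from a syscall
(e.g. to run an expired timer)

lock(&sched.lock)
pp := pidleget()   // takes the only idle P
// no idle M available
unlock(&sched.lock)
                                         exitsyscall: finds no idle P
                                         globrunqput(gp)   // G is now runnable
                                         lock(&sched.lock) // in stopm
                                         mput(m) -> checkdead()
                                         // the M that will run pp has not been
                                         // created yet, so zero Ms are counted
                                         // as running while a runnable G sits
                                         // on sched.runq:
                                         // throw("checkdead: runnable g")
newm(pp)   // too late
```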
After spending too long on this, I'm able to reproduce the crash with a program racing syscall exit with timer expiration, plus strategically placed sleeps in the runtime to help increase the likelihood of racing properly:

```go
package main

import (
	"syscall"
	"time"
)

func main() {
	t := time.NewTimer(100 * time.Millisecond)
	defer t.Stop()

	ts := syscall.Timespec{
		Nsec: int64(100*time.Millisecond - 4*time.Microsecond),
	}

	for {
		if err := syscall.Nanosleep(&ts, nil); err != syscall.EINTR {
			break
		}
	}
}
```

A program like this is going to have 3 Ms:
In order to trigger the crash, we need the race to proceed in this order:
Perhaps a more complex program could convince the spinning M to already be stopped, in which case this race would be much easier to trigger? As-is, it is very tricky. I'll try to send out a fix on Monday, but it will likely not include a test due to the difficulty of triggering.
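(The Nanosleep interval in the program above is a few microseconds shorter than the 100ms timer, presumably so that the syscall return and the timer expiration land almost simultaneously; the EINTR loop simply retries the sleep if a signal cuts it short.)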
@gopherbot backport to 1.14 please. This crash could affect any program running with GOMAXPROCS=1.
Backport issue(s) opened: #40398 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
@prattmic the backport issue automatically has a 1.14.x milestone, so this would be 1.15 (and release-blocker?) or 1.16.
As networkimprov said, the backport already has a 1.14.7 milestone, so I'm putting this one in 1.16 (move it to 1.15 if you think this is 1.15 material).
Would this kind of backtrace be related to this issue?
'''
@networkimprov @ALTree thanks, this is fine. If it doesn't make 1.15, it will probably want to be in the first minor release, but I don't think it needs to be blocking. @apmattil it is difficult to tell from the stack trace. What was the panic message printed by that crash?
I'm sorry, I did not get the output; it happened in our Robot Framework tests, which in this case do not save it.
It happened a couple of times, and I got 6 cores of the same kind.
Is there anything else I can get or check?
I think it does not appear when I have a non-optimized build.
Change https://golang.org/cl/245018 mentions this issue: runtime: ensure startm new M is consistently visible to checkdead
@apmattil If you are running 1.14.6, then the line number in the stack trace indicates it is this case: https://github.com/golang/go/blob/go1.14.6/src/runtime/proc.go#L4386. That is different from this bug, so you should file a separate issue for that.
I used the 1.14.4 compiler.
I do get the sched variable at frame 11:
```
(dlv) frame 11
runtime.raise() /usr/local/go/src/runtime/sys_linux_amd64.s:165 (PC: 0x465511)
Warning: debugging optimized function
Frame 11: /usr/local/go/src/runtime/proc.go:4386 (PC: 440a12)
  4381:	run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys
  4382:	if run > run0 {
  4383:		return
  4384:	}
  4385:	if run < 0 {
=>4386:		print("runtime: checkdead: nmidle=", sched.nmidle, " nmidlelocked=", sched.nmidlelocked, " mcount=", mcount(), " nmsys=", sched.nmsys, "\n")
  4387:		throw("checkdead: inconsistent counts")
  4388:	}
  4389:
  4390:	grunning := 0
  4391:	lock(&allglock)
(dlv) p sched
runtime.schedt {
	goidgen: 0,
	lastpoll: 0,
	pollUntil: 0,
	lock: runtime.mutex {key: 0},
	midle: 0,
	nmidle: 0,
	nmidlelocked: 0,
	mnext: 0,
	maxmcount: 0,
	nmsys: 0,
	nmfreed: 0,
	ngsys: 0,
	pidle: 0,
	npidle: 0,
	nmspinning: 0,
	runq: runtime.gQueue {head: 0, tail: 0},
	runqsize: 0,
	disable: struct { runtime.user bool; runtime.runnable runtime.gQueue; runtime.n int32 } {
		user: false,
		runnable: (*runtime.gQueue)(0x1367030),
		n: 0,},
	gFree: struct { runtime.lock runtime.mutex; runtime.stack runtime.gList; runtime.noStack runtime.gList; runtime.n int32 } {
		lock: (*runtime.mutex)(0x1367048),
		stack: (*runtime.gList)(0x1367050),
		noStack: (*runtime.gList)(0x1367058),
		n: 0,},
	sudoglock: runtime.mutex {key: 0},
	sudogcache: *runtime.sudog nil,
	deferlock: runtime.mutex {key: 0},
	deferpool: [5]*runtime._defer [
		*nil,
		*nil,
		*nil,
		*nil,
		*nil,
	],
	freem: *runtime.m nil,
	gcwaiting: 0,
	stopwait: 0,
	stopnote: runtime.note {key: 0},
	sysmonwait: 0,
	sysmonnote: runtime.note {key: 0},
	safePointFn: nil,
	safePointWait: 0,
	safePointNote: runtime.note {key: 0},
	profilehz: 0,
	procresizetime: 0,
	totaltime: 0,}
(dlv)
```
Change https://golang.org/cl/245297 mentions this issue: |
Change https://golang.org/cl/246199 mentions this issue: |
…visible to checkdead

If no M is available, startm first grabs an idle P, then drops sched.lock and calls newm to start a new M to run that P. Unfortunately, that leaves a window in which a G (e.g., returning from a syscall) may find no idle P, add to the global runq, and then in stopm discover that there are no running M's, a condition that should be impossible with runnable G's.

To avoid this condition, we pre-allocate the new M ID in startm before dropping sched.lock. This ensures that checkdead will see the M as running, and since that new M must eventually run the scheduler, it will handle any pending work as necessary.

Outside of startm, most other calls to newm/allocm don't have a P at all. The only exception is startTheWorldWithSema, which always has an M if there is 1 P (i.e., the currently running M), and if there is >1 P the findrunnable spinning dance ensures the problem never occurs.

This has been tested with strategically placed sleeps in the runtime to help induce the correct race ordering, but the timing on this is too narrow for a test that can be checked in.

For #40368
Fixes #40398

Change-Id: If5e0293a430cc85154b7ed55bc6dadf9b340abe2
Reviewed-on: https://go-review.googlesource.com/c/go/+/245018
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
(cherry picked from commit 85afa2e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/245297
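A minimal sketch of the shape of that fix, reserving the new M's ID while sched.lock is still held. The types, bodies, and the helper name reserveMID below are invented stand-ins for illustration, not the actual CL:

```go
// Simplified stand-ins illustrating the shape of the fix; not the real
// runtime code, and the helper names here are invented for the sketch.
package sketch

type p struct{}

type mutex struct{}

func lock(l *mutex)   {}
func unlock(l *mutex) {}

var sched struct {
	lock  mutex
	mnext int64 // next M ID; mcount() (used by checkdead) derives from it
}

// startm, after the fix: the new M's ID is reserved while sched.lock is
// still held, so a concurrent checkdead (which also runs under sched.lock)
// already counts the not-yet-created M as running.
func startm(pp *p) {
	lock(&sched.lock)
	// ... take an idle P if pp == nil, try to reuse an idle M ...
	// Assume no idle M was found, so a new one must be created.
	id := reserveMID() // before the fix, the ID was only allocated inside
	// newm/allocm, after sched.lock had already been dropped
	unlock(&sched.lock)
	newm(pp, id)
}

func reserveMID() int64 {
	id := sched.mnext
	sched.mnext++
	return id
}

func newm(pp *p, id int64) {
	// ... allocate the M with the pre-reserved id and clone an OS thread
	// that will eventually enter the scheduler and run any pending work ...
	_, _ = pp, id
}
```

Because checkdead also runs with sched.lock held, it can no longer observe the intermediate state in which the P has been handed off but no M is accounted for.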
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
I encountered the following error. The panic reliably happens in about one in 10,000 jobs on a mapreduce cluster. My binary is built with CGO_ENABLED=0 and is started with GOMAXPROCS=1.
I'm unable to reproduce this issue locally, but I was able to collect a coredump. The binary and core dump are inside the checkdead.zip archive. My binary reads data from stdin and writes data to stdout, without spawning any additional goroutines.
I would be glad to hear any suggestions on how to further diagnose this issue.