runtime: fatal error: checkdead: runnable g #40368
I tried running a couple of other configurations.
The issue reproduces with |
Can you reproduce it with the environment variable |
Dealing with gdb is a pain, but if I'm reading the stack correctly, I believe this G exited the syscall, failed to acquire a P, added itself to the global runq, and then called stopm. The mput -> checkdead call from stopm is the one throwing.
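To make that call path easier to follow, here is a heavily condensed sketch of the sequence described above. The function names mirror runtime/proc.go, but the types and bodies below are simplified stand-ins for illustration, not the real implementation:

```go
// Simplified stand-ins; the real runtime types and logic are far more involved.
package sketch

type g struct{}
type m struct{}

var sched struct {
	runq   []*g // global run queue
	nmidle int  // number of idle Ms
}

// exitsyscall0: the G returning from a syscall could not reacquire a P,
// so it is made runnable on the global run queue and this M stops itself.
func exitsyscall0(gp *g) {
	globrunqput(gp)
	stopm()
}

func globrunqput(gp *g) { sched.runq = append(sched.runq, gp) }

// stopm parks the current M on the idle list via mput, which is where
// checkdead is invoked.
func stopm() { mput(&m{}) }

func mput(mp *m) {
	sched.nmidle++
	checkdead()
}

// checkdead: if no M is counted as running but a runnable G exists, the
// runtime considers the state impossible and throws. (The runq check here
// is a stand-in for the real check, which scans goroutine statuses.)
func checkdead() {
	if len(sched.runq) > 0 /* && no M counted as running */ {
		panic("checkdead: runnable g")
	}
}
```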
Also of interest is that sysmon is blocked waiting on sched.lock:
At first glance this isn't surprising since |
I'm unable to reproduce it with |
I think I found a sequence of events that might lead to this error.
Is that plausible? |
@slon The code in |
For posterity, I've more explicitly walked the stack to confirm the call path in #40368 (comment) is correct.
@prattmic I think you've found the problem. In short, the problem is that |
One fix might be for |
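Condensing what the fix CL quoted at the end of this thread spells out, the window looks roughly like this (scheduler names as in runtime/proc.go, steps heavily abbreviated):

```
M1: in startm()                          M2: returning from a syscall
(e.g. to run an expired timer)

lock(&sched.lock)
pp := pidleget()   // takes the only idle P
// no idle M available
unlock(&sched.lock)
                                         exitsyscall: finds no idle P
                                         globrunqput(gp)   // G is now runnable
                                         lock(&sched.lock) // in stopm
                                         mput(m) -> checkdead()
                                         // the M that will run pp has not been
                                         // created yet, so zero Ms are counted
                                         // as running while a runnable G sits
                                         // on sched.runq:
                                         // throw("checkdead: runnable g")
newm(pp)   // too late
```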
After spending too long on this, I'm able to reproduce the crash with a program racing syscall exit with timer expiration, plus strategically placed sleeps in the runtime to help increase the likelihood of racing properly:

```go
package main

import (
	"syscall"
	"time"
)

func main() {
	t := time.NewTimer(100 * time.Millisecond)
	defer t.Stop()

	ts := syscall.Timespec{
		Nsec: int64(100*time.Millisecond - 4*time.Microsecond),
	}

	for {
		if err := syscall.Nanosleep(&ts, nil); err != syscall.EINTR {
			break
		}
	}
}
```

A program like this is going to have 3 Ms:
In order to trigger the crash, we need the race to proceed in this order:
Perhaps a more complex program could convince the spinning M to already be stopped, in which case this race would be much easier to trigger? As-is, it is very tricky. I'll try to send out a fix on Monday, but it will likely not include a test due to the difficulty of triggering.
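(The Nanosleep interval in the program above is a few microseconds shorter than the 100ms timer, presumably so that the syscall return and the timer expiration land almost simultaneously; the EINTR loop simply retries the sleep if a signal cuts it short.)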
@gopherbot backport to 1.14 please. This crash could affect any program running with GOMAXPROCS=1.
Backport issue(s) opened: #40398 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
@prattmic the backport issue automatically has a 1.14.x milestone, so this would be 1.15 (and release-blocker?) or 1.16.
As networkimprov said, the backport already has a 1.14.7 milestone, so I'm putting this one in 1.16 (move it to 1.15 if you think this is 1.15 material).
Would this kind of backtrace be related to this issue?
'''
@networkimprov @ALTree thanks, this is fine. If it doesn't make 1.15, it will probably want to be in the first minor release, but I don't think it needs to be blocking. @apmattil it is difficult to tell from the stack trace. What was the panic message printed by that crash?
I'm sorry, I did not get the output; it happened in our Robot Framework tests, which in this case do not save it.
It happened a couple of times, and I got 6 cores of the same kind.
Is there anything else I can get or check?
I think it does not appear when I have a non-optimized build.
Change https://golang.org/cl/245018 mentions this issue: runtime: ensure startm new M is consistently visible to checkdead
@apmattil If you are running 1.14.6, then the line number in the stack trace indicates it is this case: https://github.com/golang/go/blob/go1.14.6/src/runtime/proc.go#L4386. That is different from this bug, so you should file a separate issue for that.
I used the 1.14.4 compiler.
I do get the sched variable at frame 11:
```
(dlv) frame 11
runtime.raise() /usr/local/go/src/runtime/sys_linux_amd64.s:165 (PC: 0x465511)
Warning: debugging optimized function
Frame 11: /usr/local/go/src/runtime/proc.go:4386 (PC: 440a12)
  4381:	run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys
  4382:	if run > run0 {
  4383:		return
  4384:	}
  4385:	if run < 0 {
=>4386:		print("runtime: checkdead: nmidle=", sched.nmidle, " nmidlelocked=", sched.nmidlelocked, " mcount=", mcount(), " nmsys=", sched.nmsys, "\n")
  4387:		throw("checkdead: inconsistent counts")
  4388:	}
  4389:
  4390:	grunning := 0
  4391:	lock(&allglock)
(dlv) p sched
runtime.schedt {
	goidgen: 0,
	lastpoll: 0,
	pollUntil: 0,
	lock: runtime.mutex {key: 0},
	midle: 0,
	nmidle: 0,
	nmidlelocked: 0,
	mnext: 0,
	maxmcount: 0,
	nmsys: 0,
	nmfreed: 0,
	ngsys: 0,
	pidle: 0,
	npidle: 0,
	nmspinning: 0,
	runq: runtime.gQueue {head: 0, tail: 0},
	runqsize: 0,
	disable: struct { runtime.user bool; runtime.runnable runtime.gQueue; runtime.n int32 } {
		user: false,
		runnable: (*runtime.gQueue)(0x1367030),
		n: 0,},
	gFree: struct { runtime.lock runtime.mutex; runtime.stack runtime.gList; runtime.noStack runtime.gList; runtime.n int32 } {
		lock: (*runtime.mutex)(0x1367048),
		stack: (*runtime.gList)(0x1367050),
		noStack: (*runtime.gList)(0x1367058),
		n: 0,},
	sudoglock: runtime.mutex {key: 0},
	sudogcache: *runtime.sudog nil,
	deferlock: runtime.mutex {key: 0},
	deferpool: [5]*runtime._defer [
		*nil,
		*nil,
		*nil,
		*nil,
		*nil,
	],
	freem: *runtime.m nil,
	gcwaiting: 0,
	stopwait: 0,
	stopnote: runtime.note {key: 0},
	sysmonwait: 0,
	sysmonnote: runtime.note {key: 0},
	safePointFn: nil,
	safePointWait: 0,
	safePointNote: runtime.note {key: 0},
	profilehz: 0,
	procresizetime: 0,
	totaltime: 0,}
(dlv)
```
Change https://golang.org/cl/245297 mentions this issue: |
Change https://golang.org/cl/246199 mentions this issue: |
…visible to checkdead

If no M is available, startm first grabs an idle P, then drops sched.lock and calls newm to start a new M to run that P. Unfortunately, that leaves a window in which a G (e.g., returning from a syscall) may find no idle P, add to the global runq, and then in stopm discover that there are no running M's, a condition that should be impossible with runnable G's.

To avoid this condition, we pre-allocate the new M ID in startm before dropping sched.lock. This ensures that checkdead will see the M as running, and since that new M must eventually run the scheduler, it will handle any pending work as necessary.

Outside of startm, most other calls to newm/allocm don't have a P at all. The only exception is startTheWorldWithSema, which always has an M if there is 1 P (i.e., the currently running M), and if there is >1 P the findrunnable spinning dance ensures the problem never occurs.

This has been tested with strategically placed sleeps in the runtime to help induce the correct race ordering, but the timing on this is too narrow for a test that can be checked in.

For #40368
Fixes #40398

Change-Id: If5e0293a430cc85154b7ed55bc6dadf9b340abe2
Reviewed-on: https://go-review.googlesource.com/c/go/+/245018
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
(cherry picked from commit 85afa2e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/245297
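A minimal sketch of the shape of that fix, reserving the new M's ID while sched.lock is still held. The types, bodies, and the helper name reserveMID below are invented stand-ins for illustration, not the actual CL:

```go
// Simplified stand-ins illustrating the shape of the fix; not the real
// runtime code, and the helper names here are invented for the sketch.
package sketch

type p struct{}

type mutex struct{}

func lock(l *mutex)   {}
func unlock(l *mutex) {}

var sched struct {
	lock  mutex
	mnext int64 // next M ID; mcount() (used by checkdead) derives from it
}

// startm, after the fix: the new M's ID is reserved while sched.lock is
// still held, so a concurrent checkdead (which also runs under sched.lock)
// already counts the not-yet-created M as running.
func startm(pp *p) {
	lock(&sched.lock)
	// ... take an idle P if pp == nil, try to reuse an idle M ...
	// Assume no idle M was found, so a new one must be created.
	id := reserveMID() // before the fix, the ID was only allocated inside
	// newm/allocm, after sched.lock had already been dropped
	unlock(&sched.lock)
	newm(pp, id)
}

func reserveMID() int64 {
	id := sched.mnext
	sched.mnext++
	return id
}

func newm(pp *p, id int64) {
	// ... allocate the M with the pre-reserved id and clone an OS thread
	// that will eventually enter the scheduler and run any pending work ...
	_, _ = pp, id
}
```

Because checkdead also runs with sched.lock held, it can no longer observe the intermediate state in which the P has been handed off but no M is accounted for.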
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
I encountered the following error. The panic reliably happens in about one in 10,000 jobs on a mapreduce cluster. My binary is built with CGO_ENABLED=0 and is started with GOMAXPROCS=1.
I'm unable to reproduce this issue locally, but I was able to collect a coredump. The binary and core dump are inside the checkdead.zip archive. My binary reads data from stdin and writes data to stdout, without spawning any additional goroutines.
I would be glad to hear any suggestions on how to further diagnose this issue.