Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: test flaky on darwin #31264

Closed
bcmills opened this issue Apr 4, 2019 · 34 comments
Closed

os/signal: test flaky on darwin #31264

bcmills opened this issue Apr 4, 2019 · 34 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Apr 4, 2019

The os/signal test seems to time out fairly frequently on the darwin-amd64-race builder.
It's not obvious to me whether the test is deadlocked or just slow.

Example: https://build.golang.org/log/7c20aa477d782f38c6184e7f8bce05c9648b57b3

panic: test timed out after 3m0s

goroutine 20 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1332 +0x11c
created by time.goFunc
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/time/sleep.go:169 +0x52

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0000c0000, 0x42154f9, 0x9, 0x421e888, 0x1)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:917 +0x68a
testing.runTests.func1(0xc0000c0000)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1157 +0xa7
testing.tRunner(0xc0000c0000, 0xc000046db0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:865 +0x163
testing.runTests(0xc00000e060, 0x435ace0, 0xb, 0xb, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1155 +0x522
testing.(*M).Run(0xc0000ba000, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1072 +0x2ea
main.main()
	_testmain.go:64 +0x21f

goroutine 4 [syscall]:
os/signal.signal_recv(0x4247580)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:23 +0x30
created by os/signal.init.0
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:29 +0x4f

goroutine 10 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:266
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x31
os/signal.Stop(0xc00009c300)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x2f3
runtime.Goexit()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/panic.go:503 +0xec
testing.(*common).FailNow(0xc0000c0300)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:609 +0x5b
testing.(*common).Fatalf(0xc0000c0300, 0x4218523, 0x16, 0xc0000e6d18, 0x1, 0x1)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:672 +0x9f
os/signal.waitSig(0xc0000c0300, 0xc00009c2a0, 0x4247580, 0x42436a8)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:32 +0x309
os/signal.testCancel(0xc0000c0300, 0xc00003b700)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:133 +0x1f9
os/signal.TestReset(0xc0000c0300)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:187 +0x3e
testing.tRunner(0xc0000c0300, 0x421e888)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:865 +0x163
created by testing.(*T).Run
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:916 +0x652

goroutine 8 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:266
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x31
os/signal.Stop(0xc00009c180)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x2f3
os/signal.TestStress.func1(0xc000022480, 0xc0000224e0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:92 +0x194
created by os/signal.TestStress
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:79 +0xe1
FAIL	os/signal	180.028s
@bcmills bcmills added Testing An issue that has been verified to require only test changes, not just a test failure. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 4, 2019
@bcmills bcmills added this to the Go1.13 milestone Apr 4, 2019
@bcmills
Copy link
Contributor Author

bcmills commented Apr 4, 2019

CC @ianlancetaylor for os/signal

@ianlancetaylor
Copy link
Member

In the stack trace two goroutines are stuck in this loop in signalWaitUntilIdle (in runtime/sigqueue.go):

	for atomic.Load(&sig.delivering) != 0 {
		Gosched()
	}

sig.delivering can only be non-zero while executing sigsend, also in runtime/sigqueue.go. sigsend is invoked by the signal handler. There is no code in sigsend that blocks, and sigsend always returns with sig.delivering unchanged.

So this seems to be impossible. The failure presumably has something to do with the race detector, but I don't know why it would be Darwin specific.

@bradfitz
Copy link
Contributor

bradfitz commented Apr 5, 2019

@dvyukov for any ideas.

@bcmills
Copy link
Contributor Author

bcmills commented Apr 5, 2019

Some thoughts:

@bradfitz
Copy link
Contributor

bradfitz commented Apr 5, 2019

Not just race, it turns out. @andybons notes that it's also here:

https://build.golang.org/log/65130fd6cf64710896cae0a569087f8fa8df66fc (darwin-amd64-10_14)

@bradfitz bradfitz changed the title os/signal: test flaky on darwin-amd64-race builder os/signal: test flaky on darwin-amd64 Apr 5, 2019
@bradfitz
Copy link
Contributor

bradfitz commented Apr 5, 2019

And that last one is on the Go 1.12 release branch.

@randall77, @ianlancetaylor, should this block the Go 1.12.2 release?

@ianlancetaylor
Copy link
Member

It's hard to believe that this is new in 1.12.2 compared to 1.12, so I don't see a reason to block the 1.12.2 release.

@rsc
Copy link
Contributor

rsc commented May 14, 2019

Just saw this on my pre-Mojave Mac as well. It's too bad the stack trace does not include any signal-handling thread stacks. It looks for all the world like sig.delivering is non-zero and it would be very nice to see where exactly the signal-handling thread is.

panic: test timed out after 9m0s

goroutine 24 [running]:
testing.(*M).startAlarm.func1()
	/Users/rsc/go/src/testing/testing.go:1380 +0xdf
created by time.goFunc
	/Users/rsc/go/src/time/sleep.go:169 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000152100, 0x4167a64, 0x9, 0x4170e60, 0x4078a01)
	/Users/rsc/go/src/testing/testing.go:964 +0x377
testing.runTests.func1(0xc0000fe000)
	/Users/rsc/go/src/testing/testing.go:1205 +0x78
testing.tRunner(0xc0000fe000, 0xc000054e30)
	/Users/rsc/go/src/testing/testing.go:912 +0xbf
testing.runTests(0xc0000d4020, 0x426fca0, 0xb, 0xb, 0x0)
	/Users/rsc/go/src/testing/testing.go:1203 +0x2a7
testing.(*M).Run(0xc0000fa000, 0x0)
	/Users/rsc/go/src/testing/testing.go:1120 +0x160
main.main()
	_testmain.go:64 +0x135

goroutine 34 [syscall]:
os/signal.signal_recv(0x41931e0)
	/Users/rsc/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/Users/rsc/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/Users/rsc/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 23 [runnable]:
runtime.Gosched(...)
	/Users/rsc/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/Users/rsc/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc0000be240)
	/Users/rsc/go/src/os/signal/signal.go:196 +0x18c
runtime.Goexit()
	/Users/rsc/go/src/runtime/panic.go:503 +0xec
testing.(*common).FailNow(0xc000152100)
	/Users/rsc/go/src/testing/testing.go:656 +0x39
testing.(*common).Fatalf(0xc000152100, 0x416ab68, 0x16, 0xc000043588, 0x1, 0x1)
	/Users/rsc/go/src/testing/testing.go:719 +0x90
os/signal.waitSig(0xc000152100, 0xc0000be1e0, 0x41931e0, 0x418f4e0)
	/Users/rsc/go/src/os/signal/signal_test.go:32 +0x247
os/signal.testCancel(0xc000152100, 0x2be993261bf00)
	/Users/rsc/go/src/os/signal/signal_test.go:133 +0x1ce
os/signal.TestReset(0xc000152100)
	/Users/rsc/go/src/os/signal/signal_test.go:187 +0x30
testing.tRunner(0xc000152100, 0x4170e60)
	/Users/rsc/go/src/testing/testing.go:912 +0xbf
created by testing.(*T).Run
	/Users/rsc/go/src/testing/testing.go:963 +0x350

goroutine 21 [runnable]:
runtime.Gosched(...)
	/Users/rsc/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/Users/rsc/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc000024120)
	/Users/rsc/go/src/os/signal/signal.go:196 +0x18c
os/signal.TestStress.func1(0xc000138120, 0xc000138180)
	/Users/rsc/go/src/os/signal/signal_test.go:92 +0x160
created by os/signal.TestStress
	/Users/rsc/go/src/os/signal/signal_test.go:79 +0xd3
FAIL	os/signal	540.197s

@bcmills
Copy link
Contributor Author

bcmills commented May 30, 2019

One on darwin-arm64-corellium (https://build.golang.org/log/89688609340e1b045e263677ed0bf9b0cdd84710):

panic: test timed out after 3m0s

goroutine 10 [running]:
testing.(*M).startAlarm.func1()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1382 +0xc0
created by time.goFunc
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/time/sleep.go:169 +0x38

goroutine 1 [chan receive]:
testing.(*T).Run(0x130112200, 0x10248ee96, 0x9, 0x1024981c0, 0x5ce80e01)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:961 +0x2b4
testing.runTests.func1(0x1300f4000)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1207 +0x6c
testing.tRunner(0x1300f4000, 0x13003edf8)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:909 +0xa8
testing.runTests(0x1300b8060, 0x102596e20, 0xb, 0xb, 0x0)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1205 +0x24c
testing.(*M).Run(0x1300e0000, 0x0)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1122 +0x140
main.main()
	_testmain.go:66 +0x144

goroutine 19 [syscall]:
os/signal.signal_recv(0x1024bbc60)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/sigqueue.go:139 +0xb8
os/signal.loop()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_unix.go:23 +0x18
created by os/signal.init.0
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_unix.go:29 +0x30

goroutine 9 [runnable]:
runtime.Gosched(...)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/sigqueue.go:172 +0x2c
os/signal.Stop(0x130080360)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal.go:196 +0x1a8
runtime.Goexit()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/panic.go:503 +0xdc
testing.(*common).FailNow(0x130112200)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:653 +0x30
testing.(*common).Fatalf(0x130112200, 0x102491f08, 0x16, 0x13003dd78, 0x1, 0x1)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:716 +0x70
os/signal.waitSig(0x130112200, 0x130080300, 0x1024bbc60, 0x1024b7e80)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:32 +0x19c
os/signal.testCancel(0x130112200, 0x13003df00)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:133 +0x188
os/signal.TestReset(0x130112200)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:187 +0x24
testing.tRunner(0x130112200, 0x1024981c0)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:909 +0xa8
created by testing.(*T).Run
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:960 +0x298

goroutine 7 [runnable]:
runtime.Gosched(...)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/sigqueue.go:172 +0x2c
os/signal.Stop(0x1300ca480)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal.go:196 +0x1a8
os/signal.TestStress.func1(0x130128120, 0x130128180)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:92 +0x110
created by os/signal.TestStress
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:79 +0xd0
FAIL	os/signal	180.141s

@bcmills bcmills changed the title os/signal: test flaky on darwin-amd64 os/signal: test flaky on darwin May 30, 2019
@bcmills
Copy link
Contributor Author

bcmills commented Jun 3, 2019

darwin-amd64-10_12:

panic: test timed out after 3m0s

goroutine 21 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1382 +0xdf
created by time.goFunc
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/time/sleep.go:169 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00013c100, 0x4164e26, 0x9, 0x416e1f8, 0x4077c01)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:961 +0x377
testing.runTests.func1(0xc0000f2000)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1207 +0x78
testing.tRunner(0xc0000f2000, 0xc000046e30)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:909 +0xbf
testing.runTests(0xc0000d0020, 0x426ec60, 0xb, 0xb, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1205 +0x2a7
testing.(*M).Run(0xc0000e6000, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1122 +0x160
main.main()
	_testmain.go:66 +0x135

goroutine 34 [syscall]:
os/signal.signal_recv(0x41903a0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 20 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc0000ae1e0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x18c
runtime.Goexit()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/panic.go:503 +0xec
testing.(*common).FailNow(0xc00013c100)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:653 +0x39
testing.(*common).Fatalf(0xc00013c100, 0x4167f1f, 0x16, 0xc00004bd88, 0x1, 0x1)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:716 +0x90
os/signal.waitSig(0xc00013c100, 0xc0000ae180, 0x41903a0, 0x418c628)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:32 +0x247
os/signal.testCancel(0xc00013c100, 0x5b15d2f4400)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:133 +0x1ce
os/signal.TestReset(0xc00013c100)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:187 +0x30
testing.tRunner(0xc00013c100, 0x416e1f8)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:909 +0xbf
created by testing.(*T).Run
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:960 +0x350

goroutine 5 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc0000b2180)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x18c
os/signal.TestStress.func1(0xc0000220c0, 0xc000022120)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:92 +0x160
created by os/signal.TestStress
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:79 +0xee
FAIL	os/signal	180.025s

@ianlancetaylor
Copy link
Member

This has been happening for a while. The first occurrence seems to be on June 18, 2018:

https://build.golang.org/log/d464d075b04b100b96b7e9b7614755c02efd9dfa

The first occurrence that is not on the race builder is from July 13, 2018, but is on 386:

https://build.golang.org/log/5a14c8368922d5d3c4a19e4e7aff37111d916bd3

The first occurrence on amd64, not on the race builder, is July 18, 2018:

https://build.golang.org/log/0666f3b1dfa397938a4e93fe89afb97bdd2769e7

That is an interesting one, because it shows time.Sleep(10 * time.Millisecond) as sleeping for 3 minutes. The next more typical occurrence on amd64, not on the race builder, is from August 20, 2018:

https://build.golang.org/log/924b174c273ad0f136081ef3d1d9751bada8253b

That shows the most common pattern: a goroutine started by TestStress is hanging waiting for a deferred call to Stop to complete, and the test TestReset is also hanging waiting for a deferred call to Stop to complete. The deferred call to Stop from TestReset is occurring because TestReset has failed, timed out waiting for a SIGWINCH to be delivered.

June 2018, when the problems started happening, is when the runtime package was changing to move some operations from direct system calls to library calls. E.g., https://golang.org/cl/118819, https://golang.org/cl/118736. Notably https://golang.org/cl/117176 was committed on June 12, 2018: "runtime: use libc for signal functions on iOS".

@ianlancetaylor
Copy link
Member

Actually a possibly more relevant CL is https://golang.org/cl/116875, committed June 12, 2018: "runtime: use libc's signal functions on Darwin".

@ianlancetaylor
Copy link
Member

Through sheer luck I triggered the problem on a gomote. The process appears to be burning CPU. I attached to it using lldb and got this stack trace:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff60d11f06 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff60dc8d52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff60dc64cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x000000000405f57c signal.test`runtime.pthread_mutex_lock_trampoline + 12
    frame #4: 0x000000000405cf8d signal.test`runtime.asmcgocall + 173
    frame #5: 0x000000000405f570 signal.test`runtime.pthread_mutex_init_trampoline + 32
    frame #6: 0x000000000404c909 signal.test`runtime.pthread_mutex_lock + 57
    frame #7: 0x000000000402b93a signal.test`runtime.semawakeup + 42
    frame #8: 0x000000000400b959 signal.test`runtime.notewakeup + 89
    frame #9: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #10: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #11: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #12: 0x000000000405f263 signal.test`runtime.sigtramp + 51
  * frame #13: 0x000000000402b93a signal.test`runtime.semawakeup + 42
    frame #14: 0x000000000400b959 signal.test`runtime.notewakeup + 89
    frame #15: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #16: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #17: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #18: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #19: 0x000000000400b959 signal.test`runtime.notewakeup + 89
    frame #20: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #21: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #22: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #23: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #24: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #25: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #26: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #27: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #28: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #29: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #30: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #31: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #32: 0x000000000405f263 signal.test`runtime.sigtramp + 51

It looks like when sigsend calls notewakeup it is triggering a signal. Thus we never get out of sigsend and so sig.delivering really is always non-zero. Continuing to investigate.

@ianlancetaylor
Copy link
Member

This C program triggers a SIGSEGV in a signal handler while SIGSEGV is blocked. On normal Unix systems that causes the kernel to immediately kill the program: since the SIGSEGV cannot be delivered, nothing else can happen. On macOS, however, it causes the program to hang while burning CPU. So I think that explains the symptom: in some cases sigsend is triggering a SIGSEGV or some other synchronous signal, causing the program to hang until it is killed by cmd/go.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void die(const char *s) {
  perror(s);
  exit(EXIT_FAILURE);
}

char *nil;

static void handler(int sig, siginfo_t *info __attribute__((unused)), void *ctxt __attribute__((unused))) {
  printf("SP %#lx\n", (unsigned long)(__builtin_frame_address(0)));
  *nil = 0; /* trigger SIGSEGV */
}

static char *stack[SIGSTKSZ];

int main() {
  stack_t ss;
  struct sigaction sa;

  memset(&ss, 0, sizeof ss);
  ss.ss_sp = &stack[0];
  ss.ss_size = sizeof stack;
  if (sigaltstack(&ss, NULL) != 0)
    die("sigaltstack");

  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = handler;
  if (sigfillset(&sa.sa_mask) != 0)
    die("sigfillset");
  sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
  if (sigaction(SIGWINCH, &sa, NULL) != 0)
    die("sigaction");
  if (sigaction(SIGSEGV, &sa, NULL) != 0)
    die("sigaction");

  raise(SIGWINCH);

  exit(EXIT_SUCCESS);
}

@ianlancetaylor
Copy link
Member

Actually I now suspect the problem is simple enough. The TestStress test sends signals as fast as possible. Those signals are sent through sigsend. sigsend calls notewakeup, which calls semawakeup, which calls pthread_mutex_lock. In other words, we are calling pthread_mutex_lock from a signal handler. But we are calling the C function, not making a system call. The C function pthread_mutex_lock is not async-signal-safe. If the signal arrives while signal_recv is also locking or unlocking the same mutex, anything could happen.

@ianlancetaylor
Copy link
Member

So after looking through a bunch of docs and complaints, it appears that macOS primitives for communications between threads are limited.

The sem_post function is async-signal-safe, but there is no way to do a timed wait for a semaphore: macOS does not support sem_timedwait, although it is in IEEE Std 1003.1-2001.

I think the best bet is the API defined in <mach/semaphore.h>, which is more or less what we used to use before converting to libc. That API appears to be available from C, but is sparsely documented. This would mean calling functions semaphore_create, semaphore_signal, semaphore_wait, semaphore_timedwait, semaphore_destroy, and mach_task_self. This assumes that semaphore_signal is async-signal-safe; it is not documented as such, but according to the Intertubes sem_post calls it, so it must be.

@randall77 @eliasnaur Do you see any difficulties with using those functions on macOS and iOS?

@eliasnaur
Copy link
Contributor

Thanks for working on this and for taking iOS limitations into account.

https://developer.apple.com/documentation/kernel/1538205-semaphore_create?language=objc only lists macOS, which means that while semaphore_create and friends might exist and work on iOS, Go programs could end up being rejected by the App Store checks. See #31628 for a recent example with ptrace.

Is the dispatch variant of the semaphore API sufficient? It is listed as supported on all Apple platforms.

@ianlancetaylor
Copy link
Member

Well, that sucks.

What we need is a way to do a timed wait for some condition to occur: wait for up to N nanoseconds for the condition. Right now we we use a condition variable, and use pthread_cond_timedwait_relative_np for the timed wait, but that doesn't work because we have no way to signal a condition variable in a signal handler.

@ianlancetaylor
Copy link
Member

My next thought is to add a new kind of synchronization construct to the runtime: a signal safe note. On all platforms other than Darwin, we would implement that as a normal note. On Darwin we would call sem_open (it's a real pain that Darwin is also missing sem_init) and call sem_post from the signal handler. This could work because we don't need to do a timed wait on the signal safe note.

@eliasnaur
Copy link
Contributor

Sorry to push the point if it was completely off: did you see my question about the "dispatch" semaphore variants, such as https://developer.apple.com/documentation/dispatch/1452919-dispatch_semaphore_signal?language=objc?

@ianlancetaylor
Copy link
Member

I saw the link but I didn't understand it, and intended to reply to it with my comment about a timed wait. But looking deeper I see that there is a timed wait there. Thanks.

What I'm unsure about now is that that page says these are available from Objective C and Swift. But I do see <dispatch/dispatch.h> which looks like C code. I will give it a try.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/182258 mentions this issue: runtime: use dispatch semaphores on Darwin

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/182880 mentions this issue: Revert "runtime: use dispatch semaphores on Darwin"

@ianlancetaylor
Copy link
Member

Reopening because rollback was committed.

gopherbot pushed a commit that referenced this issue Jun 21, 2019
This reverts https://golang.org/cl/182258.

The new code caused unpredictable crashes that are not understood.  The old code was occasionally flaky but still better than this approach.

Fixes #32655
Updates #31264

Change-Id: I2e9d27d6052e84bf75106d8b844549ba4f571695
Reviewed-on: https://go-review.googlesource.com/c/go/+/182880
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Katie Hockman <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
@katiehockman
Copy link
Contributor

As an update, the darwin-amd64-10_11 build is now seeing flaky failures again: https://build.golang.org/log/3728af36f8bbb9c6d0341afa90b8afbc1138cf3a

@MichaelTJones
Copy link
Contributor

Update on the test failures here at home. As it happens, the "lingering death" that causes the tests to timeout with various issues actually happens after or better said, in the wake of, running signal_test. The run by hand, -v output is:

lum:signal mtj$ go test -v
=== RUN   TestSignal
--- PASS: TestSignal (0.00s)
    signal_test.go:44: sighup...
    signal_test.go:53: sigwinch...
    signal_test.go:60: sighup...
    signal_test.go:63: sighup...
=== RUN   TestStress
--- PASS: TestStress (3.01s)
=== RUN   TestReset
signal: killed

The PASS in TestStress does indeed pass in all cases. Changing the number of CPUs in TestStress to one makes the test pass 3 out of 4 times. Changing various waits and delays does nothing. After TestStress in cases where it triggers the issue, the latent issue puts the next test into the unkillable by ^C mode where it does nothing but vegetate and consume processor time.

I read and understood Ian's deeper causal commentary, but wanted to explore ways to provoke the bug less aggressively as an alternative to preventing the test on darwin. I surrender on that, the test is now skipped in my tree.

@MichaelTJones
Copy link
Contributor

FYI, in os_test.go, in PipeThreads, changing the "but can you do it with just half the threads?" test to a more generous 2/3 the threads, works fine on many core...

defer debug.SetMaxThreads(debug.SetMaxThreads((2*threads) / 3))

Line 2278

@gertcuykens

This comment has been minimized.

@ianlancetaylor
Copy link
Member

@gertcuykens Thanks. I think we understand exactly what is happening, and we don't need further examples. What we need is some idea of how to fix it. iOS doesn't seem to have the facility we need.

@eliasnaur
Copy link
Contributor

I'm not enrolled in the Apple Developer program to be sure, but a good test of whether symbols are available for iOS is to check whether they're available in the Objective C headers. I created a single page iOS project in Xcode and modified main.m:

#import <UIKit/UIKit.h>
#import <mach/semaphore.h>
#import <mach/task.h>
#import <mach/mach.h>


/*#include <sys/types.h>
#include <sys/ptrace.h>
*/


#import "AppDelegate.h"

int main(int argc, char * argv[]) {
    semaphore_t sem;
    task_t me = mach_task_self();
    semaphore_create(me, &sem, 0, 0);
    semaphore_signal(sem);
    semaphore_wait(sem);
    mach_timespec_t ts;
    semaphore_timedwait(sem, ts);
    semaphore_destroy(me, sem);
    
    
    
     //    ptrace(0, 0, 0, 0);
    @autoreleasepool {
        return UIApplicationMain(argc, argv, nil, NSStringFromClass([AppDelegate class]));
    }
}

Which compiled without errors.

As you can tell, I also tried to use a symbol we know is not available: ptrace. As expected, the ptrace symbol is not defined in ptrace.h for iOS and the build fails.

I think we can go ahead and use your suggested C functions, Ian. In the meantime, perhaps someone with Apple Developer access can make absolutely sure the functions pass the App Store tests. @steeve, @sanderpick?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/184169 mentions this issue: runtime: use a pipe to wake up signal_recv on Darwin

@ianlancetaylor
Copy link
Member

I went with a different approach that should always work: use a special purpose note, based on a pipe, for signal handling. @eliasnaur It would be great if you could check that CL 184169 works on arm and arm64. Thanks. I verified that it seems to work on amd64, using gomote.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/184377 mentions this issue: runtime: use correct register in darwin/386 pipe_trampoline

gopherbot pushed a commit that referenced this issue Jul 1, 2019
Updates #31264

Change-Id: I745744dd3fdaa432d70e8dc9336547017bac89ee
Reviewed-on: https://go-review.googlesource.com/c/go/+/184377
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Elias Naur <[email protected]>
@golang golang locked and limited conversation to collaborators Jun 29, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

9 participants