-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os/signal: test flaky on darwin #31264
Comments
CC @ianlancetaylor for |
In the stack trace two goroutines are stuck in this loop in for atomic.Load(&sig.delivering) != 0 {
Gosched()
}
So this seems to be impossible. The failure presumably has something to do with the race detector, but I don't know why it would be Darwin specific. |
@dvyukov for any ideas. |
Some thoughts:
|
Not just race, it turns out. @andybons notes that it's also here: https://build.golang.org/log/65130fd6cf64710896cae0a569087f8fa8df66fc ( |
And that last one is on the Go 1.12 release branch. @randall77, @ianlancetaylor, should this block the Go 1.12.2 release? |
It's hard to believe that this is new in 1.12.2 compared to 1.12, so I don't see a reason to block the 1.12.2 release. |
Just saw this on my pre-Mojave Mac as well. It's too bad the stack trace does not include any signal-handling thread stacks. It looks for all the world like sig.delivering is non-zero and it would be very nice to see where exactly the signal-handling thread is.
|
One on
|
|
This has been happening for a while. The first occurrence seems to be on June 18, 2018: https://build.golang.org/log/d464d075b04b100b96b7e9b7614755c02efd9dfa The first occurrence that is not on the race builder is from July 13, 2018, but is on 386: https://build.golang.org/log/5a14c8368922d5d3c4a19e4e7aff37111d916bd3 The first occurrence on amd64, not on the race builder, is July 18, 2018: https://build.golang.org/log/0666f3b1dfa397938a4e93fe89afb97bdd2769e7 That is an interesting one, because it shows https://build.golang.org/log/924b174c273ad0f136081ef3d1d9751bada8253b That shows the most common pattern: a goroutine started by June 2018, when the problems started happening, is when the runtime package was changing to move some operations from direct system calls to library calls. E.g., https://golang.org/cl/118819, https://golang.org/cl/118736. Notably https://golang.org/cl/117176 was committed on June 12, 2018: "runtime: use libc for signal functions on iOS". |
Actually a possibly more relevant CL is https://golang.org/cl/116875, committed June 12, 2018: "runtime: use libc's signal functions on Darwin". |
Through sheer luck I triggered the problem on a gomote. The process appears to be burning CPU. I attached to it using lldb and got this stack trace:
It looks like when |
This C program triggers a #include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static void die(const char *s) {
perror(s);
exit(EXIT_FAILURE);
}
char *nil;
static void handler(int sig, siginfo_t *info __attribute__((unused)), void *ctxt __attribute__((unused))) {
printf("SP %#lx\n", (unsigned long)(__builtin_frame_address(0)));
*nil = 0; /* trigger SIGSEGV */
}
static char *stack[SIGSTKSZ];
int main() {
stack_t ss;
struct sigaction sa;
memset(&ss, 0, sizeof ss);
ss.ss_sp = &stack[0];
ss.ss_size = sizeof stack;
if (sigaltstack(&ss, NULL) != 0)
die("sigaltstack");
memset(&sa, 0, sizeof sa);
sa.sa_sigaction = handler;
if (sigfillset(&sa.sa_mask) != 0)
die("sigfillset");
sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
if (sigaction(SIGWINCH, &sa, NULL) != 0)
die("sigaction");
if (sigaction(SIGSEGV, &sa, NULL) != 0)
die("sigaction");
raise(SIGWINCH);
exit(EXIT_SUCCESS);
} |
Actually I now suspect the problem is simple enough. The |
So after looking through a bunch of docs and complaints, it appears that macOS primitives for communications between threads are limited. The I think the best bet is the API defined in @randall77 @eliasnaur Do you see any difficulties with using those functions on macOS and iOS? |
Thanks for working on this and for taking iOS limitations into account. https://developer.apple.com/documentation/kernel/1538205-semaphore_create?language=objc only lists macOS, which means that while Is the dispatch variant of the semaphore API sufficient? It is listed as supported on all Apple platforms. |
Well, that sucks. What we need is a way to do a timed wait for some condition to occur: wait for up to N nanoseconds for the condition. Right now we we use a condition variable, and use |
My next thought is to add a new kind of synchronization construct to the runtime: a signal safe note. On all platforms other than Darwin, we would implement that as a normal note. On Darwin we would call |
Sorry to push the point if it was completely off: did you see my question about the "dispatch" semaphore variants, such as https://developer.apple.com/documentation/dispatch/1452919-dispatch_semaphore_signal?language=objc? |
I saw the link but I didn't understand it, and intended to reply to it with my comment about a timed wait. But looking deeper I see that there is a timed wait there. Thanks. What I'm unsure about now is that that page says these are available from Objective C and Swift. But I do see <dispatch/dispatch.h> which looks like C code. I will give it a try. |
Change https://golang.org/cl/182258 mentions this issue: |
Change https://golang.org/cl/182880 mentions this issue: |
Reopening because rollback was committed. |
This reverts https://golang.org/cl/182258. The new code caused unpredictable crashes that are not understood. The old code was occasionally flaky but still better than this approach. Fixes #32655 Updates #31264 Change-Id: I2e9d27d6052e84bf75106d8b844549ba4f571695 Reviewed-on: https://go-review.googlesource.com/c/go/+/182880 Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Katie Hockman <[email protected]> Reviewed-by: Keith Randall <[email protected]>
As an update, the darwin-amd64-10_11 build is now seeing flaky failures again: https://build.golang.org/log/3728af36f8bbb9c6d0341afa90b8afbc1138cf3a |
Update on the test failures here at home. As it happens, the "lingering death" that causes the tests to timeout with various issues actually happens after or better said, in the wake of, running signal_test. The run by hand, -v output is:
The PASS in TestStress does indeed pass in all cases. Changing the number of CPUs in TestStress to one makes the test pass 3 out of 4 times. Changing various waits and delays does nothing. After TestStress in cases where it triggers the issue, the latent issue puts the next test into the unkillable by ^C mode where it does nothing but vegetate and consume processor time. I read and understood Ian's deeper causal commentary, but wanted to explore ways to provoke the bug less aggressively as an alternative to preventing the test on darwin. I surrender on that, the test is now skipped in my tree. |
FYI, in os_test.go, in PipeThreads, changing the "but can you do it with just half the threads?" test to a more generous 2/3 the threads, works fine on many core... defer debug.SetMaxThreads(debug.SetMaxThreads((2*threads) / 3)) Line 2278 |
This comment has been minimized.
This comment has been minimized.
@gertcuykens Thanks. I think we understand exactly what is happening, and we don't need further examples. What we need is some idea of how to fix it. iOS doesn't seem to have the facility we need. |
I'm not enrolled in the Apple Developer program to be sure, but a good test of whether symbols are available for iOS is to check whether they're available in the Objective C headers. I created a single page iOS project in Xcode and modified main.m:
Which compiled without errors. As you can tell, I also tried to use a symbol we know is not available: ptrace. As expected, the ptrace symbol is not defined in ptrace.h for iOS and the build fails. I think we can go ahead and use your suggested C functions, Ian. In the meantime, perhaps someone with Apple Developer access can make absolutely sure the functions pass the App Store tests. @steeve, @sanderpick? |
Change https://golang.org/cl/184169 mentions this issue: |
I went with a different approach that should always work: use a special purpose note, based on a pipe, for signal handling. @eliasnaur It would be great if you could check that CL 184169 works on arm and arm64. Thanks. I verified that it seems to work on amd64, using gomote. |
Change https://golang.org/cl/184377 mentions this issue: |
Updates #31264 Change-Id: I745744dd3fdaa432d70e8dc9336547017bac89ee Reviewed-on: https://go-review.googlesource.com/c/go/+/184377 Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Elias Naur <[email protected]>
The
os/signal
test seems to time out fairly frequently on the darwin-amd64-race builder.It's not obvious to me whether the test is deadlocked or just slow.
Example: https://build.golang.org/log/7c20aa477d782f38c6184e7f8bce05c9648b57b3
The text was updated successfully, but these errors were encountered: