-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/build,cmd/go: frequent connection timeouts to github.com since 2022-03-04 #52545
Comments
Not that I'm aware of. |
As with the other network-related failures, this failure mode is much more common on the
|
|
|
|
This is causing a huge amount of noise in the At this failure rate, I suspect that if this issue affected end users we'd be hearing about it by now. So my hypothesis is that this is some kind of GCE networking issue that affects the builder hosts. Marking as release-blocker for Go 1.19. We need to figure out what's happening with the network and either fix it (if it's a GCE or builder problem) or figure out a workaround (if it's a GitHub problem). (Note that most of these failures come from the |
When these timeouts occur, the tests are consistently at 120s or longer of running time. |
Here's a theory, inspired by https://askubuntu.com/questions/218728/connection-timeout-when-accessing-github. Maybe this timeout is some kind of rate-limiting countermeasure on the GitHub side, spurred by bursts of traffic from the Go builders running more SlowBots and post-commit CLs. Maybe the 5-month gap from October 2021 to March 2022 is an artifact of the Go development cycle: the Go release freeze started 2021-11-01 and ended 2022-02-28. The last failure before the gap was a couple of weeks before the freeze began, and the first failure after the gap was about a week after it ended. During the freeze, the CL rate is naturally much lower, which might keep the builders below whatever threshold GitHub is using for rate-limiting. |
Hi @theojulienne, do you know who might be able to cast an eye on this from the GitHub side? |
I filed a GitHub support ticket (#1616480), but their conclusion was that it's probably a network issue on the GCE side. Unfortunately, the GCE network architecture is pretty opaque — I think we need to escalate this on the GCE side to at least get enough routing information to identify where along the route the connection is being dropped. |
Change https://go.dev/cl/405714 mentions this issue: |
Go tests don't include timestamps by default, but we would like to have them in order to correlate builder failures with server and network logs. Since many of the Go tests with external network and service dependencies are script tests for the 'go' command, logging timestamps here adds a lot of logging value with one simple and very low-risk change. For #50541. For #52490. For #52545. For #52851. Change-Id: If3fa86deb4a216ec6a1abc4e6f4ee9b05030a729 Reviewed-on: https://go-review.googlesource.com/c/go/+/405714 Reviewed-by: Dmitri Shuralyov <[email protected]> Auto-Submit: Bryan Mills <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Bryan Mills <[email protected]>
The failures starting 2022-05-11 have timestamps, so we (at least theoretically) now have something we can correlate with network and server logs. |
2022-05-19T15:38:08-e23cc08/linux-386-longtest is an interesting sample. It includes two timeouts connecting to (The tag-fetching error is not properly detected by |
We can work on escalating this through GCE support, but I wonder if we should be retrying transient errors (like connection timeouts) to third-party services. Should we do that as a workaround until we can figure out the networking issue? In the meantime, we may also want to set up a recurring prober app to do similar fetches to GitHub to see if we can reproduce the issue at the same time our builds fail. I also read through the response from GitHub and I am not sure that there's anything definitive that its an issue on either side. I wish the error that occurred to cloud.google.com had more of the output. It's hard for me to tell if that's a network issue or something else going on, as STDOUT is extremely truncated in that failure. |
In this case the connection timeouts are on the order of 120s, and I don't see a way to configure them to be shorter (I've looked but haven't found anything). So even a single retry would be a huge latency hit for running the test, and a few retries in a row (since several of the tests do several sequential (If someone knows of a way to shorten the |
Hmm, but as a diagnostic I wonder if |
There have been gaps in these failures before (see, for example, the 5-month gap in the original post), so I think these correlate in some way with load. That said, this may have been fixed (or at least mitigated) by the reduction in outbound connections in #27494, so I agree that it can be closed until/unless we start seeing it again. 👍 |
Found new dashboard test flakes for:
2022-12-06 20:53 linux-amd64-longtest-race go@89871dd8 cmd/go.TestScript (log)
|
I've been able to reproduce this locally on a VM behind a Google Cloud NAT with a limit of 128 ports. The root cause is a combination of having a low limit on the number of (statically-allocated or semi-static) NAT ports available to the host, and the tests making a lot of Empirically, the socket for the Perhaps something about the interaction between the VM and the NAT service causes that not to be the case here, or perhaps the Cloud NAT service refuses to reuse ports just in case the other host doesn't support this optional behavior. Either way, the NAT service ends up with all of the available ports in the host's static range in the The general solution is to increase the number of ports available to the host, either by expanding the per-host static port range (as Heschi did) or by using dynamic NAT port assignments instead. I suspect it could also be remedied by setting up a connection-pooling HTTPS proxy on the VM. And, of course, we could reduce the number of times we connect to the same external hosts in tests. 😅 |
FWIW, the most recent failure that
I suspect that is from a server crash on the GitHub side rather than port exhaustion. |
Change https://go.dev/cl/473277 mentions this issue: |
Change https://go.dev/cl/473276 mentions this issue: |
Found new dashboard test flakes for:
2023-04-12 20:54 linux-arm64-longtest go@f58c6ccc cmd/go.TestScript (log)
|
Although we aren't precise about enforcing the hosts just yet, we can eventually use the declared hostnames to selectively skip tests (for example, if an external service has an outage while a Go release is being tested). Also relax the constraint to [short] in tests that require only vcs-test.golang.org, which has redirected to an in-process server since around CL 427914. Also enforce that tests that use the network actually use the [net] constraint, by setting TESTGONETWORK=panic in the test environment until the condition is evaluated. For #52545. For #54503. Updates #27494. Change-Id: I13be6b42a9beee97657eb45424882e787ac164c3 Reviewed-on: https://go-review.googlesource.com/c/go/+/473276 Run-TryBot: Bryan Mills <[email protected]> TryBot-Bypass: Bryan Mills <[email protected]> Reviewed-by: Russ Cox <[email protected]> Auto-Submit: Bryan Mills <[email protected]>
Change https://go.dev/cl/488236 mentions this issue: |
Invoking 'git' adds about 200ms to this test on a fast machine, probably more on a slow one. (As a general habit we skip the 'git' tests uniformly in short mode.) For #52545. Change-Id: Iea6d86a8c9c8b0f1fe51888faf7f5fe7dd8f1eb3 Reviewed-on: https://go-review.googlesource.com/c/go/+/488236 Auto-Submit: Bryan Mills <[email protected]> Reviewed-by: Than McIntosh <[email protected]> Run-TryBot: Bryan Mills <[email protected]> TryBot-Result: Gopher Robot <[email protected]>
#52545 (comment) is a new and different failure mode. |
…ctions I implemented this in order to debug connection failures on a new-to-me VM development environment that uses Cloud NAT. It doesn't directly fix the bug, but perhaps folks will find it useful to diagnose port-exhaustion-related flakiness in other environments. For #52545. Change-Id: Icd3f13dcf62e718560c4f4a965a4df7c1bd785ce Reviewed-on: https://go-review.googlesource.com/c/go/+/473277 Run-TryBot: Bryan Mills <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Matloob <[email protected]> Auto-Submit: Bryan Mills <[email protected]>
greplogs --dashboard -md -l -e 'github.com.* Connection timed out' --since=2021-01-01
2022-04-22T19:02:29-808d40d/linux-386-longtest
2022-04-22T15:44:57-1899472/linux-386-longtest
2022-04-22T04:42:23-1e59876/linux-386-longtest
2022-04-22T00:25:08-c9031a4/linux-amd64-longtest
2022-04-22T00:05:54-c510cd9/linux-386-longtest
2022-04-22T00:05:54-c510cd9/linux-amd64-longtest
2022-04-19T15:07:49-caa4631/linux-amd64-longtest
2022-04-15T19:02:54-df08c9a/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-amd64-longtest
2022-04-13T17:48:12-517781b/linux-386-longtest
2022-04-13T01:15:22-b55a2fb/linux-386-longtest
2022-04-12T22:35:20-fc1d4c1/linux-amd64-longtest
2022-04-12T22:32:01-6183920/linux-386-longtest
2022-04-12T05:46:57-2b31abc/linux-amd64-longtest
2022-04-11T16:31:45-e4e033a/linux-386-longtest
2022-04-11T16:31:43-036b615/linux-amd64-longtest
2022-04-11T16:31:33-494b79f/linux-amd64-longtest
2022-04-08T18:30:53-3a19102/linux-386-longtest
2022-03-15T17:04:57-9b112ce/linux-amd64-longtest
2022-03-07T13:47:51-0e2f1ab/linux-386-longtest
2022-03-04T20:02:41-2b8aa2b/linux-386-longtest
[Note the 5-month gap here! 🤔]
2021-10-08T14:08:12-5b9206f/linux-386-longtest
2021-10-08T14:08:12-5b9206f/linux-amd64-longtest
2021-09-07T21:39:06-dcf3545/linux-386-longtest
2021-09-07T20:37:05-8078355/linux-386-longtest
2021-09-07T20:27:30-23f4f0d/linux-386-longtest
2021-09-07T19:39:04-d92101f/linux-386-longtest
2021-06-22T02:44:43-197a5ee/linux-amd64-longtest
2021-04-02T05:24:14-aebc0b4/linux-amd64-longtest
2021-03-18T14:43:33-e726e2a/linux-amd64-longtest
2021-03-17T17:50:50-8628bf9/linux-amd64-longtest
2021-03-17T17:13:50-0bd308f/linux-amd64-longtest
2021-03-17T16:53:00-70d54df/linux-386-longtest
2021-03-17T16:53:00-70d54df/linux-amd64-longtest
2021-03-17T16:19:21-2f3db22/linux-386-longtest
2021-03-05T18:46:36-51d8d35/linux-amd64-longtest
@golang/release: were there changes to the
linux-.*-longtest
image or network configuration around 2022-03-04 that might explain the increase in timeouts at that point?The text was updated successfully, but these errors were encountered: