Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build,cmd/go: frequent connection timeouts to github.com since 2022-03-04 #52545

Closed
bcmills opened this issue Apr 25, 2022 · 37 comments
Closed
Assignees
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge GoCommand cmd/go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Apr 25, 2022

#!watchflakes
post <- pkg ~ `cmd/go` && `github.com.*(Connection timed out|TLS connection)` && date < "2023-07-01"
# cd $WORK/tmp/d2/src/github.com/myitcv; git clone -- https://github.com/myitcv/vgo_example_compat $WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat
            Cloning into '$WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat'...
            fatal: unable to access 'https://github.com/myitcv/vgo_example_compat/': Failed to connect to github.com port 443: Connection timed out

greplogs --dashboard -md -l -e 'github.com.* Connection timed out' --since=2021-01-01

2022-04-22T19:02:29-808d40d/linux-386-longtest
2022-04-22T15:44:57-1899472/linux-386-longtest
2022-04-22T04:42:23-1e59876/linux-386-longtest
2022-04-22T00:25:08-c9031a4/linux-amd64-longtest
2022-04-22T00:05:54-c510cd9/linux-386-longtest
2022-04-22T00:05:54-c510cd9/linux-amd64-longtest
2022-04-19T15:07:49-caa4631/linux-amd64-longtest
2022-04-15T19:02:54-df08c9a/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-amd64-longtest
2022-04-13T17:48:12-517781b/linux-386-longtest
2022-04-13T01:15:22-b55a2fb/linux-386-longtest
2022-04-12T22:35:20-fc1d4c1/linux-amd64-longtest
2022-04-12T22:32:01-6183920/linux-386-longtest
2022-04-12T05:46:57-2b31abc/linux-amd64-longtest
2022-04-11T16:31:45-e4e033a/linux-386-longtest
2022-04-11T16:31:43-036b615/linux-amd64-longtest
2022-04-11T16:31:33-494b79f/linux-amd64-longtest
2022-04-08T18:30:53-3a19102/linux-386-longtest
2022-03-15T17:04:57-9b112ce/linux-amd64-longtest
2022-03-07T13:47:51-0e2f1ab/linux-386-longtest
2022-03-04T20:02:41-2b8aa2b/linux-386-longtest
[Note the 5-month gap here! 🤔]
2021-10-08T14:08:12-5b9206f/linux-386-longtest
2021-10-08T14:08:12-5b9206f/linux-amd64-longtest
2021-09-07T21:39:06-dcf3545/linux-386-longtest
2021-09-07T20:37:05-8078355/linux-386-longtest
2021-09-07T20:27:30-23f4f0d/linux-386-longtest
2021-09-07T19:39:04-d92101f/linux-386-longtest
2021-06-22T02:44:43-197a5ee/linux-amd64-longtest
2021-04-02T05:24:14-aebc0b4/linux-amd64-longtest
2021-03-18T14:43:33-e726e2a/linux-amd64-longtest
2021-03-17T17:50:50-8628bf9/linux-amd64-longtest
2021-03-17T17:13:50-0bd308f/linux-amd64-longtest
2021-03-17T16:53:00-70d54df/linux-386-longtest
2021-03-17T16:53:00-70d54df/linux-amd64-longtest
2021-03-17T16:19:21-2f3db22/linux-386-longtest
2021-03-05T18:46:36-51d8d35/linux-amd64-longtest

@golang/release: were there changes to the linux-.*-longtest image or network configuration around 2022-03-04 that might explain the increase in timeouts at that point?

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Apr 25, 2022
@gopherbot gopherbot added this to the Unreleased milestone Apr 25, 2022
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 25, 2022
@heschi
Copy link
Contributor

heschi commented Apr 26, 2022

Not that I'm aware of.

@bcmills
Copy link
Contributor Author

bcmills commented May 4, 2022

@bcmills
Copy link
Contributor Author

bcmills commented May 5, 2022

@bcmills
Copy link
Contributor Author

bcmills commented May 6, 2022

@bcmills bcmills modified the milestones: Unreleased, Go1.19 May 6, 2022
@bcmills
Copy link
Contributor Author

bcmills commented May 6, 2022

This is causing a huge amount of noise in the longtest builders, and affects the most comprehensive builders for two first-class ports (linux/386 and linux/amd64).

At this failure rate, I suspect that if this issue affected end users we'd be hearing about it by now. So my hypothesis is that this is some kind of GCE networking issue that affects the builder hosts.

Marking as release-blocker for Go 1.19. We need to figure out what's happening with the network and either fix it (if it's a GCE or builder problem) or figure out a workaround (if it's a GitHub problem).

(Note that most of these failures come from the git binary itself; #46693 may explain why we're not seeing this on the Windows builders.)

@bcmills
Copy link
Contributor Author

bcmills commented May 6, 2022

When these timeouts occur, the tests are consistently at 120s or longer of running time.
That suggests that raising the connection timeout would not be helpful.

@bcmills
Copy link
Contributor Author

bcmills commented May 6, 2022

Here's a theory, inspired by https://askubuntu.com/questions/218728/connection-timeout-when-accessing-github.

Maybe this timeout is some kind of rate-limiting countermeasure on the GitHub side, spurred by bursts of traffic from the Go builders running more SlowBots and post-commit CLs.

Maybe the 5-month gap from October 2021 to March 2022 is an artifact of the Go development cycle: the Go release freeze started 2021-11-01 and ended 2022-02-28. The last failure before the gap was a couple of weeks before the freeze began, and the first failure after the gap was about a week after it ended. During the freeze, the CL rate is naturally much lower, which might keep the builders below whatever threshold GitHub is using for rate-limiting.

@adonovan
Copy link
Member

adonovan commented May 6, 2022

Hi @theojulienne, do you know who might be able to cast an eye on this from the GitHub side?

@bcmills
Copy link
Contributor Author

bcmills commented May 11, 2022

I filed a GitHub support ticket (#1616480), but their conclusion was that it's probably a network issue on the GCE side.

Unfortunately, the GCE network architecture is pretty opaque — traceroute doesn't even work properly (https://cloud.google.com/compute/docs/faq#networking). 😵

I think we need to escalate this on the GCE side to at least get enough routing information to identify where along the route the connection is being dropped.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/405714 mentions this issue: cmd/go: add timestamps to script test output

gopherbot pushed a commit that referenced this issue May 11, 2022
Go tests don't include timestamps by default, but we would like to
have them in order to correlate builder failures with server and
network logs.

Since many of the Go tests with external network and service
dependencies are script tests for the 'go' command, logging timestamps
here adds a lot of logging value with one simple and very low-risk
change.

For #50541.
For #52490.
For #52545.
For #52851.

Change-Id: If3fa86deb4a216ec6a1abc4e6f4ee9b05030a729
Reviewed-on: https://go-review.googlesource.com/c/go/+/405714
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
@dmitshur dmitshur added the GoCommand cmd/go label May 13, 2022
@bcmills
Copy link
Contributor Author

bcmills commented May 19, 2022

The failures starting 2022-05-11 have timestamps, so we (at least theoretically) now have something we can correlate with network and server logs.

@bcmills
Copy link
Contributor Author

bcmills commented May 23, 2022

2022-05-19T15:38:08-e23cc08/linux-386-longtest is an interesting sample. It includes two timeouts connecting to github.com, as well as what appears to be an error fetching tags for cloud.google.com/go (which presumably is not hosted on GitHub) at nearly the same time.

(The tag-fetching error is not properly detected by cmd/go, but it at least suggests that the network problem is not specific to GitHub.)

@toothrot
Copy link
Contributor

We can work on escalating this through GCE support, but I wonder if we should be retrying transient errors (like connection timeouts) to third-party services.

Should we do that as a workaround until we can figure out the networking issue?

In the meantime, we may also want to set up a recurring prober app to do similar fetches to GitHub to see if we can reproduce the issue at the same time our builds fail.

I also read through the response from GitHub and I am not sure that there's anything definitive that its an issue on either side. I wish the error that occurred to cloud.google.com had more of the output. It's hard for me to tell if that's a network issue or something else going on, as STDOUT is extremely truncated in that failure.

@bcmills
Copy link
Contributor Author

bcmills commented May 31, 2022

I wonder if we should be retrying transient errors (like connection timeouts) to third-party services.

In this case the connection timeouts are on the order of 120s, and I don't see a way to configure them to be shorter (I've looked but haven't found anything). So even a single retry would be a huge latency hit for running the test, and a few retries in a row (since several of the tests do several sequential git fetches) would quite likely run us over the default 10m timeout anyway.

(If someone knows of a way to shorten the git connection timeout I'd be glad to give that a try, but I tried the usual CURL_ environment variables and they don't appear to be plumbed through. 😞)

@bcmills
Copy link
Contributor Author

bcmills commented May 31, 2022

Hmm, but as a diagnostic I wonder if GIT_TRACE_CURL_NO_DATA and/or GIT_CURL_VERBOSE in the script environment would help? I'll give that a try, and we can see if it yields anything useful for debugging.

@bcmills bcmills self-assigned this May 31, 2022
Repository owner moved this from Planned to Done in Go Release Nov 8, 2022
Repository owner moved this from Active to Done in Test Flakes Nov 8, 2022
@heschi heschi moved this to Done in Release Blockers Nov 8, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Nov 9, 2022

There have been gaps in these failures before (see, for example, the 5-month gap in the original post), so I think these correlate in some way with load.

That said, this may have been fixed (or at least mitigated) by the reduction in outbound connections in #27494, so I agree that it can be closed until/unless we start seeing it again. 👍

@gopherbot gopherbot reopened this Dec 17, 2022
@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
post <- pkg ~ `cmd/go` && `github.com.*(Connection timed out|TLS connection)`
2022-12-06 20:53 linux-amd64-longtest-race go@89871dd8 cmd/go.TestScript (log)
vcs-test.golang.org rerouted to http://127.0.0.1:40159
https://vcs-test.golang.org rerouted to https://127.0.0.1:44293
go test proxy running at GOPROXY=http://127.0.0.1:35113/mod
2022/12/06 21:55:23 http: TLS handshake error from 127.0.0.1:56438: read tcp 127.0.0.1:44293->127.0.0.1:56438: read: connection reset by peer
2022/12/06 21:55:47 http: TLS handshake error from 127.0.0.1:53576: EOF
--- FAIL: TestScript (0.16s)
    --- FAIL: TestScript/get_legacy (15.51s)
        script_test.go:134: 2022-12-06T21:56:12Z
        script_test.go:136: $WORK=/workdir/tmp/cmd-go-test-935094743/tmpdir789022441/get_legacy2332480131
        script_test.go:154: 
...
            cannot find package "github.com/myitcv/vgo_example_compat/sub" in any of:
            	/workdir/go/src/github.com/myitcv/vgo_example_compat/sub (from $GOROOT)
            	/workdir/tmp/cmd-go-test-935094743/tmpdir789022441/get_legacy2332480131/tmp/d2/src/github.com/myitcv/vgo_example_compat/sub (from $GOPATH)
            cannot find package "github.com/myitcv/vgo_example_compat/v2" in any of:
            	/workdir/go/src/github.com/myitcv/vgo_example_compat/v2 (from $GOROOT)
            	/workdir/tmp/cmd-go-test-935094743/tmpdir789022441/get_legacy2332480131/tmp/d2/src/github.com/myitcv/vgo_example_compat/v2 (from $GOPATH)
            cannot find package "github.com/myitcv/vgo_example_compat/v2/sub" in any of:
            	/workdir/go/src/github.com/myitcv/vgo_example_compat/v2/sub (from $GOROOT)
            	/workdir/tmp/cmd-go-test-935094743/tmpdir789022441/get_legacy2332480131/tmp/d2/src/github.com/myitcv/vgo_example_compat/v2/sub (from $GOPATH)
        script_test.go:154: FAIL: get_legacy.txt:31: go get github.com/rsc/vgotest4 github.com/rsc/vgotest5: exit status 1

watchflakes

@seankhliao seankhliao removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 26, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Mar 3, 2023

I've been able to reproduce this locally on a VM behind a Google Cloud NAT with a limit of 128 ports.

The root cause is a combination of having a low limit on the number of (statically-allocated or semi-static) NAT ports available to the host, and the tests making a lot of git requests (which always establish a new connection) repeatedly to the same github.com host.

Empirically, the socket for the github.com server does not end up in the TIME-WAIT state on the local machine, so it must be in the TIME-WAIT state on the server side. RFC-1122 says that a host MAY accept a SYN for a socket in the TIME-WAIT state, but only if the initial sequence number is larger than the largest sequence number used for the previous connection.

Perhaps something about the interaction between the VM and the NAT service causes that not to be the case here, or perhaps the Cloud NAT service refuses to reuse ports just in case the other host doesn't support this optional behavior.

Either way, the NAT service ends up with all of the available ports in the host's static range in the TIME-WAIT state on the server side, and something (either the NAT service or the GitHub server itself) drops the connections until the TIME_WAIT interval expires.

The general solution is to increase the number of ports available to the host, either by expanding the per-host static port range (as Heschi did) or by using dynamic NAT port assignments instead.

I suspect it could also be remedied by setting up a connection-pooling HTTPS proxy on the VM.

And, of course, we could reduce the number of times we connect to the same external hosts in tests. 😅

@bcmills
Copy link
Contributor Author

bcmills commented Mar 3, 2023

FWIW, the most recent failure that watchflakes found has this error instead:

fatal: unable to access 'https://github.com/myitcv/vgo_example_compat/': gnutls_handshake() failed: The TLS connection was non-properly terminated.

I suspect that is from a server crash on the GitHub side rather than port exhaustion.

@bcmills bcmills closed this as completed Mar 3, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/473277 mentions this issue: cmd/go: add a GODEBUG to limit the number of concurrent network connections

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/473276 mentions this issue: cmd/go: declare net hosts in script tests

@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
post <- pkg ~ `cmd/go` && `github.com.*(Connection timed out|TLS connection)`
2023-04-12 20:54 linux-arm64-longtest go@f58c6ccc cmd/go.TestScript (log)
vcs-test.golang.org rerouted to http://127.0.0.1:43441
https://vcs-test.golang.org rerouted to https://127.0.0.1:32839
go test proxy running at GOPROXY=http://127.0.0.1:44693/mod
2023/04/12 21:15:46 http: TLS handshake error from 127.0.0.1:57932: EOF
2023/04/12 21:15:46 http: TLS handshake error from 127.0.0.1:58030: read tcp 127.0.0.1:32839->127.0.0.1:58030: read: connection reset by peer
--- FAIL: TestScript (0.07s)
    --- FAIL: TestScript/get_dot_slash_download (0.14s)
        script_test.go:134: 2023-04-12T21:15:57Z
        script_test.go:136: $WORK=/tmp/workdir/tmp/cmd-go-test-1333411493/tmpdir3489084715/get_dot_slash_download3030690374
        script_test.go:154: 
...
            21:15:57.229927 http.c:756              == Info: Connected to github.com (140.82.114.3) port 443 (#0)
            21:15:57.264466 http.c:756              == Info: found 387 certificates in /etc/ssl/certs
            21:15:57.264592 http.c:756              == Info: ALPN, offering h2
            21:15:57.264614 http.c:756              == Info: ALPN, offering http/1.1
            21:15:57.265054 http.c:756              == Info: gnutls_handshake() failed: The TLS connection was non-properly terminated.
            V��
            21:15:57.265100 http.c:756              == Info: Closing connection 0
            fatal: unable to access 'https://github.com/rsc/pprof_mac_fix/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
            package rsc.io/pprof_mac_fix: exit status 128
        script_test.go:154: FAIL: get_dot_slash_download.txt:10: go get ./pprof_mac_fix: exit status 1

watchflakes

@gopherbot gopherbot reopened this Apr 13, 2023
gopherbot pushed a commit that referenced this issue Apr 24, 2023
Although we aren't precise about enforcing the hosts just yet,
we can eventually use the declared hostnames to selectively skip
tests (for example, if an external service has an outage while
a Go release is being tested).

Also relax the constraint to [short] in tests that require only
vcs-test.golang.org, which has redirected to an in-process server
since around CL 427914.

Also enforce that tests that use the network actually use the [net]
constraint, by setting TESTGONETWORK=panic in the test environment
until the condition is evaluated.

For #52545.
For #54503.
Updates #27494.

Change-Id: I13be6b42a9beee97657eb45424882e787ac164c3
Reviewed-on: https://go-review.googlesource.com/c/go/+/473276
Run-TryBot: Bryan Mills <[email protected]>
TryBot-Bypass: Bryan Mills <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/488236 mentions this issue: cmd/go: skip the 'git' part of get_insecure_redirect in short mode

gopherbot pushed a commit that referenced this issue Apr 27, 2023
Invoking 'git' adds about 200ms to this test on a fast machine,
probably more on a slow one. (As a general habit we skip the
'git' tests uniformly in short mode.)

For #52545.

Change-Id: Iea6d86a8c9c8b0f1fe51888faf7f5fe7dd8f1eb3
Reviewed-on: https://go-review.googlesource.com/c/go/+/488236
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@bcmills
Copy link
Contributor Author

bcmills commented May 11, 2023

#52545 (comment) is a new and different failure mode.

@bcmills bcmills closed this as completed May 11, 2023
@bcmills bcmills removed this from Test Flakes May 11, 2023
gopherbot pushed a commit that referenced this issue May 19, 2023
…ctions

I implemented this in order to debug connection failures on a
new-to-me VM development environment that uses Cloud NAT. It doesn't
directly fix the bug, but perhaps folks will find it useful to
diagnose port-exhaustion-related flakiness in other environments.

For #52545.

Change-Id: Icd3f13dcf62e718560c4f4a965a4df7c1bd785ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/473277
Run-TryBot: Bryan Mills <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Matloob <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
@golang golang locked and limited conversation to collaborators Nov 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge GoCommand cmd/go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Archived in project
Status: Done
Development

No branches or pull requests

7 participants