x/build: windows-amd64-longtest builder not keeping up with commits #52591
Comments
I generally triage build failures from the past day when I get in each morning, so a multi-day backlog means that failures in these builds may be missed during triage.
They're almost certainly timing out. I looked at a random ongoing run:
I thought the VM would be shut down after 30 minutes, so I'm not sure why it's still running. But something seems to be wrong with runtime:cpu124: it appears in many ongoing logs.
For completeness, here's a run I was watching that died:
builder: windows-amd64-longtest
rev: e845750744b648b8b348bbcebe2ff85d4e6247c5
buildlet: http://10.128.0.15 GCE VM: buildlet-windows-amd64-2016-big-rna1ddcc0
started: 2022-04-27 13:51:10.786923276 +0000 UTC m=+79337.510111514
status: still running
https://go.dev/cl/361212 looks a little suspicious in that range? cc @aclements
(In that case, #42699 may be related.)
It appears that this masked several of the early failures for #52800.
(And, FWIW, these builders still aren't keeping up. 😩)
In a recent run, The builder host type doesn't have a custom
Why doesn't https://cs.opensource.google/go/x/build/+/master:buildlet/gce.go;l=54;drc=7cdbd32ba0ddd9aa868c5a9a45ad60aaca687312 kick in? The coordinator is so confusing.
That kicks in only when
https://go.dev/cl/361212 would definitely slow down the longtest builders a bit. Do we have any data on how long these builds have taken over time that would tell us if we were close and that just pushed us over the line?
The current 45-minute limit is a hard cutoff (the VM is forcibly destroyed at that time), so until we increase it, the only information available is how far the build gets within those 45 minutes. (It should be possible to find out how long the builds took some weeks or months ago, when they were still completing, but I don't have a query handy.) I suspect the 45-minute limit was meant as a safety cutoff with plenty of headroom, not as a way to prevent legitimate builds from completing. Slow builds can be a problem, but stopping them before they complete and retrying without limit (#42699) isn't a good solution, not to mention that longtest post-submit builders by definition have the least need for short completion times (compared to TryBots and non-long builders). We probably didn't have as many long tests in 2015, when "45 mins" was chosen, as we do now. Maybe not even longtest builders? Yes, the first longtest builder came to be in 2018! Similarly to how CL 167638 bumped their timeout scale, I think we should increase the VM timeout for builders whose
Change https://go.dev/cl/406216 mentions this issue:
CL 406216 increased the timeout to 2 hours, and longtest builds are completing on first try or so. Compare before and after. One of the
If we want to find agreement on timing goals for longtest post-submit build times, that should be a new issue. For reference, the current goal for trybot (pre-submit) builds is tracked in the subject of issue #17104, and the task of maintaining the hard timeout for builds is now issue #52929.
The windows-amd64-longtest builder is currently backlogged by almost two days: there are builds still running back to at least CL 361212, merged on the afternoon of Apr. 25:
My understanding is that the windows-amd64-longtest builder runs on GCE, and thus should not be hardware-limited. I also don't see anything on https://farmer.golang.org/ that would explain the backlog.
@golang/release: could you look into this backlog? I wonder if this indicates a problem in the coordinator.