
windows-amd64 build flakiness: "compile.exe: Access is denied." #241

Open
dagood opened this issue Oct 28, 2021 · 7 comments


dagood commented Oct 28, 2021

---- Running command: [cmd.exe /c make.bat]
Building Go cmd/dist using C:\Users\VssAdministrator\.go-stage-0\1.16.5\go
Building Go toolchain1 using C:\Users\VssAdministrator\.go-stage-0\1.16.5\go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
go install cmd/compile: copying C:\Users\VSSADM~1\AppData\Local\Temp\go-build3426272395\b098\exe\a.out.exe: open D:\a\1\s\pkg\tool\windows_amd64\compile.exe: Access is denied.
go tool dist: FAILED: D:\a\1\s\pkg\tool\windows_amd64\go_bootstrap install -gcflags=all= -ldflags=all= -a -i cmd/asm cmd/cgo cmd/compile cmd/link: exit status 1

All instances I've found so far are on 1.17 branches:

https://dev.azure.com/dnceng/internal/_build/results?buildId=1433378&view=logs&j=b2ef2c90-30c2-5075-badc-2feb75a64cee&t=b6826774-f899-5802-9e9c-ca3c12e8a4c0&l=38

https://dev.azure.com/dnceng/internal/_build/results?buildId=1442944&view=logs&j=5a2439d3-40cd-5cda-520e-11af88128fb3&t=a1b03d73-1679-5340-7eef-275cc3387d35&l=83

https://dev.azure.com/dnceng/internal/_build/results?buildId=1433379&view=logs&j=5a2439d3-40cd-5cda-520e-11af88128fb3&t=a1b03d73-1679-5340-7eef-275cc3387d35&l=38

The first thing that comes to mind is some virus scan locking it, potentially related to the recent build pool migrations.

dagood added the Flaky label on Oct 28, 2021

dagood commented Dec 2, 2021

actions/runner-images#4086 (comment) (Sept 15) mentions an "internal monitor" they disabled when something similar showed up for someone else:

We have disabled an internal monitor. Could you please rerun builds and check results?

Maybe the monitor was turned back on, or something similar was implemented independently. (Or, maybe this is all unrelated and it's just Defender like we first thought.)


dagood commented Dec 2, 2021

In my experience, Defender hits some kind of choke point when you move executable files around, and cmd/dist/build.go seems to do a lot of that in the multi-stage build, so that behavior probably aggravates any issues we'd normally have with Defender.

(Also in my experience, exclusion rules on Defender don't actually work with ATP (Advanced Threat Protection), so this is the kind of thing that's easy to think is solved because it's just a flaky problem but not actually have fixed.)


dagood commented Dec 2, 2021

Apparently only one person other than us has ever hit this (it's the only Google result for "compile.exe: Access is denied.")


dagood commented Dec 2, 2021

I started a thread with FR: https://teams.microsoft.com/l/message/19:[email protected]/1638482594255?tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47&groupId=4d73664c-9f2f-450d-82a5-c2f02756606d&parentMessageId=1638482594255&teamName=.NET%20Core%20Eng%20Services%20Partners&channelName=First%20Responders&createdTime=1638482594255

Basically, there's a script used in hosted agent image generation that's intended to turn off Defender: https://github.com/actions/virtual-environments/blob/main/images/win/scripts/Installers/Configure-Antivirus.ps1. But in practice Defender never seems to be fully disabled.

This also could be caused by some Azure-specific scanning, not Defender.

I've hit this kind of thing before, and the only reasonable way forward was to retry:

We can probably add some Windows-specific retry logic in our build to mitigate this flakiness.


To make the retries quick and not delay the build much, I believe we could add them to the individual copy call:

go/src/cmd/dist/build.go

Lines 1053 to 1059 in 677e561

// copy copies the file src to dst, via memory (so only good for small files).
func copyfile(dst, src string, flag int) {
	if vflag > 1 {
		errprintf("cp %s %s\n", src, dst)
	}
	writefile(readfile(src), dst, flag)
}

But ideally we should avoid this--it seems to me like it would be hard to justify upstreaming such a niche mitigation. If there were more reports than golang/go#21608 and it were easier to point at this as the root cause, maybe this would be better.


We can add retries to the entire build command:

buildCommandLine := append(shellPrefix, "make"+scriptExtension)

This means that if we hit a retry, we would be redoing some build work and potentially making it harder to notice other Windows-specific flakiness problems that come up elsewhere in the build. But, this way we only modify our own code, so we don't have to be concerned about upstreaming.
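
As a rough sketch of that second option (again, purely illustrative: runBuild, the attempt count, and the sleep are assumptions, and the real orchestration lives in our own build tooling), the outer retry could look something like this:

// Illustrative only: retry the whole make.bat invocation a few times.
// The helper names and retry parameters are invented for this sketch.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// runBuild invokes the upstream build script once, streaming its output.
func runBuild(shellPrefix []string, scriptExtension string) error {
	buildCommandLine := append(shellPrefix, "make"+scriptExtension)
	cmd := exec.Command(buildCommandLine[0], buildCommandLine[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// On Windows this would be something like ["cmd.exe", "/c"] and ".bat".
	shellPrefix := []string{"cmd.exe", "/c"}
	const maxAttempts = 3
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = runBuild(shellPrefix, ".bat"); err == nil {
			return
		}
		fmt.Fprintf(os.Stderr, "build attempt %d failed: %v\n", attempt, err)
		if attempt < maxAttempts {
			// Give any scanner time to settle before trying again.
			time.Sleep(30 * time.Second)
		}
	}
	fmt.Fprintln(os.Stderr, "build failed after retries:", err)
	os.Exit(1)
}

In our actual build script the retry loop would live wherever buildCommandLine is invoked today; the point is just that the loop wraps the whole command rather than individual file operations.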


qmuntal commented Dec 3, 2021

I think we are having this issue more than before because it reproduces in go1.17 but not in main, and I'm doing the FIPS work in go1.17.

Evidence:

I'll try to find the commit that fixes the issue in case we can backport it, but I don't promise anything 😄


dagood commented Dec 3, 2021

Yeah, I suppose doing a (rough) binary search would be reasonable if it's not too much effort. Maybe we should also try a build just before the release branch fork (no VERSION file) and just after, in case the build machine just doesn't like the number 17 in particular, or something along those lines. 😛

I'm concerned we'll end up with something inconclusive if it's a gradual issue not related to any single commit--e.g. if the build (under our infra) got more flaky after 1.16 and less flaky after 1.17, and 1.17 just happened to fork at a nasty spot for our particular environment. (The hypothesis I've had in mind for this is that our binary size, build time, and the scan throughput could be interacting just perfectly wrong so the scan doesn't complete in time.)
