windows-amd64 build flakiness: "compile.exe: Access is denied." #241
actions/runner-images#4086 (comment) (Sept 15) mentions an "internal monitor" they disabled when something similar showed up for someone else. Maybe the monitor was turned back on, or something similar was implemented independently. (Or maybe this is all unrelated and it's just Defender, like we first thought.)
In my experience, Defender hits some kind of choke point when you move executable files around. (Also in my experience, exclusion rules on Defender don't actually work with ATP (Advanced Threat Protection), so this is the kind of thing that's easy to believe is solved, because it's a flaky problem, without it actually being fixed.)
Apparently only one person other than us has ever hit this: golang/go#21608 (the only Google result for "compile.exe: Access is denied.").
Basically, there's a script used in hosted agent image generation that's intended to turn off Defender: https://github.com/actions/virtual-environments/blob/main/images/win/scripts/Installers/Configure-Antivirus.ps1. But, generally speaking, it never gets fully disabled. This could also be caused by some Azure-specific scanning rather than Defender. I've hit this kind of thing before, and the only reasonable way forward was to retry.

We can probably add some Windows-specific retry logic to our build to mitigate this flakiness. To keep the retries quick so they don't delay the build much, I believe we could add them to the individual copy call:

Lines 1053 to 1059 in 677e561
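As a rough illustration, a retry wrapper around the copy could look something like the sketch below. The helper names, retry count, and backoff are all hypothetical, not the actual code at the lines referenced above:

```go
package mitigation

import (
	"fmt"
	"io"
	"os"
	"time"
)

// copyFile is a plain file copy, standing in for the copy helper the
// build actually uses at the referenced lines.
func copyFile(dst, src string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}

// copyFileWithRetry retries the copy a few times with a short backoff.
// On Windows, a virus scanner can briefly hold a handle on a freshly
// written executable, producing transient "Access is denied" errors.
func copyFileWithRetry(dst, src string) error {
	const attempts = 3 // hypothetical count, not a tuned value
	var err error
	for i := 0; i < attempts; i++ {
		if err = copyFile(dst, src); err == nil {
			return nil
		}
		// Give the scanner a moment to release the file before retrying.
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return fmt.Errorf("copy %s -> %s failed after %d attempts: %w", src, dst, attempts, err)
}
```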
But ideally we should avoid this: it seems to me it would be hard to justify upstreaming such a niche mitigation. If there were more reports than golang/go#21608, and if it were easier to point at this as the root cause, maybe it would be worth it. Alternatively, we can add retries to the entire build command:

go/eng/_core/cmd/build/build.go Line 112 in 677e561
This means that if we hit a retry, we'd be redoing some build work, and it could become harder to notice other Windows-specific flakiness that comes up elsewhere in the build. But this way we only modify our own code, so we don't have to worry about upstreaming.
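Sketched roughly, the whole-build retry could look like this. The entry point and retry count are placeholders, not what build.go actually runs at the line referenced above:

```go
package mitigation

import (
	"os"
	"os/exec"
	"runtime"
)

// buildAttempts returns how many times we're willing to run the build.
// We only retry on Windows, where the scanner-related flakiness shows up.
func buildAttempts() int {
	if runtime.GOOS == "windows" {
		return 3 // hypothetical count
	}
	return 1
}

// runBuild invokes the upstream build once. The command here is a
// placeholder for whatever build.go actually executes.
func runBuild() error {
	cmd := exec.Command("bash", "./src/make.bash") // placeholder invocation
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// runBuildWithRetry redoes the entire build on failure, trading repeated
// work for resilience against transient "Access is denied" errors.
func runBuildWithRetry() error {
	var err error
	for i := 0; i < buildAttempts(); i++ {
		if err = runBuild(); err == nil {
			return nil
		}
	}
	return err
}
```

The tradeoff is exactly as described above: the copy-level retry is fast but would need upstreaming, while this outer retry stays in our own code at the cost of redoing work.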
I think we're hitting this issue more than before because it reproduces in go1.17 but not in main, and I'm doing the FIPS work in go1.17. Evidence:
I'll try to find the commit that fixes the issue in case we can backport it, but I don't promise anything 😄
Yeah, I suppose doing a (rough) binary search would be reasonable if it's not too much effort. Maybe we should also try a build from just before the release branch fork. I'm concerned we'll end up with something inconclusive if it's a gradual issue not tied to any single commit, e.g. if the build (under our infra) got more flaky after 1.16 and less flaky after 1.17, and 1.17 just happened to fork at a nasty spot for our particular environment. (The hypothesis I've had in mind is that our binary size, build time, and the scan throughput could be interacting just perfectly wrong, so the scan doesn't complete in time.)
All instances I've found so far are on 1.17 branches:
https://dev.azure.com/dnceng/internal/_build/results?buildId=1433378&view=logs&j=b2ef2c90-30c2-5075-badc-2feb75a64cee&t=b6826774-f899-5802-9e9c-ca3c12e8a4c0&l=38
https://dev.azure.com/dnceng/internal/_build/results?buildId=1442944&view=logs&j=5a2439d3-40cd-5cda-520e-11af88128fb3&t=a1b03d73-1679-5340-7eef-275cc3387d35&l=83
https://dev.azure.com/dnceng/internal/_build/results?buildId=1433379&view=logs&j=5a2439d3-40cd-5cda-520e-11af88128fb3&t=a1b03d73-1679-5340-7eef-275cc3387d35&l=38
The first thing that comes to mind is some virus scan locking it, potentially related to the recent build pool migrations.