
Try 1ES pools for Windows builds to remove need for retry logic #300

Open
dagood opened this issue Dec 6, 2021 · 2 comments

dagood (Member) commented Dec 6, 2021

For this flakiness issue, we're adding retries to the Windows jobs:

The only known explanations for the flakiness are Windows antivirus or an Azure scan gone wrong and keeping the file open.
https://teams.microsoft.com/l/message/19:[email protected]/1638482594255?tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47&groupId=4d73664c-9f2f-450d-82a5-c2f02756606d&parentMessageId=1638482594255&teamName=.NET%20Core%20Eng%20Services%20Partners&channelName=First%20Responders&createdTime=1638482594255

You can never really fully shut off Defender or Windows Update (WU), but we do our best.

One thing we just noticed is that the Azure Security pack seems to be "turning on" for our VMs even when we specify the properties that should keep it off.

In the end, trying to figure it out is madness; Retry Is The Way.
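For context, the retry wrapper amounts to something like the following. This is a minimal Go sketch, not the actual build-script code: the `./make.bat` path and the fixed budget of 5 attempts are assumptions based on the retry log message mentioned later in this issue, and any nonzero exit is treated as retryable.

```go
// Minimal sketch of the retry wrapper described above. Illustrative only:
// the real logic lives in this repo's build scripts.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const maxAttempts = 5

func main() {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		// This is the log line we can later search for to count retries.
		fmt.Printf("Running 'make' attempt %d of %d...\n", attempt, maxAttempts)

		cmd := exec.Command("./make.bat") // assumed entry point
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err = cmd.Run(); err == nil {
			return // Success: no more attempts needed.
		}
		fmt.Printf("Attempt %d failed: %v\n", attempt, err)
	}
	fmt.Printf("All %d attempts failed; giving up: %v\n", maxAttempts, err)
	os.Exit(1)
}
```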

We've been using Microsoft-hosted pools. We could try out 1ES pools, where:

  • The environment is different:
    • The hardware is (last I heard) more powerful.
    • dnceng is more directly aware of the attempts made to disable scans. (The Microsoft-hosted agents do seem to attempt this, but we/dnceng only know that from reading the scripts; we weren't involved in writing them.)
    • Any change, big or small, could make this work more reliably. We don't have much info; we're not even sure of the cause.
      • E.g., a faster disk might let scans complete in time to avoid breaking our builds.
  • We're more likely to have success reporting and getting a fix for a 1ES pool bug, because we have more direct lines of communication.

After switching to the 1ES pool, we can use https://github.com/jaredpar/runfo to scan the pipeline logs for retries and see if the number goes down. (Look for `Running 'make' attempt 2 of 5...`.)
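To make the counting concrete, here's a rough sketch. It assumes the relevant pipeline logs have already been downloaded as plain-text files under a local `logs` directory; that layout, and doing the counting outside runfo, are assumptions for illustration.

```go
// Rough sketch: count retry log lines in already-downloaded pipeline logs.
// Assumes logs are plain-text files under ./logs (hypothetical layout).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	retries := 0
	err := filepath.Walk("logs", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		// Attempt 1 is the normal case; attempts 2+ mean a retry happened.
		for _, line := range strings.Split(string(data), "\n") {
			if strings.Contains(line, "Running 'make' attempt") &&
				!strings.Contains(line, "attempt 1 of") {
				retries++
			}
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("Retry attempts found: %d\n", retries)
}
```

Running this before and after the pool switch would give a simple before/after comparison of retry frequency.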

We can also run stress-test jobs like https://dev.azure.com/dnceng/internal/_build/results?buildId=1495624&view=results to get more data quickly.

qmuntal (Member) commented Dec 7, 2021

I have some newbie questions:

  • Why are there two sets of pools?
  • Which is the officially recommended pool?
  • Why did we initially choose Microsoft-hosted pools over 1ES?

dagood (Member, Author) commented Dec 7, 2021

The Microsoft-hosted pools are the same for everyone: non-Microsoft AzDO users and GitHub Actions users alike. They're based on https://github.com/actions/virtual-environments. These machines have historically been more available than other pools, so they're the default as long as they work. They tend to have less memory and disk space, but that's OK a lot of the time.

For a while there were pools maintained by dnceng that had beefier machines and, in some cases, could run signing jobs. You'd use them when needed. We were using them until dnceng sent us PR #231 to switch to 1ES pools.

Now, 1ES pools are taking over for the dnceng pools. Until today I thought they had the same use case, but it turns out we're supposed to use 1ES pools for any part of the build that can influence the bits our customers end up getting (our signed tar.gz/zips). So we'll have to switch over to those for our buildandpack builders at minimum, anyway.
