
Try 1ES pools for Windows builds to remove need for retry logic #300

Open
dagood opened this issue Dec 6, 2021 · 2 comments

dagood (Member) commented Dec 6, 2021

For this flakiness issue, we're adding retries to the Windows jobs:

The only known explanations for the flakiness are Windows antivirus or an Azure scan gone wrong and keeping the file open.
https://teams.microsoft.com/l/message/19:[email protected]/1638482594255?tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47&groupId=4d73664c-9f2f-450d-82a5-c2f02756606d&parentMessageId=1638482594255&teamName=.NET%20Core%20Eng%20Services%20Partners&channelName=First%20Responders&createdTime=1638482594255

You can never really fully shut off Defender or Windows Update (WU), but we do our best.

One thing we just noticed is that the Azure Security pack seems to be "turning on" for our VMs even when we specify the properties that should keep it off.

In the end, trying to figure it out is madness; Retry Is The Way.
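For context, the retry wrapper amounts to something like the following. This is a minimal Go sketch, not the actual build-script code: the `./make.bat` path and the fixed budget of 5 attempts are assumptions based on the retry log message mentioned later in this issue, and any nonzero exit is treated as retryable.

```go
// Minimal sketch of the retry wrapper described above. Illustrative only:
// the real logic lives in this repo's build scripts.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const maxAttempts = 5

func main() {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		// This is the log line we can later search for to count retries.
		fmt.Printf("Running 'make' attempt %d of %d...\n", attempt, maxAttempts)

		cmd := exec.Command("./make.bat") // assumed entry point
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err = cmd.Run(); err == nil {
			return // Success: no more attempts needed.
		}
		fmt.Printf("Attempt %d failed: %v\n", attempt, err)
	}
	fmt.Printf("All %d attempts failed; giving up: %v\n", maxAttempts, err)
	os.Exit(1)
}
```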

We've been using Microsoft-hosted pools. We could try out 1ES pools, where:

  • The environment is different:
    • The hardware is (last I heard) more powerful.
    • dnceng is more directly aware of the attempts made to disable scans. (The Microsoft-hosted agents do seem to attempt this, but we/dnceng only know that from reading the scripts; we weren't involved in writing them.)
    • Any change, big or small, could make this work more reliably. We don't have much info; we're not even sure of the cause.
      • E.g., a faster disk might let scans complete in time to avoid breaking our builds.
  • We're more likely to have success reporting and getting a fix for a 1ES pool bug, because we have more direct lines of communication.

After switching to the 1ES pool, we can use https://github.com/jaredpar/runfo to scan the pipeline logs for retries and see if the number goes down. (Look for `Running 'make' attempt 2 of 5...`.)
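To make the counting concrete, here's a rough sketch. It assumes the relevant pipeline logs have already been downloaded as plain-text files under a local `logs` directory; that layout, and doing the counting outside runfo, are assumptions for illustration.

```go
// Rough sketch: count retry log lines in already-downloaded pipeline logs.
// Assumes logs are plain-text files under ./logs (hypothetical layout).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	retries := 0
	err := filepath.Walk("logs", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		// Attempt 1 is the normal case; attempts 2+ mean a retry happened.
		for _, line := range strings.Split(string(data), "\n") {
			if strings.Contains(line, "Running 'make' attempt") &&
				!strings.Contains(line, "attempt 1 of") {
				retries++
			}
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("Retry attempts found: %d\n", retries)
}
```

Running this before and after the pool switch would give a simple before/after comparison of retry frequency.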

We can also run stress-test jobs like https://dev.azure.com/dnceng/internal/_build/results?buildId=1495624&view=results to get more data quickly.

qmuntal (Member) commented Dec 7, 2021

I have some newbie questions:

  • Why are there two sets of pools?
  • Which is the officially recommended pool?
  • Why did we initially choose Microsoft-hosted pools over 1ES?

dagood (Member, Author) commented Dec 7, 2021

The Microsoft-hosted pools are the same for everyone: non-Microsoft AzDO users and GitHub Actions users alike. They're based on https://github.com/actions/virtual-environments. These machines have historically been more available than other pools, so they're the default as long as they work. They tend to have less memory and disk space, but that's OK a lot of the time.

For a while there were pools maintained by dnceng that had beefier machines and, in some cases, could run signing jobs. You'd use them when needed. We were using them until dnceng sent us PR #231 to switch to 1ES pools.

Now, 1ES pools are taking over for the dnceng pools. Until today I thought they had the same use case, but it turns out we're supposed to use 1ES pools for any part of the build that can influence the bits our customers end up getting (our signed tar.gz/zips). So we'll have to switch over to those for our buildandpack builders at minimum, anyway.
