
Intermittent HTTP 429 errors from git push in CI/CD when auto-committing docs and converted content #1698

Closed
aj-stein-nist opened this issue Mar 3, 2023 · 14 comments · Fixed by #1731
Assignees
Labels
bug Scope: CI/CD Enhancements to the project's Continuous Integration and Continuous Delivery pipeline.

Comments

@aj-stein-nist
Contributor

Describe the bug

This has been happening with increasing frequency: despite valid authentication tokens that do not expire, auto-committing with the oscalbuilder service account intermittently fails with HTTP 429 response code errors (NOTE: we transitioned to PAT tokens, so the git push in CI/CD happens via HTTPS authenticated with the PAT, not SSH). I have not been able to find a satisfactory answer as to why this happens, but it does occur from time to time.
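For context, a minimal sketch of the kind of PAT-authenticated HTTPS push the workflow performs (the user and secret names below are placeholders for illustration, not the actual workflow configuration):

```bash
# Illustrative only: BOT_USER and OSCALBUILDER_PAT are placeholder names, not the
# real secrets; the workflow pushes over HTTPS with the PAT rather than SSH.
git config user.name "OSCAL GitHub Actions Bot"
git remote set-url origin "https://${BOT_USER}:${OSCALBUILDER_PAT}@github.com/usnistgov/OSCAL.git"
git push origin HEAD:nist-pages   # this push is what intermittently fails with HTTP 429
```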

Who is the bug affecting

Anyone trying to push website content changes (to nist-pages branch) or auto-generated content (to main, develop, or feature-* branches).

What is affected by this bug

CI/CD

How do we replicate this issue

Re-running the same GHA workflows over and over again after failures at the very end.

Review these workflow runs, specifically run attempts 1, 2, and 3 (not the 4th and final attempt). Other examples from last evening required six runs before success.

Expected behavior (i.e. solution)

Authentication, authorization, and git push to official repo branches all succeed without error.

Other comments

No response

@aj-stein-nist added the bug and Scope: CI/CD labels Mar 3, 2023
@aj-stein-nist
Contributor Author

@nikitawootten-nist per our conversation about the handoff this afternoon, if this impedes the sprint, or if the sprint gets lighter and you have spare cycles to investigate, let me know. It happens intermittently, and for the time being you now have sufficient perms to investigate as thoroughly as I can when it comes up.

@Compton-US
Contributor

I got it. I'm going to take a first look on Monday.

@aj-stein-nist
Contributor Author

I got it. I'm going to take a first look on Monday.

You know where to find me for the oscalbuilder tokens and/or the admin-controlled areas of the settings. I had looked into this before and got a very unclear explanation of when GitHub's API rate limits apply.

Generally: actions/checkout#775
From my previous report (not from git clone, but similar behavior in different context and one day it just went away): https://github.com/github/docs/discussions/17620

@nikitawootten-nist
Contributor

Would a "simple" (if a bit janky) solution be to fork https://github.com/peaceiris/actions-gh-pages to add an exponential backoff to the failing steps?

@aj-stein-nist moved this from Todo to In Progress in NIST OSCAL Work Board Mar 21, 2023
@aj-stein-nist
Contributor Author

@Compton-NIST, when you have time, can you summarize where you are with your current work and thinking on this? Others were curious about pitching in, but I told them you had already started work on this and that they should sync up with you if interested.

Compton-US pushed a commit to Compton-US/OSCAL that referenced this issue Mar 21, 2023
@Compton-US
Contributor

Added PR #1721 to start watching x-rate-limit headers in workflow executions.

One thing I think we DO want to change is how often we run lychee-action. We run it on generation and on validation, and we might want to just do that on validation. I do see 429s pretty consistently there, but things are odd between the lychee-action and the other steps. Hopefully the extra logging will give us some clues on the root cause.

I do see others adding a backoff like @nikitawootten-nist suggested, but I wanted to see if we could make other adjustments first, or pick up on a potential root cause. More to come. We may end up needing to merge the PR above into a feature- branch under OSCAL.
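For anyone following along, the kind of header logging involved can be sketched with a generic call against the GitHub API; this is not the exact step from #1721, and the token variable is a placeholder for whichever token the step uses.

```bash
# Generic illustration of inspecting GitHub API rate limit state from a runner;
# GITHUB_TOKEN here stands in for whatever token the workflow step actually uses.
curl -sS -H "Authorization: token ${GITHUB_TOKEN}" \
  https://api.github.com/rate_limit | jq '.resources.core'

# Or log the x-ratelimit-* response headers directly:
curl -sSI -H "Authorization: token ${GITHUB_TOKEN}" \
  https://api.github.com/rate_limit | grep -i '^x-ratelimit'
```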

@nikitawootten-nist
Contributor

One thing I think we DO want to change is how often we run lychee-action. We run it on generation and on validation, and we might want to just do that on validation. I do see 429s pretty consistently there, but things are odd between the lychee-action and the other steps. Hopefully the extra logging will give us some clues on the root cause.

If we're going the route of "making less noise", I think it's worth noting that a lot of the URLs being checked by Lychee are duplicates with anchor tags (pages.nist.gov/.../#heading1 and the like), which Lychee does not filter out before checking. Is it worth looking into removing duplicate checks that are potentially eating into some quota?
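As a rough sketch of that dedup idea (assuming the candidate URLs could first be collected into a plain text list, which is not something the current workflows do):

```bash
# Rough sketch: strip '#fragment' anchors and deduplicate a URL list before
# handing it to the link checker. Assumes urls.txt already exists; no step
# like this exists in the workflows today.
sed -E 's/#.*$//' urls.txt | sort -u > urls-deduped.txt
wc -l urls.txt urls-deduped.txt   # compare how many checks the dedup would save
```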

@aj-stein-nist
Contributor Author

@aj-stein-nist needs to check out Chris' feature branch and start looking at the actual 429 metrics. Nice work, @Compton-NIST.

@Compton-US moved this from In Progress to Under Review in NIST OSCAL Work Board Mar 22, 2023
@aj-stein-nist
Contributor Author

We need to pick a feature branch to integrate it into so we can monitor 429 errors, evaluate potential sources, and track down the root cause. @Compton-NIST did not mention any particular feature branch, so we will just pick one.

Also, Chris mentioned lychee is checking some links 4 times in a row, not just once per appearance; we should investigate that further if it impacts this rate limiting.

aj-stein-nist pushed a commit to Compton-US/OSCAL that referenced this issue Mar 24, 2023
aj-stein-nist pushed a commit that referenced this issue Mar 24, 2023
* Pushing generated website pages [ci skip]

* Pushing generated website pages [ci skip]

* Pushing generated website pages [ci skip]

* Pushing generated website pages [ci skip]

* Pushing generated website pages [ci skip]

* Track rate limits as workflows execute. #1698

* Check rate limit if token present.

---------

Co-authored-by: OSCAL GitHub Actions Bot <[email protected]>
@aj-stein-nist
Contributor Author

After more examination on my personal fork, I am starting to believe the 429s are a potential issue, but this symptom is regrettably masking a permissions issue; I am not 100% clear on that yet. More testing is needed and more to follow, potentially opening a separate issue once I know more about what the actual problem is.

@aj-stein-nist self-assigned this Mar 25, 2023
@aj-stein-nist
Contributor Author

Well, that's very, very frustrating, and this is a teachable moment for me. The 429 rate limiting, if I can guess from the logging, is maybe the consequence of the runner for our repo using a token with improper permissions and/or an expired token. I have set up a specific token for debugging with the same oscalbuilder account. What is odd is that it works only with slightly more elevated permissions than the labels on the PAT token selection imply.

[Screenshot: PAT permission (scope) selection in the GitHub UI]

Previously, the only permission on that PAT was repo:public_repo, as it has had for a long time, and it had worked reliably in the past (in comparison to the other, stable one). That stable PAT has most certainly not expired, per the GitHub UI.

I pushed a test run through for now, but removed the permissions again for the weekend, to examine first thing next week. It is quite odd that the "Update GitHub Action workflows" permission is needed for the workflow itself, but I will continue with more trial and error and nail down the specific documentation that indicates it is needed, given the confusing label.

That said, the rate limit checking is useful and we need to figure out what to do with this issue. More to follow.
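(For reference, one quick way to confirm which scopes a classic PAT actually carries is to read the X-OAuth-Scopes header from any authenticated API response; DEBUG_PAT below is just a placeholder name for the debugging token.)

```bash
# Quick sanity check of a classic PAT's granted scopes; DEBUG_PAT is a placeholder.
# GitHub echoes the token's scopes back in the X-OAuth-Scopes response header.
curl -sSI -H "Authorization: token ${DEBUG_PAT}" https://api.github.com/user \
  | grep -i '^x-oauth-scopes'
```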

@aj-stein-nist
Contributor Author

I am going to move my work over to #1726 because a symptom of the root cause was triggered by this work, but it is not strictly related to this issue. We can circle back to complete this in the sprint though; it is very useful.

@aj-stein-nist moved this from Under Review to Done in NIST OSCAL Work Board Mar 30, 2023
@aj-stein-nist
Contributor Author

Thanks for your work @Compton-NIST, I am marking this one done for this sprint.

@aj-stein-nist
Contributor Author

As discussed in sprint review today, this work was done, but it does need to be integrated and pulled in, not just into the feature branch, to provide early warning detection and prevent any more #1726-style issues. I am doing that now and will close this when we merge a cross-issue PR shortly.

@aj-stein-nist reopened this Mar 31, 2023
@github-project-automation bot moved this from Done to In Progress in NIST OSCAL Work Board Mar 31, 2023
@aj-stein-nist moved this from In Progress to Under Review in NIST OSCAL Work Board Mar 31, 2023
aj-stein-nist added a commit that referenced this issue Apr 3, 2023
…1726) (#1731)

* Correct GHA action-gh-pages argument for #1726

By doing this, we correct the PAT usage and no longer, ironically, use an existing but
improperly permissioned GITHUB_TOKEN provided as a contextual machine identity
for all runs of all workflows. This should fix the builds and stop the cryptic
HTTP 429 rate limit error response. It's cryptic because you get a 429 response
after a single API operation (with git clone) because the token is wrong.

* Remove lychee scans during site build for #1726

As part of the troubleshooting work, GH docs do indicate that scanning links from the GHA
runners can potentially cause rate limiting. We have automated nightly scans and we
review code changes as part of PRs. We can forgo commit-by-commit link scanning as
a short-to-medium term mitigation and enable it again later.

* Add @Compton-NIST's rate limit checks from #1698 for #1726.
@github-project-automation bot moved this from Under Review to Done in NIST OSCAL Work Board Apr 3, 2023