Intermittent HTTP 429 errors from git push in CI/CD when auto-committing docs and converted content #1698
Comments
@nikitawootten-nist per our conversation about the handoff this afternoon: if this impedes the sprint, or if the sprint gets lighter and you have spare cycles to investigate, let me know. It happens intermittently, and for the time being you now have sufficient perms to investigate as thoroughly as I do when it comes up.
I got it. I'm going to take a first look on Monday.
You know where to find me for oscalbuilder tokens and/or the admin-controlled area of the settings. I had looked into this before and got a very unclear explanation of when GitHub's API rate limits. Generally: actions/checkout#775
Would a "simple" (if a bit janky) solution be to fork https://github.com/peaceiris/actions-gh-pages to add an exponential backoff to the failing steps?
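For reference, the exponential backoff idea suggested here could be sketched roughly as follows. This is a minimal illustration, not code from actions-gh-pages or from this repository: the function names `backoff_delays` and `run_with_backoff`, the default delays, and the choice to retry on any non-zero exit code (rather than only on a 429) are all assumptions.

```python
import random
import subprocess
import time


def backoff_delays(retries, base=2.0, cap=60.0):
    """Return the un-jittered delay in seconds before each retry: base * 2**n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def run_with_backoff(cmd, retries=5, base=2.0, cap=60.0,
                     sleep=time.sleep, run=subprocess.run):
    """Run cmd, retrying failures with exponential backoff plus jitter.

    Hypothetical helper: retries on ANY non-zero exit code, since a plain
    `git push` exit status does not say whether the failure was an HTTP 429.
    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt + 1
        if attempt < retries:
            # Full jitter: sleep a random fraction of the current delay so
            # concurrent runners do not all retry at the same moment.
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError(f"{cmd!r} failed after {retries + 1} attempts")
```

A workflow step could then call something like `run_with_backoff(["git", "push", "origin", "nist-pages"])`. Note that if the underlying failure is a token or permissions problem rather than genuine rate limiting, retrying only delays the error.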
@Compton-NIST, when you have time, can you summarize where you are with your current work and thinking on this? Others were curious about pitching in, but I told them you had been starting work on this and that they should sync up with you if interested.
Added PR #1721 to start watching x-rate-limit headers in workflow executions. One thing I think we DO want to change is how often we run lychee-action. We run it on generation and validation, and we might want to do that only on validation. I do see 429s pretty consistently there, but things are odd between the lychee-action and the other steps. Hopefully the extra logging will give us some clues about the root cause. I do see others adding a backoff like @nikitawootten-nist suggested, but I wanted to see if we could make other adjustments first, or pick up on a potential root cause. More to come. We may end up needing to merge the PR above into a feature branch under OSCAL.
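To illustrate the kind of header logging described here, the sketch below pulls the rate-limit fields out of a response's headers. It is not the code in PR #1721; the function name `summarize_rate_limit` is hypothetical. It assumes GitHub's REST API header names, which are spelled `x-ratelimit-limit`, `x-ratelimit-remaining`, `x-ratelimit-used`, and `x-ratelimit-reset` (the last is a UTC epoch timestamp).

```python
from datetime import datetime, timezone


def summarize_rate_limit(headers):
    """Extract GitHub rate-limit fields from a mapping of response headers.

    Header names are matched case-insensitively; missing fields come back as None.
    """
    lowered = {k.lower(): v for k, v in headers.items()}

    def as_int(name):
        value = lowered.get(name)
        return int(value) if value is not None else None

    reset_epoch = as_int("x-ratelimit-reset")
    return {
        "limit": as_int("x-ratelimit-limit"),
        "remaining": as_int("x-ratelimit-remaining"),
        "used": as_int("x-ratelimit-used"),
        # Convert the epoch-seconds reset time to an ISO 8601 UTC string.
        "reset_at": (
            datetime.fromtimestamp(reset_epoch, tz=timezone.utc).isoformat()
            if reset_epoch is not None else None
        ),
    }
```

Logging this summary after each API-touching step would show whether `remaining` is actually approaching zero when a 429 appears, or whether the 429 arrives with plenty of quota left.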
If we're going the route of "making less noise", I think it's worth noting that a lot of the URLs being checked by Lychee are duplicates with anchor tags ( |
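As a rough illustration of the duplicate problem, URLs that differ only in their anchor point at the same resource, so they could be collapsed before checking. The helper name `unique_targets` and the example URLs are hypothetical, and this sketch says nothing about how lychee itself handles fragments.

```python
from urllib.parse import urldefrag


def unique_targets(urls):
    """Drop #fragments and return each distinct URL once, in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        # urldefrag splits "page#anchor" into ("page", "anchor").
        base, _fragment = urldefrag(url)
        if base not in seen:
            seen.add(base)
            result.append(base)
    return result
```

For example, `unique_targets(["https://example.com/docs#a", "https://example.com/docs#b"])` yields a single entry, so only one request would be made for that page.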
@aj-stein-nist needs to check out Chris' feature branch and start looking at the actual metrics for 429s. Nice work, @Compton-NIST.
We need to pick a feature branch to integrate it into so we can monitor 429 errors, evaluate potential sources, and track down the root cause. @Compton-NIST did not mention any particular feature branch, so we will just pick one. Chris also mentioned that lychee appears to check some links four times in a row, not just once; we should investigate that further iff it impacts this rate limiting.
* Pushing generated website pages [ci skip]
* Pushing generated website pages [ci skip]
* Pushing generated website pages [ci skip]
* Pushing generated website pages [ci skip]
* Pushing generated website pages [ci skip]
* Track rate limits as workflows execute. #1698
* Check rate limit if token present.
---------
Co-authored-by: OSCAL GitHub Actions Bot <[email protected]>
After more and more examination on my personal fork, I am starting to believe the 429s are a potential issue, but that this symptom is regrettably masking a permissions issue. I am not 100% clear on that yet. More testing is needed and more to follow, potentially opening a new separate issue once I know more about what the actual problem is.
Well, that's very, very frustrating, and this is a teachable moment for me. The 429 rate limiting, if I can guess from logging, is maybe the consequence of the same runner with our repo running a token with improper permissions and/or an expired token. I have set up a specific token for debugging with the same oscalbuilder account. What is odd is that it works only with slightly more elevated permissions than the labels on the PAT token selection imply. Previously, the only permission for that PAT and it had worked reliably in the past (in comparison to the other stable one) only had I pushed through a test run for now, but removed permissions again for the weekend to examine again next week first thing. It is quite odd that the "Update GitHub Action workflows" permission is needed for the workflow itself, but I will continue with more trial and error and nail down specific documentation that indicates it is needed, given the confusing label. That said, the rate limit checking is useful and we need to figure out what to do with this issue. More to follow.
I am going to move off work on this to #1726 because a symptom of the root cause was triggered by this work, but it is not strictly related to this issue. We can circle back to complete this in the sprint though; it is very useful.
Thanks for your work @Compton-NIST, I am marking this one done for this sprint.
As discussed in sprint review today, this work was done, but it does need to be integrated and pulled in, not just into the feature branch, for early warning detection to prevent any more #1726-style issues. I am doing that and will close this when we merge in a cross-issue PR shortly.
…1726) (#1731)
* Correct GHA action-gh-pages argument for #1726
By doing this, we correct the PAT usage and do not, ironically, use an existing but improperly permissioned GITHUB_TOKEN provided as a context machine identity for all runs of all workflows. This should fix the builds and stop the cryptic HTTP 429 rate limit error response. It's cryptic because you get a 429 response after one single API operation (with git clone) because the token is wrong.
* Remove lychee scans during site build for #1726
As part of the troubleshooting work, GH docs do indicate scanning links from the GHA runners can potentially cause rate limiting. We have automated nightly scans and we review code changes as part of PRs. We can forgo commit-by-commit link scanning as a short-to-medium term mitigation and enable it again later.
* Add @Compton-NIST's rate limit checks from #1698 for #1726.
Describe the bug
This has been happening with increasing frequency: despite valid authentication tokens that do not expire, auto-committing with the oscalbuilder service account fails intermittently with HTTP 429 response code errors (NOTE: we transitioned to PAT tokens, so the git push in CI/CD happens via HTTP, not SSH, authenticated with the PAT). I have not been able to find a satisfactory answer to why this happens, but it does occur from time to time.
Who is the bug affecting
Anyone trying to push website content changes (to the nist-pages branch) or auto-generated content (to main, develop, or feature-* branches).
What is affected by this bug
CI/CD
How do we replicate this issue
Re-running the same GHA workflows over and over again after failures at the very end.
Review these workflow runs, specifically run attempts 1, 2, and 3 rather than the 4th and final one. Other examples from last evening required six runs before success.
Expected behavior (i.e. solution)
Authentication, authorization, and git push to official repo branches are correct and without error.
Other comments
No response