Intermittent Hangs on Registry Push #1568
Comments
An interesting data point, I think: we see this issue when connected to the console of an EC2 instance (using either SSM or SSH). When running this locally, pointed at the same K8S cluster, we have no issues. |
I think this may have something to do with using |
…ushes (#1590)

## Description

This PR creates a tunnel per image push (making it easier to implement concurrency; may do that in this PR if we can confirm that issues are mitigated), moves the CRC from the image name to the tag, and changes the UI to use a progress bar instead of a spinner for better user feedback.

## Related Issue

Relates to #1568, #1433, #1218, #1364

This also will make #1594 slightly easier. (See aws/containers-roadmap#853)

Fixes: #1541

## Type of change

- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [X] Test, docs, adr added or updated as needed
- [X] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow) followed
@dgershman have you gotten a chance to try v0.26.0 and are you seeing any issues with that and k3d? |
Still having issues with K3D and tried with
|
Thanks for the update @dgershman. The SSH note might be a good lead to look into as well: some, but not all, had reported that this was happening on EC2 VMs over SSH. I had thought it was bad networking and tested with that locally, but potentially there is something more going on with driving image uploads from an SSH session. I'll see if I can keep digging. |
@dgershman I'm doing some more testing on this, but if you get a chance, does the code in #1721 solve the hanging for you? (Note that parts of the push/pull of OCI Zarf packages are broken on that PR. Not sure if that is in your flow, but it won't affect local Zarf packages.) |
Putting this on my todo list for today. |
I ran one test so far that looked good. Will do two more tomorrow. |
I think we are good! |
Ok, going to clean up the PR and then roll back one change we made that didn't fix anything (and at this point only serves to slow down image pushes). Will leave this open, though, in case this doesn't fully solve the issue. The main thought in the PR is that the multithreaded jobs in crane are either overwhelming the resource limits of the registry pod, stepping on each other and ending up in a stuck state, or both. |
## Description

Creating this PR to test the performance impact of setting crane jobs to `1` (which has initially proven effective in resolving the issue of registry pushes hanging).

## Related Issue

Relates to #1568

Fixes #1656

Fixes #1734

## Type of change

- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [X] Test, docs, adr added or updated as needed
- [X] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow) followed

---------

Co-authored-by: Cole Winberry <[email protected]>
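For readers unfamiliar with the idea of capping parallel jobs, here is a generic shell sketch of it. This is not how crane or Zarf implement it. `run_limited` is a hypothetical helper, and it relies on the `-P` option that GNU and BSD `xargs` provide. With a job count of `1`, each item is handled strictly one after another, which is the behavior the PR above tests for pushes.

```shell
# run_limited JOBS COMMAND ITEM...
# Hypothetical helper: runs COMMAND once per ITEM, with at most JOBS
# invocations in flight at the same time. JOBS=1 means fully serial.
# Assumptions: COMMAND is an external program (not a shell function),
# and items contain no whitespace or quotes.
run_limited() {
  jobs="$1"
  cmd="$2"
  shift 2
  # Nothing to do when no items were given.
  [ "$#" -gt 0 ] || return 0
  # -n 1: one item per invocation; -P: cap on parallel invocations.
  printf '%s\n' "$@" | xargs -n 1 -P "$jobs" "$cmd"
}
```

`xargs` exits non-zero if any invocation fails, so a caller can still detect a failed item when running serially.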
Moving this back a milestone to give the fixes in v0.27.0 some time to bake with the community to see if this is truly fixed. |
Closing as fixed for now - we can reopen if it presents itself again |
We are still seeing this issue in zarf version 0.29.2. We have a multi-node cluster on AWS EC2 running RKE2 v1.26.9+rke2r1. Our package size is about 2.9G, and the "zarf package deploy..." command hangs roughly 80% of the time. A retry usually works fine. Here are a few things that we noticed after some extensive testing:
Can you please re-open this issue, else let me know if I should create a new issue? Thanks. |
Environment
Device and OS: AMD64 / Ubuntu 22.04
App version: 0.25.2
Kubernetes distro being used: K3D K3S v1.25.7+k3s1
Other: Big Bang 1.57.1
Steps to reproduce
zarf package deploy zarf-package-big-bang-example-amd64-1.57.1.tar.zst --confirm --log-level=trace
Expected result
That things wouldn't get hung up, and continue along.
Actual Result
Things get hung up and/or sometimes fail / timeout.
Visual Proof (screenshots, videos, text, etc)
Full logs:
full-logs.txt
Severity/Priority
There is a workaround: keep retrying until the process succeeds.
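The retry workaround can be scripted. The following is a minimal sketch, not part of Zarf: `retry` is a hypothetical helper, the attempt count, time limit, and `RETRY_DELAY` variable are assumptions to tune, and it depends on the `timeout` command from GNU coreutils so that a hung attempt is killed instead of blocking forever.

```shell
# retry MAX_ATTEMPTS TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under a time limit and re-runs it
# until it succeeds or MAX_ATTEMPTS is reached.
# COMMAND must be an external program, because `timeout` cannot run
# shell functions.
retry() {
  max_attempts="$1"
  timeout_secs="$2"
  shift 2
  attempt=1
  while true; do
    # `timeout` kills the command once the limit passes, so a hang
    # counts as a failed attempt.
    if timeout "$timeout_secs" "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed or timed out; retrying" >&2
    attempt=$((attempt + 1))
    # Pause between attempts; RETRY_DELAY (seconds) is an assumed knob.
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example (values are illustrative only):
# retry 5 900 zarf package deploy zarf-package-big-bang-example-amd64-1.57.1.tar.zst --confirm
```

The time limit needs to be comfortably longer than a healthy deploy, otherwise the wrapper will kill attempts that would have succeeded.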
Additional Context
`bigbang` component. `zarf init` as well, specifically when pushes are happening to the registry.