g10k cache sometimes gets corrupted #76
Comments
Next time it happens I will make sure I save a copy of the corrupted cache. |
The issue may be that we are simply allowing g10k to fail and give up here: Lines 164 to 172 in de149b1
Really, we can't allow this tool to ever just give up and fail in production - especially if the only problem is a corrupted Git cache. It should just delete the problematic cache and try again. |
An earlier instance of the output when this failed:
|
could mean that the initial I tried this Puppetfile
If there are only a handful of Puppet modules that are hosted on an unreliable Git server, then you can add it directly to the module:
Or you can add a global setting in your g10k config to allow all your Git modules to fail and your g10k run to continue.
I don't agree with this. In my setup I want g10k to fail if anything is unreachable, because I only sync the g10k-populated environments to my Puppetservers if g10k ran successfully. I'd rather have an older working Puppet environment than a corrupted, half-populated environment in production.
Checking the local Git repository first, clearing it, and retrying could be a solution, but then how often should g10k try this? What should it do if the Git repository is completely unreachable? What we can agree on is that the cached Git repository should never be empty or corrupted. |
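To make the "how often should it retry" question concrete, here is a minimal Python sketch of a bounded clean-and-retry loop. It is not g10k's code: `sync_with_retry`, `sync`, and `max_retries` are hypothetical names, and `sync` stands in for whatever clones or updates the cache. With `max_retries=1` the cache is wiped at most once, and a repository that is truly unreachable still makes the run fail, so a half-populated environment never gets synced.

```python
import shutil
from pathlib import Path
from typing import Callable


def sync_with_retry(cache_dir: Path,
                    sync: Callable[[Path], None],
                    max_retries: int = 1) -> int:
    """Run sync(cache_dir); on failure wipe the cache and try again.

    Returns the number of attempts that were needed. When the retries are
    used up, the last error is re-raised so the caller still sees a failure.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            sync(cache_dir)
            return attempts
        except Exception:
            if attempts > max_retries:
                # Out of retries: give up loudly instead of hiding the error.
                raise
            # Assume the cache may be corrupted: delete it so the next
            # attempt starts from a clean state.
            shutil.rmtree(cache_dir, ignore_errors=True)
```

A caller that wants the old fail-fast behaviour can pass `max_retries=0`, which makes the function a plain pass-through.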
Yes. I think the basic principle is that the cached Git repository should never be empty or corrupted, whereas I am seeing them corrupted quite often. I estimate g10k is being called dozens of times per day in about 50 AWS accounts at my site, and I'm getting a corrupted cache maybe once a fortnight. I can confirm that each time I have seen the cache corrupted, it would fail repeatedly until I deleted the cache, at which point it would always succeed. I guess the next thing to do is for me to wait until this happens again, and make a copy of the corrupted cache. I take it you're saying you haven't actually seen this before? |
I have an example of one of the problematic g10k caches saved now. Here's the problem:
I'll see what else I can glean from this tar ball. |
Reminder to me: I have this saved as a tarball as |
Use of
|
On the other hand if I clone the upstream repo again and run the fsck command:
|
See this Stack Overflow post here, which seems to describe the same problem for others: |
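For anyone who wants to script the same check, here is a rough Python sketch that runs `git fsck` against a cache directory and reports whether it passed. It assumes the cache is a bare repository, so the directory itself is the Git dir; `cache_passes_fsck` is a made-up name and this is not something g10k does itself.

```python
import subprocess
from pathlib import Path


def cache_passes_fsck(git_dir: Path) -> bool:
    """Return True if `git fsck` exits with status 0 for the given Git dir.

    Assumes a bare repository (the directory itself is the Git dir) and
    that the git binary is on PATH.
    """
    result = subprocess.run(
        ["git", "--git-dir=" + str(git_dir), "fsck"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit status means git reported a problem, or the
    # directory is not a Git repository at all.
    return result.returncode == 0
```

A cache that fails this check is a candidate for deleting and re-cloning.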
Even after running the fsck command above, the |
@xorpaul , I think if the |
@xorpaul Also, if you would like a tarball of the corrupted Git repo I saved, and copy of the same repo after cloning a fresh copy, let me know where I can send it. |
@alexharv074 Thanks for the debug info. g10k is just calling the git binary to clone and update the local Git repository. If the remote Git server is unable to respond appropriately, or sends a corrupted state of the repository, then the only thing g10k can do is retry the checkout. What Git server are you using? Is it running on a VM or hardware? You should open a ticket with the Git server project and include this information (cloning and updating multiple repositories at the same time is probably overloading the Git server, so that it sends invalid responses). Maybe you can adjust some settings (worker processes, web server processes) so that g10k doesn't overload your server. I'll have a look at the In the meantime you could try limiting the number of parallel checkouts and pulls with the |
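The idea of capping parallel checkouts can be sketched in a few lines of Python. This only illustrates the technique, not how g10k implements it: `sync_all`, `sync_one`, and `max_parallel` are hypothetical names, and `sync_one` stands in for a single clone or pull against the Git server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def sync_all(urls: Sequence[str],
             sync_one: Callable[[str], str],
             max_parallel: int = 4) -> List[str]:
    """Run sync_one for every URL with at most max_parallel running at once.

    Results come back in the same order as the input URLs. If any call
    raises, the exception propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # The pool never runs more than max_parallel workers, which caps
        # the number of simultaneous requests hitting the Git server.
        return list(pool.map(sync_one, urls))
```

Lowering `max_parallel` trades a longer run for less load on the server, which is the trade-off suggested above for an overloaded Git host.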
The Git server is GitLab 8.16.4, running on a RHEL 6 EC2 instance, and the Git client is version 1.7.1. In any case, a clean & retry mechanism makes a lot of sense to me, whatever the root cause is here. Whether it's the Git server's fault, or whether it's just a random corruption of a cloned Git repo, I still would not expect the tool to give up in production if the problem is that it has corrupted data in its cache. Not sure how hard it is to implement the feature I proposed, of course. I would send a PR if only I knew Golang. |
Try out the new v0.4 release: https://github.com/xorpaul/g10k/releases/tag/v0.4 You can limit the number of Goroutines with |
… setting for corrupted Git caches #76
Now you can also retry failed Git commands with 0.4.1 https://github.com/xorpaul/g10k/releases/tag/v0.4.1 Either use
If you then call g10k with this config file and have a corrupted local Git repository, g10k deletes the local cache and retries the Git clone command once:
|
Hi @xorpaul Thanks very much for implementing the feature. However, it does not seem to be working in the expected way:
|
Hi @alexharv074, ah, sorry, I forgot to add the new CLI parameter to the Puppetfile mode. Please try:
|
I am very happy to say it's working! Before:
Install new version:
After:
And corrupted cache and all, it still copied 52 modules in 7.9 seconds! Thanks so much Andreas, there will be many happy customers at my site, and best of all, I feel confident to roll out g10k at my next place! |
Glad I could help! Did you censor your output?
|
No, I did redact sensitive information using search & replace to update the Git server address, and site-specific info in the Git URL, but the output I showed is otherwise unchanged. To be honest, I was about to see if I could send in a pull request to improve the wording of the error message, but sounds like maybe it's still not behaving the way you expected it to? |
g10k should print a warning that the git command failed and that it retries the git clone command: https://github.com/xorpaul/g10k/blob/master/git.go#L117 Maybe the progress bar at the default verbosity level caused this line to be skipped for you. Can you retry using the |
You are correct:
|
Alright then. |
I'll update the output in the next release, so that only the retrying line gets printed, when |
From time to time I find that the g10k cache becomes corrupted and I am forced to delete it. This is a big problem in production and ultimately may mean I can't use g10k in production. A recent example was a failure like this: