Occasional issues with file already closed errors #77

Closed
andrewfraley opened this issue Oct 11, 2017 · 24 comments

Comments

@andrewfraley

I sometimes see this issue:

unTar(): error while tar reader.Next() for io.Reader read |0: file already closed

The script that runs g10k automatically clears the cache whenever the last run exited with an error, so the next run is usually fine. I only see this happen after changing the Puppetfile.

@xorpaul
Owner

xorpaul commented Oct 11, 2017

This looks like a race condition during the extraction of multiple Forge archives.

Can you provide the Puppetfile (Forge modules only), or at least tell me how many Forge modules this Puppetfile is using?

@andrewfraley
Author

andrewfraley commented Oct 11, 2017

I'm actually not using any Forge modules; all modules are hosted in a self-hosted GitHub Enterprise instance. Other data points that might be useful:

  • git URLs are all HTTPS
  • all modules use either :branch => or :tag =>

@baurmatt

Having the same problems. We're also using Git only; some URLs are HTTPS and some are SSH/Git. I've added some really simple debug code:

mbaur@mbaur-g10k:~/g10k-source$ git diff
diff --git a/forge.go b/forge.go
index a9eba6a..b6cb9d6 100644
--- a/forge.go
+++ b/forge.go
@@ -303,7 +303,7 @@ func unTar(r io.Reader, targetBaseDir string) {
                        if err == io.EOF {
                                break
                        }
-                       Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error())
+                       Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error() + targetBaseDir)
                }

                // get the individual filename and extract to the current directory

The targetBaseDir path shown when the error occurred is the path to a Git repository, so there is definitely something wrong here.
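
For anyone following along, this is roughly the shape of that loop when pulled out on its own (a simplified sketch using only the standard library, not the actual forge.go code; it just shows where reader.Next() surfaces errors like "unexpected EOF" or a closed pipe instead of a clean io.EOF):

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// unTarSketch mirrors the shape of g10k's unTar(): read tar entries from r until
// a clean io.EOF. If the process feeding r (e.g. git archive) dies or its pipe is
// closed early, reader.Next() returns an error such as "unexpected EOF" or
// "read |0: file already closed" instead of io.EOF.
func unTarSketch(r io.Reader, targetBaseDir string) {
	reader := tar.NewReader(r)
	for {
		header, err := reader.Next()
		if err == io.EOF {
			break // clean end of archive
		}
		if err != nil {
			log.Fatalf("unTar(): error while tar reader.Next() for io.Reader %v %s", err, targetBaseDir)
		}
		fmt.Println("would extract", header.Name, "below", targetBaseDir)
	}
}

func main() {
	// A truncated stream (half a tar header block) stands in for a git archive
	// that died mid-write; Next() reports this as an error rather than io.EOF.
	unTarSketch(bytes.NewReader(make([]byte, 100)), "/tmp/example-target")
}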

If I can help debug this further, please let me know.

@xorpaul
Owner

xorpaul commented Oct 17, 2017

@baurmatt I also added the targetDir to the error message locally.
I still suspect this is some kind of race condition, but the only things that would help me debug it are being able to reproduce the bug myself or getting the debug output of a failed g10k run.

@andrewfraley
Author

@xorpaul Thanks, I'll try to capture a failed run later today.

@baurmatt

baurmatt commented Nov 1, 2017

@andrewfraley Could you reproduce the issue in your setup?

Some information about our setup:

  • All are Git repositories
  • Most get cloned from GitHub
  • 5 get cloned from our internal GitLab server via Git/SSH
  • The error only seems to happen with one of our internal repositories. It is about 360MB in size and contains some .war/.tar.gz/.jar/... files

Still trying to get the debug log through our security team.

@baurmatt

baurmatt commented Nov 2, 2017

I was finally able to reliably reproduce this. The problem seems to be caused by some binary files (please don't ask...) that we have in our repository.

mbaur@mbaur-g10k:~$ for i in {1..10}; do ./g10k -config /home/mbaur/g10k.yaml; rm -rf cache/* environments/*; done
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF

I've uploaded all the data to https://github.com/baurmatt/g10k-77-control and https://github.com/baurmatt/g10k-77-other. Hope this helps to finally find the issue.

@andrewfraley
Author

I'm having trouble reproducing the error when I'm not on our production Puppet servers, but I know I've seen it happen when testing in a Vagrant VM. I'll keep trying. I don't think we have any repos with binary data (at least not larger than a few KB).

@andrewfraley
Author

andrewfraley commented Nov 2, 2017

OK, I'm able to reproduce the error, but it never happens on the same module, so I think this is some sort of race condition based on the performance of the underlying disks and of the remote Git server. To reliably make it happen I have to remove the cache directory and the target modules directory, so it's a full sync from scratch (exact commands below). I can only make it happen on one of my real servers, which has a much faster connection to the Git server but much slower disks than my workstation inside a Vagrant instance (the workstation has a slow connection to the Git server but fast SSD storage; the server has a fast connection to the Git server but pretty slow disks).

2017/11/02 17:55:39 DEBUG executeCommand(): Executing git --git-dir /tmp/g10k-test/modules/https-__redacted.github.enterprise.server_redacted_git_repo.git rev-parse --verify 'v1.1.0'
unTar(): error while tar reader.Next() for io.Reader read |0: file already closed

I should add that the total disk space used by all modules in all environments is 650MB, so we are talking about writing a huge number of small files in a very short amount of time.
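
To spell out the repro: before each run I wipe both directories, much like @baurmatt's loop above does (the paths here are placeholders for whatever cache and module directories your g10k.yaml points at):

rm -rf cache/* environments/*
./g10k -config g10k.yaml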

Hope this helps.

@andrewfraley
Author

Alright, I have managed to reliably reproduce the error in a VirtualBox VM by restricting the disk bandwidth.

Steps to limit bandwidth on a VirtualBox disk:

VBoxManage bandwidthctl "vmname" add "Limit" --type disk --limit 10
VBoxManage storageattach "vmname" --storagectl "SATA Controller" --port 0 --device 0 --type hdd --medium "name_of_disk_file.vmdk" --bandwidthgroup Limit

You can change the limit while the machine is running with:
VBoxManage bandwidthctl "vmname" set Limit --limit 20M

For me the errors start at a 50M limit.

More info here: https://www.virtualbox.org/manual/ch05.html#storage-bandwidth-limit

@xorpaul
Owner

xorpaul commented Nov 3, 2017

Interesting.

Are you running g10k inside this restricted VM or the test Git server?

As in #76 (comment), you can also try limiting the number of parallel checkouts and pulls with the -maxworker parameter.

@baurmatt

baurmatt commented Nov 3, 2017

We're running g10k in a Virtuozzo container with Ubuntu 16.04 on an HP server with SAS HDDs. I just tried lowering the max worker count to 1, but it didn't help at all :/

mbaur@mbaur-g10k:~$ ./g10k -maxworker 1 -config /home/mbaur/g10k.yaml
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF

The repository checkout seems to be fine:

mbaur@mbaur-g10k:~/cache/r10k/modules/[email protected]_g10k-77-other.git$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (18/18), done.

@andrewfraley
Author

andrewfraley commented Nov 3, 2017

Yes, g10k is running inside the restricted VM.

Is maxworker supposed to restrict the number of concurrent git commands? Even with -maxworker 1 I still see hundreds of active git processes while g10k is running.

I think I've solved it, though: I'm actually bumping up against the nproc limit. If I restrict the available disk bandwidth down to 5MB/s, I start getting g10k crashes with errors about resource not available, so half the time I get a g10k crash with a big Go dump and the other half I get the unTar error. I think it's just random whether it's git or g10k itself that can't create a new thread.

After increasing the nproc limit for the user to 9000 the errors stopped happening, even with the disk restricted to 5MB/s.

On CentOS for a user "puppet", add to /etc/security/limits.d/90-nproc.conf:
puppet soft nproc 9000
(you have to log out and back in for it to take effect)
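
After logging back in as that user, you can check that the new limit is active; ulimit -u prints the max user processes limit and should now report 9000:

ulimit -u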

@xorpaul
Owner

xorpaul commented Nov 3, 2017

@baurmatt Are you using the latest g10k release? Which git version are you using? Ubuntu 16.04 should be using a somewhat recent version of git though.

I can't reproduce the issue with your test control repo, but I can only test it on my workstation with plenty of RAM, a fast CPU and an SSD. If you could switch all your Puppetfile GitHub modules to use https://github.com:baurmatt/g10k-77-other.git instead of [email protected]:baurmatt/g10k-77-other.git, then I can also test it on a slower VM.

The problem is that with that test setup you're triggering about 70 git archive processes in parallel, all of which pump out a lot of data to unTar().

And as @andrewfraley already found out: the -maxworker setting was always meant to limit the number of Git and Forge module resolving processes, not the number of local git archive or Forge archive extracting processes.

The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck, which for me was the number of open file handles the g10k user is allowed to have.

I'll add a section to the readme that suggests increasing the default limits (nofile, nproc) before running g10k with a large Puppet environment.

Would a parameter that limits the local extracting processes help, something like -maxextractingworkers?
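
Just to sketch what I have in mind (illustrative only, not g10k's actual code): the extraction Goroutines could be throttled with a buffered-channel semaphore whose size comes from such a parameter:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// maxExtractWorker would come from a flag or config setting like the one
	// discussed above; 4 is just an example value.
	maxExtractWorker := 4
	sem := make(chan struct{}, maxExtractWorker)
	var wg sync.WaitGroup

	modules := []string{"module-a", "module-b", "module-c", "module-d", "module-e"}
	for _, m := range modules {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // blocks once maxExtractWorker extractions are in flight
			defer func() { <-sem }() // free the slot when this extraction finishes
			// the real code would run the git archive / unTar() pipeline here
			fmt.Println("extracting", name)
		}(m)
	}
	wg.Wait()
}

That way the number of parallel module resolves (-maxworker) and the number of parallel extractions stay independently tunable.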

@andrewfraley
Author

A parameter to limit extracting processes would definitely be helpful; I think that would also solve another issue I have with g10k pegging the CPUs while running. I don't mind if g10k takes 30s to finish instead of 3s, I just want it to finish within 60s. Thanks!

@baurmatt

baurmatt commented Nov 3, 2017

I'm using g10k 0.3.14 and git 2.7.4.

As you suggested, I just switched the module URLs to HTTPS. For me the error still occurs.

I've also just migrated my Virtuozzo container to a hardware server with SSD storage; the error occurs there as well.

@xorpaul
Owner

xorpaul commented Nov 3, 2017

Have you tried increasing the security limits nproc and nofile for your g10k user?

@baurmatt

baurmatt commented Nov 3, 2017

If I understand it correctly, my limits are already really high:

mbaur@mbaur-g10k:~$ ulimit -Sn
1024000
mbaur@mbaur-g10k:~$ ulimit -Hn
1024000
mbaur@mbaur-g10k:~$ ulimit -Su
514893
mbaur@mbaur-g10k:~$ ulimit -Hu
514893

@baurmatt

baurmatt commented Nov 3, 2017

I've found the issue...

Nov  3 15:59:05 server.example.org kernel: [821998.807769] Out of memory in UB 290422522: OOM killed process 331832 (git) score 0 vm:53000kB, rss:32224kB, swap:0kB

Upgrading my container from 2GB to 4GB of RAM solved the problem and g10k is running fine. So I guess the -maxextractingworkers parameter would still be nice to reduce the RAM/CPU consumption.

@xorpaul
Owner

xorpaul commented Nov 3, 2017

@baurmatt 😏
That's exactly what I meant with:

The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck

I'm currently working on that -maxextractworker parameter.

@xorpaul
Owner

xorpaul commented Nov 7, 2017

Try out the new v0.4 release:

https://github.com/xorpaul/g10k/releases/tag/v0.4

You can limit the number of Goroutines with the -maxextractworker parameter or with the maxextractworker: <INT> g10k config setting.
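
For example (the value 10 is just an illustration):

./g10k -config /home/mbaur/g10k.yaml -maxextractworker 10

or in your g10k config file:

maxextractworker: 10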

@baurmatt

Works great, thanks! :) As I can no longer reproduce the error, I think this issue can be closed.

@xorpaul
Owner

xorpaul commented Nov 20, 2017

Glad to hear! 😏

@xorpaul xorpaul closed this as completed Nov 20, 2017
@andrewfraley
Author

@xorpaul working great for me as well. Dramatically reduced CPU load and no more errors. Using -maxextractworker 10 -maxworker 10 on an 8 core machine. Thanks!
