Occasional issues with file already closed errors #77

Closed
andrewfraley opened this issue Oct 11, 2017 · 24 comments

Comments

@andrewfraley

I sometimes see this issue:

unTar(): error while tar reader.Next() for io.Reader read |0: file already closed

The script that runs g10k automatically clears the cache whenever the last run exited with an error, so the next run is usually fine. I only see this happen after changing the Puppetfile.

@xorpaul
Owner

xorpaul commented Oct 11, 2017

This looks like a race condition during the extraction of multiple Forge archives.

Can you provide the Puppetfile (Forge modules only), or at least tell me how many Forge modules this Puppetfile is using?

@andrewfraley
Author

andrewfraley commented Oct 11, 2017

I'm actually not using any Forge modules; all modules are hosted in a self-hosted GitHub Enterprise instance. Other data points that might be useful:

  • git URLs are all HTTPS
  • all modules use either :branch => or :tag =>

@baurmatt

Having the same problems. We're also using Git only; some URLs are HTTPS and some are SSH/Git. I've added some really simple debug code:

mbaur@mbaur-g10k:~/g10k-source$ git diff
diff --git a/forge.go b/forge.go
index a9eba6a..b6cb9d6 100644
--- a/forge.go
+++ b/forge.go
@@ -303,7 +303,7 @@ func unTar(r io.Reader, targetBaseDir string) {
                        if err == io.EOF {
                                break
                        }
-                       Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error())
+                       Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error() + targetBaseDir)
                }

                // get the individual filename and extract to the current directory

The targetBaseDir path shown when the error occurred is the path to a Git repository, so there is definitely something wrong here.
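
For anyone following along, this is roughly the shape of that loop when pulled out on its own (a simplified sketch using only the standard library, not the actual forge.go code; it just shows where reader.Next() surfaces errors like "unexpected EOF" or a closed pipe instead of a clean io.EOF):

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// unTarSketch mirrors the shape of g10k's unTar(): read tar entries from r until
// a clean io.EOF. If the process feeding r (e.g. git archive) dies or its pipe is
// closed early, reader.Next() returns an error such as "unexpected EOF" or
// "read |0: file already closed" instead of io.EOF.
func unTarSketch(r io.Reader, targetBaseDir string) {
	reader := tar.NewReader(r)
	for {
		header, err := reader.Next()
		if err == io.EOF {
			break // clean end of archive
		}
		if err != nil {
			log.Fatalf("unTar(): error while tar reader.Next() for io.Reader %v %s", err, targetBaseDir)
		}
		fmt.Println("would extract", header.Name, "below", targetBaseDir)
	}
}

func main() {
	// A truncated stream (half a tar header block) stands in for a git archive
	// that died mid-write; Next() reports this as an error rather than io.EOF.
	unTarSketch(bytes.NewReader(make([]byte, 100)), "/tmp/example-target")
}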

If I can help debug this further, please let me know.

@xorpaul
Owner

xorpaul commented Oct 17, 2017

@baurmatt I also added the targetDir to the error message locally.
I still suspect this is some kind of race condition, but the only things that would help me debug it are being able to reproduce the bug myself or getting the debug output of a failed g10k run.

@andrewfraley
Author

@xorpaul Thanks, I'll try to capture a failed run later today.

@baurmatt

baurmatt commented Nov 1, 2017

@andrewfraley Could you reproduce the issue in your setup?

Some information about our setup:

  • All are Git repositories
  • Most get cloned from GitHub
  • 5 get cloned from our internal GitLab server via Git/SSH
  • The error only seems to happen with one of our internal repositories. It is about 360MB in size and contains some .war/.tar.gz/.jar/... files

Still trying to get the debug log through our security team.

@baurmatt

baurmatt commented Nov 2, 2017

I was finally able to reliably reproduce this. The problem seems to be caused by some binary files (please don't ask...) that we have in our repository.

mbaur@mbaur-g10k:~$ for i in {1..10}; do ./g10k -config /home/mbaur/g10k.yaml; rm -rf cache/* environments/*; done
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF

I've uploaded all the data to https://github.com/baurmatt/g10k-77-control and https://github.com/baurmatt/g10k-77-other. Hope this helps to finally find the issue.

@andrewfraley
Author

I'm having trouble reproducing the error when I'm not on our production Puppet servers, but I know I've seen it happen when testing in a Vagrant VM. I'll keep trying. I don't think we have any repos with binary data (at least not larger than a few KB).

@andrewfraley
Author

andrewfraley commented Nov 2, 2017

OK, I'm able to reproduce the error, but it never happens on the same module, so I think this is some sort of race condition based on the performance of the underlying disks and of the remote Git server. To reliably make it happen I have to remove the cache directory and the target modules directory, so it's a full sync from scratch (exact commands below). I can only make it happen on one of my real servers, which has a much faster connection to the Git server but much slower disks than my workstation inside a Vagrant instance (the workstation has a slow connection to the Git server but fast SSD storage; the server has a fast connection to the Git server but pretty slow disks).

2017/11/02 17:55:39 DEBUG executeCommand(): Executing git --git-dir /tmp/g10k-test/modules/https-__redacted.github.enterprise.server_redacted_git_repo.git rev-parse --verify 'v1.1.0'
unTar(): error while tar reader.Next() for io.Reader read |0: file already closed

I should add that the total disk space used by all modules in all environments is 650MB, so we are talking about writing a huge number of small files in a very short amount of time.
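
To spell out the repro: before each run I wipe both directories, much like @baurmatt's loop above does (the paths here are placeholders for whatever cache and module directories your g10k.yaml points at):

rm -rf cache/* environments/*
./g10k -config g10k.yaml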

Hope this helps.

@andrewfraley
Author

Alright, I have managed to reliably reproduce the error in a VirtualBox VM by restricting the disk bandwidth.

Steps to limit bandwidth on a VirtualBox disk:

VBoxManage bandwidthctl "vmname" add "Limit" --type disk --limit 10
VBoxManage storageattach "vmname" --storagectl "SATA Controller" --port 0 --device 0 --type hdd --medium "name_of_disk_file.vmdk" --bandwidthgroup Limit

You can change the limit while the machine is running with:
VBoxManage bandwidthctl "vmname" set Limit --limit 20M

For me the errors start at a 50M limit.

More info here: https://www.virtualbox.org/manual/ch05.html#storage-bandwidth-limit

@xorpaul
Owner

xorpaul commented Nov 3, 2017

Interesting.

Are you running g10k inside this restricted VM or the test Git server?

As in #76 (comment), you can also try limiting the number of parallel checkouts and pulls with the -maxworker parameter.

@baurmatt

baurmatt commented Nov 3, 2017

We're running g10k in a Virtuozzo container with Ubuntu 16.04 on an HP server with SAS HDDs. I just tried lowering the max worker count to 1, but it didn't help at all :/

mbaur@mbaur-g10k:~$ ./g10k -maxworker 1 -config /home/mbaur/g10k.yaml
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF

The repository checkout seems to be fine:

mbaur@mbaur-g10k:~/cache/r10k/modules/[email protected]_g10k-77-other.git$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (18/18), done.

@andrewfraley
Author

andrewfraley commented Nov 3, 2017

Yes, g10k is running inside the restricted VM.

Is maxworker supposed to restrict the number of concurrent git commands? Even with -maxworker 1 I still see hundreds of active git processes while g10k is running.

I think I've solved it, though: I'm actually bumping up against the nproc limit. If I restrict the available disk bandwidth down to 5MB/s, I start getting g10k crashes with errors about resource not available, so half the time I get a g10k crash with a big Go dump and the other half I get the unTar error. I think it's just random whether it's git or g10k itself that can't create a new thread.

After increasing the nproc limit for the user to 9000 the errors stopped happening, even with the disk restricted to 5MB/s.

On CentOS for a user "puppet", add to /etc/security/limits.d/90-nproc.conf:
puppet soft nproc 9000
(you have to log out and back in for it to take effect)
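
After logging back in as that user, you can check that the new limit is active; ulimit -u prints the max user processes limit and should now report 9000:

ulimit -u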

@xorpaul
Owner

xorpaul commented Nov 3, 2017

@baurmatt Are you using the latest g10k release? Which git version are you using? Ubuntu 16.04 should be using a somewhat recent version of git though.

I can't reproduce the issue with your test control repo, but I can only test it on my workstation with plenty of RAM, a fast CPU and an SSD. If you could switch all your Puppetfile GitHub modules to use https://github.com:baurmatt/g10k-77-other.git instead of [email protected]:baurmatt/g10k-77-other.git, then I can also test it on a slower VM.

The problem is that with that test setup you're triggering about 70 git archive processes in parallel, all of which pump out a lot of data to unTar().

And as @andrewfraley already found out: the -maxworker setting was always meant to limit the number of Git and Forge module resolving processes, not the number of local git archive or Forge archive extracting processes.

The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck, which for me was the number of open file handles the g10k user is allowed to have.

I'll add a section to the readme that suggests increasing the default limits (nofile, nproc) before running g10k with a large Puppet environment.

Would a parameter that limits the local extracting processes help, something like -maxextractingworkers?
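
Just to sketch what I have in mind (illustrative only, not g10k's actual code): the extraction Goroutines could be throttled with a buffered-channel semaphore whose size comes from such a parameter:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// maxExtractWorker would come from a flag or config setting like the one
	// discussed above; 4 is just an example value.
	maxExtractWorker := 4
	sem := make(chan struct{}, maxExtractWorker)
	var wg sync.WaitGroup

	modules := []string{"module-a", "module-b", "module-c", "module-d", "module-e"}
	for _, m := range modules {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // blocks once maxExtractWorker extractions are in flight
			defer func() { <-sem }() // free the slot when this extraction finishes
			// the real code would run the git archive / unTar() pipeline here
			fmt.Println("extracting", name)
		}(m)
	}
	wg.Wait()
}

That way the number of parallel module resolves (-maxworker) and the number of parallel extractions stay independently tunable.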

@andrewfraley
Author

A parameter to limit extracting processes would definitely be helpful; I think that would also solve another issue I have with g10k pegging the CPUs while running. I don't mind if g10k takes 30s to finish instead of 3s, I just want it to finish within 60s. Thanks!

@baurmatt

baurmatt commented Nov 3, 2017

I'm using g10k 0.3.14 and git 2.7.4.

As you suggested, I just switched the module URLs to HTTPS. For me the error still occurs.

I've also just migrated my Virtuozzo container to a hardware server with SSD storage; the error occurs there as well.

@xorpaul
Owner

xorpaul commented Nov 3, 2017

Have you tried increasing the security limits nproc and nofile for your g10k user?

@baurmatt

baurmatt commented Nov 3, 2017

If I understand it correctly, my limits are already really high:

mbaur@mbaur-g10k:~$ ulimit -Sn
1024000
mbaur@mbaur-g10k:~$ ulimit -Hn
1024000
mbaur@mbaur-g10k:~$ ulimit -Su
514893
mbaur@mbaur-g10k:~$ ulimit -Hu
514893

@baurmatt

baurmatt commented Nov 3, 2017

I've found the issue...

Nov  3 15:59:05 server.example.org kernel: [821998.807769] Out of memory in UB 290422522: OOM killed process 331832 (git) score 0 vm:53000kB, rss:32224kB, swap:0kB

Upgrading my container from 2GB to 4GB of RAM solved the problem and g10k is running fine. So I guess the -maxextractingworkers parameter would still be nice to reduce the RAM/CPU consumption.

@xorpaul
Owner

xorpaul commented Nov 3, 2017

@baurmatt 😏
That's exactly what I meant with:

The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck

I'm currently working on that -maxextractworker parameter.

@xorpaul
Owner

xorpaul commented Nov 7, 2017

Try out the new v0.4 release:

https://github.com/xorpaul/g10k/releases/tag/v0.4

You can limit the number of Goroutines with the -maxextractworker parameter or with the maxextractworker: <INT> g10k config setting.
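
For example (the value 10 is just an illustration):

./g10k -config /home/mbaur/g10k.yaml -maxextractworker 10

or in your g10k config file:

maxextractworker: 10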

@baurmatt

Works great, thanks! :) As I can no longer reproduce the error, I think this issue can be closed.

@xorpaul
Owner

xorpaul commented Nov 20, 2017

Glad to hear! 😏

@xorpaul xorpaul closed this as completed Nov 20, 2017
@andrewfraley
Author

@xorpaul working great for me as well. Dramatically reduced CPU load and no more errors. Using -maxextractworker 10 -maxworker 10 on an 8 core machine. Thanks!
