Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue cloning large repositories #301

Closed
jantman opened this issue Jun 25, 2015 · 17 comments
Closed

Issue cloning large repositories #301

jantman opened this issue Jun 25, 2015 · 17 comments

Comments

@jantman
Copy link

jantman commented Jun 25, 2015

GitPython seems to break on large repositories. I've hacked a bit of additional output into the assertion at https://github.com/gitpython-developers/GitPython/blob/master/git/remote.py#L576 :

        # NOTE: We assume to fetch at least enough progress lines to allow matching each fetch head line with it.
        assert len(fetch_info_lines) >= len(fetch_head_info), "%d <= %d - len(fetch_head_info) <= len(fetch_info_lines) : len(%s) <= len(%s)" % (
            len(fetch_head_info),
            len(fetch_info_lines),
            fetch_head_info,
            fetch_info_lines)

When I run git.Repo.clone_from() or .fetch(), it produces:

[DEBUG reconcile_git_repos.py:95 -       clone_or_fetch() ] Fetching origin in /localfs/users/jantman/B-07815/storyville2.git
Traceback (most recent call last):
  File "./reconcile_git_repos.py", line 168, in <module>
    args = parse_args(sys.argv[1:])
  File "./reconcile_git_repos.py", line 75, in run
    self.clone_or_fetch(path, url, fetch=fetch)
  File "./reconcile_git_repos.py", line 96, in clone_or_fetch
    except git.exc.NoSuchPathError:
  File "/home/jantman/tickets/B-07815/lib/python2.7/site-packages/git/remote.py", line 654, in fetch
    res = self._get_fetch_info_from_stderr(proc, progress or RemoteProgress())
  File "/home/jantman/tickets/B-07815/lib/python2.7/site-packages/git/remote.py", line 588, in _get_fetch_info_from_stderr
    fetch_info_lines)
AssertionError: 1944 <= 1716 : len([u"c415f018

... and then an amazingly long line of output.

This appears to happen with very large repositories; I honestly can't tell if the issue here is commit messages or the repository size - the one I used for testing has (yeah..... :( ) 4,789 branches.

@jantman
Copy link
Author

jantman commented Jun 25, 2015

Updated issue - fixed formatting, and corrected description, this does not get better after multiple fetches. I have at least one repo that GitPython simply cannot fetch.

@Byron Byron added this to the v1.0.2 - Fixes milestone Jun 26, 2015
Byron added a commit that referenced this issue Jun 26, 2015
Write debug files for stderr output as well as FETCH_INFO into
the git repository, e.g. into `.git/`

Help with #301
@Byron
Copy link
Member

Byron commented Jun 26, 2015

The issue described looks very much like #232, which was fixed back in the days. Maybe, the fix didn't work after all.

What happens is that GitPython receives output from stderr, line by line, and from the FETCH_INFO file git leaves on disk. The information of both data sources is merged to obtain all the branch information we seek to provide.

If these data sources do not match up for some reason, the assertion will fail.

Assuming you are already using the latest version on pypi or GitHub, we could try to allow me to reproduce the issue. To do that, you can run your code with GitPython on this branch which leaves debug files in the .git/ directory of the respective repo. These you could provide by some means so I can integrate them into the test-suite and fix the issue.

Thank you.

Byron added a commit that referenced this issue Jun 26, 2015
Write debug files for stderr output as well as FETCH_INFO into
the git repository, e.g. into `.git/`

Help with #301
@Byron
Copy link
Member

Byron commented Jun 26, 2015

By the way: a quick workaround should be to leave GitPython out of the equation and just run repo.git.fetch(). That way, you will not have any branch information, but the fetch will definitely work.
The Repo.clone_from() operation can also be approximated by using the Git type directly.

@jantman
Copy link
Author

jantman commented Jun 26, 2015

@Byron thanks so much for the quick reply on this. This is happening with 1.0.1 from pypi (I wish I could say I was running one of the old RCs and the upgrade fixed it, but no such luck).

So unfortunately this is a private, internal repo, and given some of the commit messages that I've seen, I don't feel comfortable providing... anything... from this repo externally.

However, I've created a test repository at https://github.com/jantman/gitpython_issue_301 with 5,000 randomly-generated commits in it, each in its own branch. Very interestingly, I've been able to reproduce the issue, but only when using SSH as a transport; when using HTTPS, it works fine.

Code to reproduce:

#!/usr/bin/env python

import git
"""
# this works
url = 'https://github.com/jantman/gitpython_issue_301.git'
"""
# this doesn't
url = '[email protected]:jantman/gitpython_issue_301.git'
path = 'testrepo'
print("cloning")
repo = git.Repo.clone_from(url, path)
print("fetching")
repo.remotes.origin.fetch()
print("done")

The failure happens during the fetch. I've added you as a collaborator on the repo so you can test if you want.

@Byron
Copy link
Member

Byron commented Jun 26, 2015

You can watch the development stream on youtube.

GitPython #12 [issue 301 - investigate fetching repos with huge amount of branches]

thumb

@jantman
Copy link
Author

jantman commented Jun 30, 2015

@Byron That was actually really interesting to watch... it's the first time I've seen a livestream of development work, let alone the thought process of working through an issue I opened.

@Byron
Copy link
Member

Byron commented Jul 1, 2015

Thank you ! And wait until you have seen a good one of that kind !

PS: I have the issue in my inbox and will get to it eventually. A few things piled up.

Byron added a commit that referenced this issue Jul 3, 2015
This allows us to use the main thread to parse stderr to get progress,
and resolve assertion failures hopefully once and for all.

Relates to #301
Byron added a commit that referenced this issue Jul 3, 2015
@Byron
Copy link
Member

Byron commented Jul 3, 2015

Alright, a fix was pushed, and I could verify that the provided means of reproduction do not fail anymore.
Travis is now testing for it as well.

Please let me know if it works for you as well.

@Byron
Copy link
Member

Byron commented Jul 3, 2015

You can watch the development stream on youtube.

GitPython #15 [issue 301 finally fixed]

thumb

@jantman
Copy link
Author

jantman commented Jul 4, 2015

Wow! I watched the second recording (the first one seems to have been deleted?)... quite a tricky issue. Thanks sooo much for giving this so much attention!

Fix is confirmed for me, using both the test repo I created on GitHub (which I've now deleted, since you created https://github.com/gitpython-developers/gitpython_issue_301 ) and using the original internal repositories that I found the problem with. Also, I don't have any numbers to back this up, but subjectively, it seems to run quite a bit faster.

I've used GitPython quite a bit (though usually only for one-off scripts, nothing that really runs on a regular basis) and REALLY appreciate the work you're doing getting the project caught up. If there's anything I can do to help, I've got a lot on my plate, but feel free to ask.

@Byron
Copy link
Member

Byron commented Jul 4, 2015

Indeed, video 14 seems to be broken - youtube was unable to convert the video file, maybe the upload was corrupted somehow. In a way it's good, as it shows me trying to figure out what's so special about this particular way of fetching branches, after all it worked fine in my test which spawned a process printing the branch information to stdout and stderr.

The speedup might be due to the improved line parsing, which is now handled by C instead of pure python. The fix was to only open stderr of the git fetch process, which now allows to just call readlines() on it.

Thanks for your help in writing excellent issues, the time you spent to help reproducing them is much appreciated as well !

@ianwestcott
Copy link

This does not seem to be fixed in HEAD.

I ran into this issue with GitPython 1.0.1, and while troubleshooting I installed the package directly from the HEAD of this repo to see if that would fix the problem. The output of the error changes in my case, but the error itself does not – in 1.0.1 I would get the contents of each list in the error, whereas now I only get len(XXX) <= len(XXX).

The repo I'm acting on is private, so some details are necessarily excluded. In summary: when running .fetch(), I get the same AssertionError in _get_fetch_info_from_stderr(). In my case, one side of the assertion consistently contains 324 items, and the other side contains a maximum of 111, but sometimes other counts (when running it multiple times I got 59, 58, and even 7). Also, the call occasionally succeeds without throwing the AssertionError at all.

In the detailed output that was produced by 1.0.1, what I see are lists of branches. The formatting is noticeably different between the lists on either side of the assertion. One side looks like this:

u"REDACTED_REF\tnot-for-merge\tbranch 'REDACTED_BRANCH_NAME' of github.com:REDACTED_REPO_PATH\n"

Whereas the other side looks like this:

u' = [up to date]      REDACTED_BRANCH_NAME -> origin/REDACTED_BRANCH_NAME'

In the example from which I drew this output, the latter side was the smaller one. Also, it's worth noting that the branch names in both lists are sorted alphabetically, and that the latter list is noticeably truncated at a seemingly arbitrary point. Could this be an API pagination issue, or something similar?

Let me know if there is anything I can do to help troubleshoot further. Thanks!

@Byron
Copy link
Member

Byron commented Oct 15, 2015

@ianwestcott Thanks for chiming in on this one, and for offering your help to do additional troubleshooting. What's happening is that git will provide real-time information about the fetched heads on stderr, while finally writing additional, complementary information about them into the FETCH_HEAD file. Both data sources have to be matched line-by-line to help GitPython assemble all the information about the fetch operation.

In the past, this was problematic as getting all lines from stderr could result in GitPython deadlocking. By now, this operation has been reduced to a simple, synchronous read as in for line in proc.stderr.readlines():. The example provided by you seems to indicate non-deterministic behaviour - it rarely works, but often doesn't - which with that code should not be possible anymore.

Would that mean that readlines() cannot be trusted ? Considering the issue is fixed for @jantman, maybe something else is different for you. Something worth noting would be the python version and OS you are using.

@ianwestcott
Copy link

Hi @Byron,

Ah, I suppose that info would be helpful, yes. I'm using Python 2.7.10 (installed via Homebrew) in OS X El Capitan (10.11). Let me know how I can be of further assistance in troubleshooting this. I can also open a new issue if you'd prefer not to clutter this one further. Thanks!

@Byron
Copy link
Member

Byron commented Oct 17, 2015

@ianwestcott A new issue would be great ! My primary goal would be to reproduce the issue you are seeing with the latest HEAD. That might be involve some work on your part I suppose, but without a way to reproduce the issue, there can be no fix.
Another option would be if you tried a few things locally, maybe with my guidance, which would allow you to work on the private repository that shows these issues.
To my mind, it seems odd that proc.stderr.readlines() shows non-deterministic behaviour, it simply makes no sense to me ... :(.

@ianwestcott
Copy link

To follow up on this:

My issue persisted for a long time, and I never got to the bottom of it. I learned that the problem is specific to OS X El Capitan (10.11), and that it seems to be related to how git fetch --verbose writes to stderr, and the capturing of that output. It can be reproduced in El Cap by running the following commands within a repo that has many branches:

# this will output a complete list of refs
git fetch --verbose 2>&1
# this will mysteriously output a truncated list of refs
git fetch --verbose 2>&1 | cat

I could reproduce the issue regardless of the version of git installed in El Cap (I tested the Mac system version, homebrew's version, and the latest release compiled from source).

Anyway, the good news is that while I never figured out the root cause of the truncated output, I am pleased to report that it looks like GitPython no longer throws an AssertionError as a result of it, probably due to the fix in #442. I have not had a problem since upgrading to GitPython 2.0.6. Which is a huge relief, because I was stumped at how to further troubleshoot the issue with git. :)

@Byron
Copy link
Member

Byron commented Jul 17, 2016

@ianwestcott Thanks for the write-up ! It's incredible to see such behaviour can actually exist, and even does so with the plain command-line program. I myself didn't experience such behaviour yet, and am puzzled as what's causing is, or how to fix it.
However, it's great to hear at least some issue was resolved for you ... that assertion certainly was the most expensive one I have ever written (considering the amount of issues it spawned), good it's gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants