Occasional "fatal: unable to connect to localhost" on CI #1676

Open
EliahKagan opened this issue Sep 21, 2023 · 7 comments

Comments

@EliahKagan
Contributor

EliahKagan commented Sep 21, 2023

From time to time I get an error like this on CI:

FAILED test/test_base.py::TestBase::test_with_rw_remote_and_rw_repo - git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git ls-remote daemon_origin
  stderr: 'fatal: unable to connect to localhost:
localhost[0: ::1]: errno=Connection refused
localhost[1: 127.0.0.1]: errno=Connection refused
'

I'm not sure why this happens, but I see it every few days or so, on days I'm pushing lots of commits (more precisely: on days I run git push many times, since CI only runs on the tip of the pushed branch). It happens in my fork, in PRs here on this upstream repository (as in #1675 detailed above), and in pushes to this upstream repository (as in d1c1f31). Rerunning the CI check gets rid of it; that is, it doesn't recur. I don't think I've had this locally, but I don't run the tests locally as much as on CI.

Searching reveals that something like this, perhaps the exact thing, happened in #1119, so this may be known. I'm not sure if it's known to still be occurring occasionally, or if there is anything that can reasonably be done to make it happen even less often or not at all. (If it's not worth having an issue on this, then I don't mind this simply being closed.)

@EliahKagan
Contributor Author

As a brief update that I hope to flesh out more at some point: I suspect this is actually related to the problem that HIDE_WINDOWS_FREEZE_ERRORS is about on native Windows systems. The same tests, or at least one of them, seem affected. I suspect what's happening is that git-daemon is occasionally unresponsive on any platform (though it seems to be a CI issue, as I tried running this test 10,000 times on my Ubuntu system today and everything was fine), but that on some Windows systems the connection takes much longer to time out. This hunch, which could very well be wrong, is based on a wispy recollection of other issues on some Windows systems where network connections to unresponsive servers block for an extended time. I don't remember the details. To be clear, this is not something that Windows users would expect to experience regularly; I believe it's something specific that I am just not fully recalling.
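
One way to check this hunch outside the test suite would be to probe the daemon's port directly with an explicit timeout, recording per address whether the connection succeeds, is refused, or times out. The following is only a rough sketch: `probe_port` is a hypothetical helper, not something in GitPython, and the port is an assumption (9418 is git-daemon's default, but the test fixture may use a different one).

```python
import socket


def probe_port(host="localhost", port=9418, timeout=2.0):
    """Try every address that host resolves to and report the outcome for each.

    Hypothetical diagnostic helper. The default port is git-daemon's default
    (9418); the port used by the test fixture may differ.
    """
    results = []
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                # Bound the wait so an unresponsive server cannot block for long.
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((sockaddr[0], "connected"))
        except socket.timeout:
            results.append((sockaddr[0], "timed out"))
        except OSError as exc:
            # Includes "Connection refused", as in the stderr shown above.
            results.append((sockaddr[0], f"error: {exc}"))
    return results


if __name__ == "__main__":
    for address, status in probe_port():
        print(f"{address}: {status}")
```

A refusal on every address would match the stderr in the original report, while a timeout would point more toward the extended blocking suspected on Windows.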

@EliahKagan
Contributor Author

I think the way to proceed with this is to modify the one affected test so that, when it fails in this specific way, it retries several times. Since, as noted above, this situation may already be a cause of extended blocking on Windows (where the test is disabled by default and not currently run on CI), retrying should probably be done only on non-Windows systems.

This should achieve at least one of two things:

  • If the failures are random, the problem is effectively fixed, because failing on each try will be no more common than other unusual kinds of CI failures that tend not to persist (e.g. failing to check out the repository in the first place).
  • If the failures are not random, such that retrying often also fails, then we have learned something about the problem that may help to figure out the cause.
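
As a minimal sketch of the retry idea, assuming the failure can be recognized from the error text: the helper names below are hypothetical and not part of GitPython or its test suite, and the check relies on the exception exposing git's stderr through a `stderr` attribute or its string form, as the GitCommandError output above suggests.

```python
import sys
import time


def is_connection_refused(exc):
    """Return True if exc looks like the git-daemon failure from this issue.

    Assumption: the exception carries git's stderr text in a `stderr`
    attribute or in its string representation.
    """
    text = f"{getattr(exc, 'stderr', '')} {exc}"
    return "unable to connect to" in text and "Connection refused" in text


def retry_on_connection_refused(func, attempts=3, delay=0.5, platform=sys.platform):
    """Call func(), retrying only when it fails with the specific error.

    On Windows no retry is attempted, since each failed connection there
    may already block for an extended time.
    """
    if platform == "win32":
        attempts = 1
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise unrelated errors immediately, and the matching
            # error once the attempts are used up.
            if attempt == attempts or not is_connection_refused(exc):
                raise
            time.sleep(delay)
```

In the affected test, the call that fails (`git ls-remote daemon_origin` in the output above) would be wrapped in a zero-argument callable and passed as `func`. If the error is still raised after the final attempt, that would itself be evidence that the failures are not random.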

@Byron
Member

Byron commented Aug 16, 2024

Sounds good to me, thank you!

@EliahKagan
Contributor Author

I have not yet done #1676 (comment). But when I saw the most recent occurrence of this in #1989, I was reminded of the git-daemon connection problem that had affected gitoxide, discussed in GitoxideLabs/gitoxide#1726, described in GitoxideLabs/gitoxide#1726 (comment), and fixed in GitoxideLabs/gitoxide#1731. Could this somehow also be related to the problem of connections that have not been properly closed?

@Byron
Member

Byron commented Jan 5, 2025

That is a great point!

GitPython definitely does something similar, and it's likely to run into the same problems. However, the issue seems to happen here more often than in gitoxide, which (fortunately) is back to rock-solid CI runs that seemingly never fail outside of systematic failures which would affect everyone.

GitPython also uses the Git binary much more, which might have additional side-effects and latencies that aren't present in gitoxide tests.
