LFS: Cloning objects / batch not found #8273
I've run some more tests. After compiling the version of commit dbd0a2e Fix LFS Locks over SSH (#6999) (#7223) the error appears. The LFS data is large (approximately 10 GB). One commit before (7697a28) everything works perfectly. I've tried to disable the SSH server, but this doesn't change anything. @zeripath Let me know if you need more information. |
Here you can see the debug log output when the error occurs: PANIC:: runtime error: invalid memory address or nil pointer dereference,
|
I suppose that Gitea is exceeding the number of local socket connections permitted by the OS. Failure: cannot assign requested address. See also the explanation and possible solution here: Where could I change the setting MaxIdleConnsPerHost and other LFS server settings to make further tests? |
BTW: The error PANIC:: runtime error: invalid memory address or nil pointer dereference does not always appear in the log output. Sometimes the server and client just hang. |
@lunny Who could help to isolate this bug? Is there any Gitea programmer who could support us? I am willing to make more tests but I need some hints. |
@m-a-v: There is also a setting:
which will affect the transfer probably, nevertheless it should not crash the server... |
Another interesting read: https://www.fromdual.com/huge-amount-of-time-wait-connections
|
The problem seems to be the huge amount of connections for the Get request (more than 10k connections for one single client!). See also here: https://medium.com/@valyala/net-http-client-has-the-following-additional-limitations-318ac870ce9d. |
@m-a-v I've been very busy doing other things for a while so have been away from Gitea. I'll take a look at this. I think you're on the right trail with the number of connections thing. IIRC there's another person who had a similar issue. |
@m-a-v I can't understand why dbd0a2e should break things, but I'll double check. Maybe it's possible the request body isn't being closed or something stupid like that. That would cause a leak if so and could explain the issue. The other possibility is that dbd0a2e has nothing to do with things and it's a Heisenbug relating to the number of connections thing. |
OK, so all these calls to ReadCloser() don't Close(): Line 330 in 57b0d9a
Line 437 in 57b0d9a
Line 456 in 57b0d9a
Whether that's the cause of your bug is another question - however, it would fit with dbd0a2e causing more issues because suddenly you get a lot more calls to unpack. These should be closed so I guess that's at least a starting point for attempting to fix this. (If I find anything else I will update this.) |
@zeripath Thanks a lot. It may take some time until I can test it, but I certainly will. |
It's actually been merged into the 1.10 and 1.9 branches already. |
I've tested it again with 1.10 and it seems that the described LFS bug has been solved, or at least the fix made the error disappear for this specific scenario. Before @zeripath's fix we had more than 10k connections in the TIME_WAIT state. Now there are still approximately 3.5k connections in the TIME_WAIT state. I assume that if multiple clients access the LFS server, the same problem could still occur. Any idea how to improve this? Are there other possible leaks? I assume that a connection which closes will not remain in a TIME_WAIT state. Can anyone confirm this? |
Hi @m-a-v, I guess this means that I must have missed some others. Is there any way of checking that they're all LFS connections? |
Indirectly, yes. I had only one active client. Before LFS checkout I had two connections on the MariaDB database server instance. During LFS checkout about 3.5k connections and then some minutes later again 2 connections. This article could be interesting: |
LFS checkout causes 3.5K connections?! How many LFS objects do you have? |
12k LFS objects. |
@zeripath Any connections that Gitea leaves open should remain in either |
Could it be that git lfs on the client is also leaking connections? |
That would be either
|
I think the problem is more the following: "Your problem is that you are not reusing your MySQL connections within your app but instead you are creating a new connection every time you want to run an SQL query. This involves not only setting up a TCP connection, but then also passing authentication credentials across it. And this is happening for every query (or at least every front-end web request) and it's wasteful and time consuming." I think this would also speed up Gitea's LFS server a lot. source: https://serverfault.com/questions/478691/avoid-time-wait-connections |
AHA! Excellent! Well done for finding that! |
OK We do recycle connections. We use the underlying go sql connection pool. For MySQL there are the following in the
https://docs.gitea.io/en-us/config-cheat-sheet/#database-database I think
|
I think what you need to do is tune those variables better. I think our defaults are highly likely to be incorrect - however, I think they were set to this because of other users complaining of problems. I suspect that MAX_IDLE_CONNECTIONS being set to 0 happened before we adjusted CONN_MAX_LIFETIME and it could be that we could be more generous with both of these. I.e. something like MAX_IDLE_CONNECTIONS 10 and CONN_MAX_LIFETIME 15m would work. |
I could test it again with the repo. Which branch should I take? Which parameters (I've seen that discussions continued)? |
Did you also fix this? |
I have made several experiments with the currently running Gitea server (v1.7.4) and with the new version (v1.9.5). The netstat snapshots were created at the peak of the number of open connections. Version 1.7.4
Version 1.9.5 (and same default settings as with 1.7.4)
Version 1.9.5 (CONN_MAX_LIFETIME = 45s, MAX_IDLE_CONNS = 10, MAX_OPEN_CONNS = 10)
With both configurations the LFS server has far too many open connections. So I think we still have serious problems with large LFS repos.
The clone process just freezes at a certain percentage (as soon as there are too many connections). I think this bug should be reopened. |
master (CONN_MAX_LIFETIME = 45s, MAX_IDLE_CONNS = 10, MAX_OPEN_CONNS = 10)
The checkout succeeds but still many used connections remain in TIME_WAIT status. If multiple clients would access the LFS server it could not handle it. |
Your max lifetime is probably too low, 45s seems aggressive. Are you sure all of those connections are db connections? Lots of http connections will be made when dealing with lots of lfs objects. (There's probably some more efficiencies we can do.) If they're all db then multiple users won't change it - you're likely at your max as it should be mathematically determinable:
Total connections: C = open + idle + timewait
If max open = max idle:
dC/dt = dO/dt + dW/dt
max dO/dt = 0 (as it's fixed)
max dW/dt = max_o/max_l - W/max_tw
dC/dt is positive around C = 0, therefore dC/dt = 0 should represent the max for positive C and thence maximize W:
max_W = max_tw * max_o / max_l
If they're all db then you have a very long max tw or I've messed up in my maths somewhere. You can set your time_wait at a server network stack level. |
I've chosen the 45 seconds from the discussion between you and @guillep2k in #8528. How are the connections reused? Where is this made in the code? I assume after a connection is closed it will go in the TIME_WAIT state. I don't know if all are db connections. Why did it work with 1.7.4 almost perfectly (see above)? |
This could be interesting: "Probably the best option, if it's doable: refactor your protocol so that connections that are finished aren't closed, but go into an "idle" state so they can be re-used later, instead of opening up a new connection (like HTTP keep-alive)." "Setting SO_REUSEADDR on the client side doesn't help the server side unless it also sets SO_REUSEADDR" |
@zeripath @m-a-v It must be noticed that not all @m-a-v it would be cool if you'd break your statistics down by listening port number. |
I don't think But |
@guillep2k What do you exactly mean with "it would be cool if you'd break your statistics down by listening port number"? tcp_fin_timeout is set to 60 seconds on my system. Ubuntu 18.04 LTS standard configuration. The question still remains. Why did it work perfectly with 1.7.4 (and earlier) and now anymore? |
|
I don't know, I'd need to check the code. The important thing is that it's taken care of now. 😁 |
I meant "and now not anymore". |
I meant it's now solved by properly handling @m-a-v If you want to investigate what's the specific change between 1.7.4 and 1.9.5 that caused this, I'd be interested in learning about your results. |
On 1.7.4 (9f33aa6) I also had lots of connections at the peak when cloning:
$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
1 Foreign
1 established)
5 LISTEN
10 ESTABLISHED
8599 TIME_WAIT
$ netstat -ant | grep TIME_WAIT | awk '{print $5 " " $6}' | cut -d: -f2 | sort | uniq -c
66
Suddenly the client hangs at 97%. |
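One way to break such a snapshot down by port, as requested earlier in the thread, is sketched below in Go. It attributes each TIME_WAIT socket to a known service port on either the local or the foreign side. The port numbers (3000 for HTTP, 3306 for MySQL/MariaDB) and the sample lines are assumptions for illustration, not data from this issue, and `timeWaitByService` is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// portOf returns the part after the last ':' of an address such as
// "127.0.0.1:3000" or ":::3000", or "" if there is no colon.
func portOf(addr string) string {
	idx := strings.LastIndex(addr, ":")
	if idx < 0 {
		return ""
	}
	return addr[idx+1:]
}

// timeWaitByService counts TIME_WAIT lines of `netstat -ant` output per
// service label, matching the local port first and then the foreign port.
func timeWaitByService(netstatOutput string, services map[string]string) map[string]int {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(netstatOutput))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Columns: Proto Recv-Q Send-Q Local Foreign State
		if len(fields) < 6 || fields[5] != "TIME_WAIT" {
			continue
		}
		if label, ok := services[portOf(fields[3])]; ok {
			counts[label]++
		} else if label, ok := services[portOf(fields[4])]; ok {
			counts[label]++
		} else {
			counts["other"]++
		}
	}
	return counts
}

func main() {
	// Made-up sample lines in netstat -ant format.
	sample := `tcp 0 0 127.0.0.1:3000 127.0.0.1:51234 TIME_WAIT
tcp 0 0 127.0.0.1:51300 127.0.0.1:3306 TIME_WAIT
tcp 0 0 127.0.0.1:3000 127.0.0.1:51236 TIME_WAIT
tcp 0 0 0.0.0.0:3000 0.0.0.0:* LISTEN`
	services := map[string]string{"3000": "http", "3306": "db"}
	counts := timeWaitByService(sample, services)
	fmt.Println("http:", counts["http"], "db:", counts["db"], "other:", counts["other"])
}
```

Splitting the counts this way shows whether the TIME_WAIT sockets belong to LFS HTTP traffic or to database connections, which is the open question in the surrounding comments.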
on 1.11.0+dev-563-gbcac7cb93: |
Description
When I upload a repo with LFS objects, the upload mostly works.
While cloning, the LFS smudge filter always stalls after some time (here at 58%), saying
After a night of debugging (updating successively through all versions with Docker),
we came to the conclusion that
Could it be that the following Submissions into 1.8.3 are problematic:
The hints/workarounds in the discussion below did not solve this issue:
https://discourse.gitea.io/t/solved-git-lfs-upload-repeats-infinitely/635/2
Hopefully this gets some attention, since it's a nasty LFS bug which almost turned us into apple crumble. 🍎