runtime/netpoll: randomly hanging Accept on macOS since go1.18 #54529
Comments
I would suggest taking a look at the Send-Q and Recv-Q of the TCP listener and checking whether they accumulate without being drained while the issue is happening. If so, it may indicate that Go netpoll fails to wake up the goroutine that holds the listener for accepting new sockets; if not, it might not be a bug in Go.
I've gone through the netpoll code roughly but didn't catch anything; maybe I missed something. I'll keep browsing the source code.
What's the best way to do so on macOS? Does netstat show this info? EDIT: The answer is yes; a simple netstat command shows it.
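Since the affected binaries run on end users' machines, one option (a sketch of my own, not something discussed above) is to have the process itself periodically shell out to netstat and log the listen-queue row for its port, so the numbers can be collected remotely. The -anL flags and the output format are assumptions about macOS netstat and should be verified locally:

```go
// Hedged sketch: periodically log macOS listen-queue counters for our port
// so remote users can send us the data. The `netstat -anL` invocation and
// its column layout are assumptions; verify them before shipping.
package main

import (
	"log"
	"os/exec"
	"strings"
	"time"
)

func watchListenQueue(port string) {
	for range time.Tick(time.Minute) {
		out, err := exec.Command("netstat", "-anL").Output()
		if err != nil {
			log.Printf("netstat: %v", err)
			continue
		}
		for _, line := range strings.Split(string(out), "\n") {
			// Keep only the row for our listener, e.g. "*.8080".
			if strings.Contains(line, "."+port) {
				log.Printf("listen queue: %s", strings.TrimSpace(line))
			}
		}
	}
}

func main() {
	// Hypothetical port for illustration; the real service differs.
	go watchListenQueue("8080")
	select {} // stand-in for the real server
}
```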
BTW, what's the output of
You may want to add a
Doesn't work here.
Then go with
Another thought: it could be that your
I found a Please let me know if the above solution doesn't work for your case (:
I received some of the requested details from a customer: https://gist.github.com/thully/03bc2a4e73cae53e079e2db66d3bebb1
Our process listens on a single TCP port. Excerpts:
Neither
So @panjf2000 is correct in that the connections are piling up in the listen queue. But I don't see how increasing somaxconn would make any difference, since the golang process is stuck and refusing to drain sockets off the queue.
I asked a user to check the current values, and it says:
So it appears those are incrementing on some users' systems for some reason. And once they hit 128, things stop making their way down to the process? I wonder what the application is supposed to do to tell the kernel to clear out the unaccepted connections?
I was able to reproduce somewhat on my own M1 mini, using a SYN flood. At least I can make the qlen/incqlen max out, but in my case the numbers reset back to zero very quickly afterwards.
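For anyone else trying to reproduce the counters filling up without a SYN-flood tool, here is a rough sketch of my own that only approximates the accept-queue side: it listens without ever calling Accept and dials a burst of connections at itself, which should push qlen up to maxqlen. The port, counts, and timeouts are arbitrary:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// A listener that never calls Accept, so the kernel accept queue can
	// only fill up (watch qlen/incqlen with netstat while this runs).
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	fmt.Println("listening on", addr, "- never accepting")

	// Dial a burst of connections; once the backlog is full,
	// further dials should start timing out.
	var conns []net.Conn
	for i := 0; i < 300; i++ {
		c, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			fmt.Printf("dial %d failed: %v\n", i, err)
			continue
		}
		conns = append(conns, c) // keep them open so the queue stays full
	}
	fmt.Printf("opened %d connections\n", len(conns))
	time.Sleep(time.Minute) // leave time to inspect the netstat output
}
```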
Re-reading @panjf2000's advice above, things are starting to make more sense.
I'm not seeing any Send-Q/Recv-Q on the LISTEN socket via
The SYN/accept queue is increasing and getting stuck, so this is clearly some sort of kernel bug on the macOS side. Users affected by this bug never see qlen drop back down to zero. Even when using
The issue is not a qlen spike, but rather a leak somewhere. However, increasing somaxconn is still a useful workaround, as it delays the problem. I confirmed that after changing the sysctl and restarting my program. The users reporting this issue are experiencing maxqlen exhaustion every 1-3 days, with up to 4 leaks an hour. So far one commonality is that all the affected users have some kind of network extension installed, as visible by
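A side note on the somaxconn workaround: as far as I can tell, Go reads kern.ipc.somaxconn once per process and uses it as the backlog for listen(2), which would explain why the new value only takes effect after restarting the program. A small sketch of my own for logging that limit at startup (it assumes golang.org/x/sys/unix and is not part of this project's code):

```go
//go:build darwin

// Hedged sketch: log the kernel's listen backlog limit at startup so remote
// diagnostics show what backlog the listener was created with.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	v, err := unix.SysctlUint32("kern.ipc.somaxconn")
	if err != nil {
		log.Fatalf("sysctl kern.ipc.somaxconn: %v", err)
	}
	log.Printf("kern.ipc.somaxconn = %d (listen backlog used by net.Listen)", v)
}
```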
I'm not an expert with macOS and although the
What I really meant earlier was the SYN queue and the accept queue; therefore, I think your case does indicate an issue inside Go netpoll. It could either be that
It seems like the app can't do anything other than call accept(). When a real request comes in and accept is called, the qlen is not cleared out, just decremented by one. So it doesn't seem like Go could be doing anything else, unless waking up and calling accept() earlier would have made a difference.
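For reference, the server side here is just the standard accept loop; each successful Accept dequeues exactly one connection from the kernel's queue, and there is no portable API to tell the kernel to discard the rest. A generic sketch (not this project's actual code; the port is a placeholder):

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Placeholder port for illustration.
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		// Each Accept removes exactly one connection from the kernel's
		// accept queue; if netpoll never wakes this goroutine, the queue
		// fills up to maxqlen and new clients time out.
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			continue
		}
		go handle(conn)
	}
}

func handle(c net.Conn) {
	defer c.Close()
	// ... serve the connection ...
}
```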
I filed FB11405981 with Apple. A user reports that removing the Little Snitch network extension fixed the issue. More details here: https://community.getchannels.com/t/channels-dvr-stops-responding-randomly-on-macos/32418/83
I am experiencing this as well in Ventura 13.6.1 and Sonoma 14.1.2
This looks normal to me? In my case, note the really long wait times indicating the "stuck" state:
Update: I've resolved my issue. The listener's IP address was not routable. Once I fixed that, the listener had no problem accepting connections, of course.
I have met the same problem when I create more than 200 nodes connected via net.TcpDial. After many rounds of message delivery, the final broadcast leaves around 40 nodes that never receive the broadcast message, on my M1 MacBook Air with macOS Sonoma 14.4.
Context
We ship golang binaries that run a net/http.Server to a number of users on different platforms.
Recently we've started receiving reports from macOS users, mostly using M1 Mac minis, of an issue where the HTTP server stops responding.
The reports started shortly after we shipped an upgrade from go1.17 to go1.18.
Observations
The issue is very intermittent, with some users experiencing it at least once a week.
At the same time, my own M1 Mac mini has never experienced this issue.
We have been trying to debug for a couple of months, and have been able to gather some information via users:
- does not seem to be ulimit related, as other parts of the binary continue to function properly, creating files and making network requests
- the listening socket is still active, because curl experiences a timeout instead of connection refused (a Go version of this check is sketched after this list)
- the golang stack trace shows TCPListener.Accept sitting idle in netpoll
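The curl check above can also be scripted in Go; here is a minimal sketch of my own (the loopback address and port are placeholders) that distinguishes "connection refused" from a timeout:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func main() {
	// Placeholder address; point this at the real listener.
	conn, err := net.DialTimeout("tcp", "127.0.0.1:8080", 3*time.Second)
	var nerr net.Error
	switch {
	case err == nil:
		fmt.Println("connected: the listener is accepting")
		conn.Close()
	case errors.Is(err, syscall.ECONNREFUSED):
		fmt.Println("connection refused: nothing is listening on the port")
	case errors.As(err, &nerr) && nerr.Timeout():
		fmt.Println("timed out: socket is open but connections are never accepted")
	default:
		fmt.Println("dial error:", err)
	}
}
```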
I'm looking for any suggestions on how to debug this further, or if anyone has seen anything similar.
Currently I am wondering if this issue affects just the single socket, or is something process-wide. To verify, we've spun up another listener on a secondary port, and are waiting to hear if that port keeps working when the primary port stops responding.
As a workaround, we are considering adding an internal health check in the process that tries to connect back to itself over loopback. If the connection times out, we could restart the TCPListener.
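That watchdog idea could look roughly like the sketch below. This is my own illustration: the address, interval, timeout, and restart strategy are placeholders, and closing and recreating the listener this way has not been validated against the actual service.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

// serve starts an HTTP listener on addr and returns it so the watchdog
// can close and recreate it later.
func serve(addr string, handler http.Handler) (net.Listener, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	go func() {
		// http.Serve returns once the watchdog closes the listener.
		if err := http.Serve(ln, handler); err != nil {
			log.Printf("http.Serve: %v", err)
		}
	}()
	return ln, nil
}

func main() {
	const addr = "127.0.0.1:8080" // placeholder address and port
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	ln, err := serve(addr, mux)
	if err != nil {
		log.Fatal(err)
	}

	// Watchdog: dial ourselves over loopback. If the dial times out,
	// assume the listener is wedged, close it, and listen again.
	for range time.Tick(30 * time.Second) {
		c, derr := net.DialTimeout("tcp", addr, 5*time.Second)
		if derr == nil {
			c.Close()
			continue
		}
		log.Printf("self-check failed (%v); recreating listener", derr)
		if ln != nil {
			ln.Close()
		}
		if ln, err = serve(addr, mux); err != nil {
			log.Printf("re-listen failed: %v", err)
			ln = nil
		}
	}
}
```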
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Have not tried with go 1.19.