Very slow response times on Heroku using Gunicorn 18.0? #379
@whenchyiv if possible, it would be useful if you could revert Flask-SocketIO, python-socketio and python-engineio back to the versions you were running when everything worked. If making only that change, and leaving everything else at the newer versions, fixes the problem, then I guess I'll need some help in setting up an example similar to yours to debug. |
Already done. :) The application will next be under conditions similar to today's on Thursday (it's a learning management system using sockets for live classes), so I will update this issue with the results of rolling back the package versions then. |
Rolling back to the following configuration has effectively eliminated the slow response times:
After downgrading, I did briefly see a smattering of slow response times, though nothing like what prompted me to open this issue. In the interest of assisting with debugging, you will find the logs of what I saw below:
You will notice that all of these errors happened to the same user. On a hunch, I checked the user's IP - it seems to be a cellular connection, and our app is very bandwidth intensive (two-way streaming video over WebRTC). I suspect that the user's cellular connection may explain the idle connection errors, and therefore the logs above. Does that make sense? Also, since it does indeed look like downgrading has remedied the slow response time issue, what help would you like with setting up a debugging environment? Thanks again! |
I'm really at a loss here, because I don't see any significant change in either Flask-SocketIO, python-socketio or python-engineio. Could you try just switching the SocketIO client to see if that's what's causing this? I haven't really used the 1.7.x versions much yet; maybe they changed something on their side that affects the server. It would also be useful if you could check whether this problem can be reproduced locally. |
So far I've been unsuccessful with reproducing this locally. I'll try switching around the SocketIO clients, and see if I can isolate what's causing the issue (the various Flask-SocketIO related packages, or the JS client version), and report back. |
A quick update @miguelgrinberg, the following configuration has been working like a charm with the updated Flask-SocketIO packages:
I'll try upgrading the JS client to 1.7.2 next and report back on the results. I strongly suspect that the 1.7.x branch of the client is the issue, since the updated versions of Flask-SocketIO's packages are running perfectly with the 1.4.8 client. |
@miguelgrinberg a final update on this: I just tried upgrading to the Socket.io JS client 1.7.5 again, and am seeing a smattering of very slow response times and H12/H15 errors. 1.4.8 has been working flawlessly all week, so the issue does seem to be the 1.7.x branch of the Socket.io client based on what I'm seeing right now. That said, it still looks like the client is working fine in spite of the response times, so I'm not sure if it's a significant concern or not. |
The 1.7.x line of the JS Socket.IO client is fairly new. The 1.7.0 release came out less than a month ago, and so far they have released 1.7.1 and 1.7.2 to address bugs. I would hold off on adopting this version for a bit and let it stabilize first. You may want to try the 1.6.0 release, which is the one that precedes 1.7.0. |
Sounds good. I'll give 1.6.0 a shot per your suggestion, and hold off on the 1.7 branch for now. Thank you so much for all of your help (and for your work on Flask-SocketIO and its attendant packages - it's really fantastic)! |
@miguelgrinberg, another quick update on this: I'm seeing the issue with the 1.6 branch of the JS client, and after some analysis I've found that it seems to happen with users on slow internet connections that are likely saturating when using streaming video (specifically OpenTok). Might there be some change to Socket.io after the 1.4.x branch that would result in logging very long request times (20,000+ ms) on saturated/slow network connections? The vast majority of users see response times <50ms. I've triple checked my code and don't see anything that would explain the response times. I have also tried scaling to far more dynos than could possibly be needed for 20-40 simultaneous socket connections, which doesn't help - one or two users seems to see these intermittent slow response times regardless, so it does not seem to be an issue with server capacity. Any thoughts? I'm totally baffled. |
Have you tried running the server with Socket.IO logging enabled, to see how things look? Add |
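For reference, enabling both the Socket.IO and Engine.IO loggers is done through arguments to the SocketIO constructor; a minimal sketch, assuming a standard Flask app setup (the module layout here is an assumption):

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
# logger enables Socket.IO-level logging; engineio_logger enables the lower-level
# Engine.IO packet logging. Both write to stderr by default.
socketio = SocketIO(app, logger=True, engineio_logger=True)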
I actually wasn't aware of the engineio_logger - will take a look, and report back. |
I'm not seeing any issues on the engineio_logger. Everything seems to be working as expected. I have noticed, however, that it is the exact same users seeing these slow response times again and again. Any thoughts regarding that? |
So you don't see these particular clients losing their connections and reconnecting? The slow responses must be causing timeouts. Are these clients using websocket or long-polling? |
The vast majority have fallen back to polling; the issue is rarely seen with websocket connections. I have been logging connections/disconnections throughout, and the affected clients do seem to disconnect/reconnect intermittently. That said, there does not seem to be a relationship between disconnects and long response times. There are indeed a smattering of timeouts, but nothing consistent. My initial hunch was that this was long polling being incorrectly flagged as a single slow request/response cycle, but the clients only see these response times intermittently - some of the responses to those particular clients are <50ms, as seen by most users, and there is no correlation between what the server is doing and those longer response times. I have also noticed that some of these clients seem to have more than one socket/polling connection open, despite the client code only having one connection var. They do not, however, seem to have multiple tabs or windows open, which would ordinarily explain this. No idea what's going on there, but I figure it might provide some sort of clue about what is going on. 🙂 |
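A rough sketch of the kind of connect/disconnect logging described above (the handler bodies and log format are illustrative assumptions, not the actual application code):

from flask import Flask, request
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on('connect')
def on_connect():
    # request.sid is the Socket.IO session id assigned to this client
    app.logger.info('connect: sid=%s ip=%s', request.sid, request.remote_addr)

@socketio.on('disconnect')
def on_disconnect():
    app.logger.info('disconnect: sid=%s', request.sid)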
Well, disconnects are definitely not normal, unless the network conditions are bad. And there should be a direct relationship between slow responses and disconnects, since when one of the sides doesn't get a timely response it assumes the other side is gone and disconnects. I really find it odd that the logs do not show any indication of these problems. Can you share them? |
Sure. Below is a sample of what I was just seeing - it's a bit messy because of the keep alive I had implemented previously to troubleshoot the issue, on top of the usual PING/PONG:
|
A couple of comments. There is only one slow request in this log; Heroku timed it out at 30 seconds. The default long polling interval is 25 seconds, which is actually dangerously close to Heroku's timeout. There is one other (successful) instance in which a request lasted 29.9 seconds. You may want to use a shorter polling interval, or else, if possible, increase Heroku's timeout (I don't remember if that is possible). As a side note, I see that you are using multiple dynos for this. Are you connecting them all to a message queue? |
Are the other 1000ms+ requests simply long polling being logged (and therefore should be ignored), rather than slow requests? There are a handful of such requests in the logs above (which only span ~1 minute of server time - had I pasted several minutes of logs you would have seen more, but it would have been mighty long 😬 ). I would be very surprised if any requests take longer than one or two hundred milliseconds for the server to respond, based on what the server is doing. I am indeed using a message queue (Redis), and am using Heroku's session affinity to ensure that requests are routed correctly. So the solution would be to use a shorter interval via the ping_interval var (I don't believe Heroku permits you to extend their timeout interval)? Or is there something else going on here? |
I believe so, yes. That is totally expected and not an indication of a problem. It only becomes a problem when nothing happens during the entire polling cycle, which makes the request linger for 25 seconds, which, when accounting for overhead, scheduling of the threads, etc., could get to 30, at which point Heroku will cancel the request and the client will disconnect. I would like to know why these clients do not upgrade the connection from polling to websocket. This could be because the clients don't support websocket, I guess. My suggestion is to reduce this 25-second interval to maybe 20, and see if that helps prevent the disconnects. This is an argument to the |
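Assuming the argument referred to above is the ping_interval mentioned earlier, and that it is passed to the SocketIO constructor, the change would look roughly like this:

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
# A 20-second ping/polling interval keeps the longest idle polling request
# comfortably under Heroku's 30-second router timeout.
socketio = SocketIO(app, ping_interval=20)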
Great - I'll reduce the interval var to 20, and see if that doesn't remedy the issue. Thank you so much for all of your help with this!
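For completeness, the Redis message queue mentioned above is configured through Flask-SocketIO's message_queue argument; a minimal sketch, assuming the Redis URL is exposed through Heroku's REDIS_URL config var:

import os
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
# All dynos point at the same Redis instance so Socket.IO events can be
# relayed between workers.
socketio = SocketIO(app, message_queue=os.environ['REDIS_URL'])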
My suspicion is that they are on networks that are filtering socket connections (e.g. behind corporate proxies or firewalls) - the software is for real estate professionals, many of whom would be in corporate environments with restricted networks. The clients are all Chrome/Firefox/Edge because of the need for WebRTC support, so I don't think that browser support is the issue. I've tried both ws and wss, and neither seems to make a difference for the handful of clients that experience this issue. |
No activity in 4 years. Closing. |
I'm seeing very slow socket response times after upgrading to Flask-SocketIO v2.8.1 and Socket.io client v1.7.2, which is triggering Heroku and New Relic alerts.
I'm suspicious that this might be Heroku mis-reporting the response times of long connections over sockets and polling, but I did not see this behavior when using Flask-SocketIO v2.7.1 and Socket.io v1.4.5 (my previous configuration), and didn't notice anything in the commits that might explain the change. Client side, sockets seem to work fine. Is this something I should be worried about, or something that I should filter from the app metrics and ignore?
The following are examples of what I'm seeing:
Websocket:
Dec 12 16:24:45 heroku/router: at=info method=GET path="/socket.io/?EIO=3&transport=websocket&sid=51856e7286b246bead0a87d237865d25" dyno=web.2 connect=0ms service=4753545ms status=101 bytes=34616
Polling:
Dec 13 14:53:41 heroku/router: at=info method=GET path="/socket.io/?EIO=3&transport=polling&t=LZwOzOq&sid=1a701a331af742b4b5cde10e3ef9b1e5" dyno=web.2 connect=0ms service=24953ms status=200 bytes=252
Timeouts:
Dec 13 15:02:14 heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/socket.io/?EIO=3&transport=polling&t=LZwQo50&sid=6f73a93f48aa44d4b686fb2c92f5449a" dyno=web.1 connect=1ms service=30000ms status=503 bytes=0
Dec 13 15:05:18 heroku/router: at=error code=H15 desc="Idle connection" method=GET path="/socket.io/?EIO=3&transport=websocket&sid=e77142bf3a884f8ab1dd65eb1396452d" dyno=web.1 connect=0ms service=903107ms status=503 bytes=
In the event that this is a concern, and not expected behavior, some troubleshooting I went through already:
I reviewed #171 and have ensured that I am using Gunicorn v18.0. I have tried using a keep-alive to prevent Heroku from disconnecting sockets after 55s, in case that might be the issue. I have confirmed that the app is monkey patched at the top of app.py. My Procfile reads as follows:
web: newrelic-admin run-program gunicorn --worker-class eventlet -w 1 app:app
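For reference, the monkey patching mentioned above sits at the very top of app.py, before any other imports; a minimal sketch (the module layout and names are assumptions based on the Procfile above):

import eventlet
eventlet.monkey_patch()  # must run before anything else imports the standard library networking modules

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)
# Under gunicorn's eventlet worker (see the Procfile above), gunicorn serves the
# Flask app object directly; socketio.run(app) is only needed when running standalone.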
Any thoughts? Thanks!