
Flask-SocketIO behind nginx randomly thinks client has disconnected #327

Closed
suever opened this issue Sep 15, 2016 · 18 comments

@suever

suever commented Sep 15, 2016

I've been using Flask-SocketIO to provide real-time interaction between a web-based frontend and background celery-based processes. The site can be found at https://matl.suever.net. The source code is available at https://github.com/suever/MATL-Online. Here's an example link which should print the numbers 1 through 12 in the output window when you hit "Run".

The basic layout of the app is that the JavaScript SocketIO library on the client side sends information to the Flask application (behind nginx), which schedules a Celery task, and that task then emits information back to the client using the message_queue functionality of Flask-SocketIO.
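
For anyone unfamiliar with that pattern, a rough sketch of the structure looks like the following (this is not the actual MATL-Online code; the Redis URL, event names, and task body are placeholders):

from celery import Celery
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
celery = Celery(app.name, broker='redis://localhost:6379/0')

# The web process attaches to the message queue so external processes can emit
socketio = SocketIO(app, message_queue='redis://localhost:6379/0')

@socketio.on('submit')
def handle_submit(data):
    # Schedule the background job; output is pushed back to the client later
    run_code.delay(data['code'], data['session'])

@celery.task
def run_code(code, session):
    # In the Celery worker, a SocketIO instance bound only to the message queue
    # forwards emits to the web server, which relays them to the browser
    queue = SocketIO(message_queue='redis://localhost:6379/0')
    queue.emit('output', {'data': 'result of running ' + code}, room=session)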

Seemingly at random, the JavaScript client will successfully establish a connection, but when the user attempts to submit code to the server (via an emit), nothing happens because the client appears to be waiting for a PONG event that never arrives.

I have enabled debugging on the SocketIO client side as well as the engine.io backend, and here are the server log entries from when this occurs.

Backend Log

What seems to happen is that the client attempts to upgrade from HTTP polling to the WebSocket transport, and during that process the server side seems to decide that the client has disconnected, so the PONG event is never sent back to the client. That results in a timeout on the client side after 60 seconds, which then causes the connection to be re-established.

I'm at a bit of a loss as to where to look next to debug this, particularly since the issue happens randomly. I'm open to any suggestions. I figured this was the most relevant place to ask since there is already a good bit of discussion on Flask-SocketIO and nginx here.

Here is my nginx configuration

nginx.conf


user www-data;
worker_processes auto;
pid /run/nginx.pid;

events {
    worker_connections 768;
    # multi_accept on;
}

http {
    sendfile on;
    keepalive_timeout 3;
    types_hash_max_size 2048;

    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;

    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    include /etc/nginx/sites-enabled/*;
}

matl.conf

server {
    listen 443  ssl;
    server_name matl.suever.net;

    access_log  /var/log/nginx/matl_access.log;
    error_log   /var/log/nginx/matl_errors.log;

    ssl_certificate     chain.pem;
    ssl_certificate_key privkey.pem;

    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_prefer_server_ciphers on;
    ssl_dhparam /etc/ssl/certs/dhparam.pem;
    ssl_ciphers 'XXX';
    ssl_session_timeout 1d;
    ssl_session_cache shared:SSL:50m;
    ssl_stapling on;
    ssl_stapling_verify on;
    add_header Strict-Transport-Security max-age=15768000;

    location / {
        proxy_pass      http://127.0.0.1:5000;
        proxy_redirect  off;

        proxy_set_header        Host $host;
        proxy_set_header        X-Real-IP $remote_addr;
        proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /socket.io {
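        # Socket.IO endpoint: buffering is disabled so long-polling responses
        # are not held back, and the Upgrade/Connection headers are passed
        # through so the WebSocket handshake can complete ($connection_upgrade
        # comes from the map block in nginx.conf above).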
        proxy_pass              http://127.0.0.1:5000/socket.io;
        proxy_redirect          off;
        proxy_buffering         off;

        proxy_set_header        Host $host;
        proxy_set_header        X-Real-IP $remote_addr;
        proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_http_version      1.1;
        proxy_set_header        Upgrade $http_upgrade;
        proxy_set_header        Connection $connection_upgrade;
        proxy_read_timeout      120s;
    }
}

server {
    listen 80;
    server_name matl.suever.net;
    return 301 https://$host$request_uri;
}

And the installed packages within the virtualenv

alembic==0.8.6
amqp==1.4.9
anyjson==0.3.3
Babel==2.3.4
backports.ssl-match-hostname==3.5.0.1
BeautifulSoup==3.2.1
beautifulsoup4==4.5.1
billiard==3.3.0.23
bs4==0.0.1
celery==3.1.23
certifi==2016.2.28
click==6.6
cssmin==0.2.0
eventlet==0.17.4
Flask==0.11.1
Flask-Assets==0.11
Flask-Migrate==1.8.0
Flask-Script==2.0.5
Flask-SocketIO==2.4
Flask-SQLAlchemy==2.1
Flask-WTF==0.12
flower==0.9.1
futures==3.0.5
greenlet==0.4.10
gunicorn==19.6.0
itsdangerous==0.24
Jinja2==2.8
jsmin==2.2.1
kombu==3.0.35
Mako==1.0.4
MarkupSafe==0.23
mpmath==0.19
numpy==1.11.0
-e git://github.com/suever/oct2py.git@79e2507e127b71d9a9cdcee1174e3198e37209f5#egg=oct2py
python-editor==1.0.1
python-engineio==0.9.1
python-socketio==1.3
pytz==2016.4
redis==2.10.5
requests==2.10.0
scipy==0.17.1
six==1.10.0
SQLAlchemy==1.0.13
sympy==1.0
tornado==4.2
webassets==0.11.1
Werkzeug==0.11.10
WTForms==2.1
@miguelgrinberg
Owner

The client-side log that you have indicates that the server is responding with a 400 error here. I do not see the backend side of that interaction. Do you have it?

@suever
Author

suever commented Sep 15, 2016

@miguelgrinberg Thanks for the response. The 400 coming from the server (I believe) is the traceback here. The 400 is emitted by the server after the client tries to receive the PONG for 60 seconds and then disconnects. I believe this is the same as the theoretically benign issue reported here. I can even trigger this 400 error by manually disconnecting from the client side:

socket.disconnect()

Now when I manually disconnect, I don't get the line in the log file that appears just before that stack trace:

f66f24976d7c4b2290d080e4f3db743d: Failed websocket upgrade, expected UPGRADE packet, received None instead.

@andrewSC

Any movement on this?

@miguelgrinberg
Owner

Well, I have been unable to reproduce this problem. When the client requests the upgrade to websocket, for some reason the server thinks it has been inactive for too long and decides to kick that client out. Unfortunately these logs don't have timestamps in them, so I can't really see how much time passed between the different stages of the connection.

@suever
Author

suever commented Oct 16, 2016

@andrewSC Are you having similar issues?

I've added some logging code to my application and am capturing log output whenever the connection fails to upgrade, to get a little more information on what is happening. I hope to post the more detailed log output (with timestamps) here once I have it.
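
For reference, the extra logging is essentially just turning on the verbose Flask-SocketIO/Engine.IO loggers with a timestamped format, along these lines (a sketch, not the exact code):

import logging

from flask import Flask
from flask_socketio import SocketIO

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(levelname)s] in %(module)s: %(message)s',
)

app = Flask(__name__)
# logger/engineio_logger turn on the Socket.IO and Engine.IO server logs
socketio = SocketIO(app, logger=True, engineio_logger=True)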

@andrewSC

@suever Sorry I was on vacation. I'm having the same issue but I think it's due to an improperly configured uwsgi instance. I'm going to try and use gunicorn and see how far that gets me.

@qwexvf

qwexvf commented Oct 27, 2016

@andrewSC Hi, if you find out anything, please share it with us. Thanks!

@andrewSC

@qwexvf I honestly just switched over from uwsgi to gunicorn and within about 10 minutes had flask-socketio functional. The JS client starts with long polling and then asks for the WebSocket upgrade, which comes back with a 101, all entirely as expected. I know it's not quite as simple for some people to switch to gunicorn, since their project might be deeply ingrained with uwsgi, but that's how I solved my issue. I'm also assuming I didn't have uwsgi configured correctly; it could just be that I wasn't specifying gevent correctly in the config or somewhere else. Sorry I couldn't be of more help!
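
For anyone else making the same switch, the gunicorn invocations suggested in the Flask-SocketIO docs look roughly like this (module and app names are placeholders):

# eventlet worker
gunicorn --worker-class eventlet -w 1 module:app

# or gevent with native WebSocket support
gunicorn -k geventwebsocket.gunicorn.workers.GeventWebSocketWorker -w 1 module:app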

@ghost

ghost commented Feb 9, 2017

This is affecting me as well. I'm running nginx as a proxy (passing Socket.IO traffic using proxy_pass and other Python traffic with uwsgi_pass) and uwsgi to handle a Flask app. The frontend is using the Socket.IO JS library.

Flask==0.10.1
Flask-SocketIO==2.8.2
gevent==1.2.1
greenlet==0.4.11
grequests==0.3.0
python-engineio==1.1.2
python-socketio==1.6.3
requests==2.13.0
uWSGI==2.0.14

Both backend and frontend running on a single Ubuntu machine, using Chrome.
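
The relevant part of the nginx config looks roughly like this (socket path and port are placeholders, not my exact values):

location / {
    include uwsgi_params;
    uwsgi_pass unix:/tmp/app.sock;
}

location /socket.io {
    proxy_http_version 1.1;
    proxy_buffering off;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_pass http://127.0.0.1:5000/socket.io;
}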

@suever
Author

suever commented Feb 25, 2017

I finally got some logs capturing the event with debugging turned on for gunicorn, nginx, and flask-socketio. The only piece I don't have in this interaction is the client-side log, but hopefully from these logs it's discernible where in the client -> nginx -> gunicorn -> flask-socketio chain the issue lies.

This is currently affecting 13% of all sessions on my site.

@miguelgrinberg
Owner

@suever Can you explain what you consider a problem in these log sessions? If your user refreshed the page, it is expected that the connection will be lost and then re-established. Am I looking at the wrong thing?

@suever
Author

suever commented Feb 25, 2017

Sorry, let me elaborate a little bit.

So the first time the client connects to the page, a request to upgrade from polling to a WebSocket is sent by the client and received by flask-socketio (here). However, the upgrade never actually succeeds and the app doesn't work: the client thinks the connection has been upgraded and has therefore already terminated the polling, but flask-socketio doesn't consider the connection upgraded yet because it's still waiting on the UPGRADE packet. The inability to use the page causes the user to refresh it, which causes the disconnect (everything there is as expected).

However, when the page is reloaded, the request to upgrade to a WebSocket is again emitted by the client (here), and this time the upgrade is successful (here).

It seems that in the first case the UPGRADE packet is somehow lost between nginx, gunicorn, and flask-socketio. I don't understand enough about how that communication works to tell from the logs whether 1) the upgrade packet is actually received by nginx and 2) the upgrade packet is properly forwarded to gunicorn.

@miguelgrinberg
Owner

miguelgrinberg commented Feb 25, 2017

Okay, sorry, I understand now.

I think the gunicorn error is unrelated, but if you want to definitely rule that out as a potential problem, you can install the master branch, which has the issue addressed. I believe a new official release of gunicorn with this fix is coming any day now.

The Flask-SocketIO log shows that nothing happens in those 14 seconds between the attempted websocket connection and the time the user refreshed the page. That is odd, because the HTTP connection remains active until the websocket handshake is complete, and only then is it retired. It could be that 14 seconds isn't enough time to catch any traffic, but considering that the page was refreshed, am I correct in assuming that the client appeared blocked during those 14 seconds? If it is easy to repeat the test, it would be better to let the connection sit for a minute or two, to see if the HTTP long-polling mechanism is working.

Unfortunately gunicorn doesn't tell us anything useful, except for that 400 error, which I believe happens after the websocket connection is closed, so it is downstream of the problem. The timestamp on the stack trace seems to agree with this, as the error came at the end of that 14 second period, and the problem we are interested in is at the beginning.

I looked at the nginx log as well and compared the bad and the good connection cases, and I really don't see any smoking gun there either. It appears that in the bad case the connection was established and there was an initial exchange of data, but then the client did not respond with the expected response in the middle of the handshake.

Out of curiosity, are you able to characterize the clients that tend to fail more in some way? Is it a specific browser version, or some other attribute that makes it more likely to hit this problem?

@saeidalidadi

I have the same issue in production; everything is OK locally. I am using Node.js as the app server, nginx as a reverse proxy, and socket.io for WebSockets.
The connection is established but then immediately drops.

@Schnodderbalken

Schnodderbalken commented Dec 23, 2017

I have the exact same problem. Locally (where nginx is not used) everything works fine; in production, however, I see the behavior described above. The engineio output is as follows:

2017-12-23 08:33:40,313 [INFO] in engineio: d4222aa461c34cb6aec234baf66f799f: Sending packet OPEN data {'sid': 'd4222aa461c34cb6aec234baf66f799f', 'upgrades': ['websocket'], 'pingTimeout': 60000, 'pingInterval': 25000}
2017-12-23 08:33:40,313 [INFO] in engineio: d4222aa461c34cb6aec234baf66f799f: Sending packet MESSAGE data 0
2017-12-23 08:33:40,557 [INFO] in engineio: d4222aa461c34cb6aec234baf66f799f: Received request to upgrade to websocket
2017-12-23 08:33:40,677 [INFO] in engineio: d4222aa461c34cb6aec234baf66f799f: Sending packet NOOP data None
2017-12-23 08:34:00,833 [INFO] in engineio: 9e9ae7f5b6f64202a29031d8edb6c006: Received packet PING data None
2017-12-23 08:34:00,833 [INFO] in engineio: 9e9ae7f5b6f64202a29031d8edb6c006: Sending packet PONG data None
2017-12-23 08:34:25,869 [INFO] in engineio: 9e9ae7f5b6f64202a29031d8edb6c006: Received packet PING data None
2017-12-23 08:34:25,869 [INFO] in engineio: 9e9ae7f5b6f64202a29031d8edb6c006: Sending packet PONG data None
2017-12-23 08:34:40,810 [INFO] in engineio: d4222aa461c34cb6aec234baf66f799f: Client is gone, closing socket
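
Those pingTimeout/pingInterval values are the server-side Engine.IO settings; in Flask-SocketIO they can be tuned on the constructor if anyone wants to experiment with different timeouts, e.g. (values in seconds, just a sketch):

socketio = SocketIO(app, ping_interval=25, ping_timeout=60)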

Here are the images of the recurring bad requests and the details of one of them (client side):
https://imgur.com/CQr7ryV

This is the log from nginx

208.100.26.229 - - [23/Dec/2017 08:38:15] code 400, message Bad request version ('RTSP/1.0')
208.100.26.229 - - [23/Dec/2017 08:38:15] "OPTIONS / RTSP/1.0" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:20] code 400, message Bad HTTP/0.9 request type ('\x00')
208.100.26.229 - - [23/Dec/2017 08:38:20] "versionbind" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:25] code 400, message Bad request syntax ('\x00\x00\x00¤ÿSMBr\x00\x00\x00\x00\x08\x01@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x06\x00\x00\x01\x00\x00\x81\x00\x02PC NETWORK PROGRAM 1.0\x00\x02MICROSOFT NETWORKS 1.03\x00\x02MICROSOFT NETWORKS 3.0\x00\x02LANMAN1.0\x00\x02LM1.2X002\x00\x02Samba\x00\x02NT LANMAN 1.0\x00\x02NT LM 0.12\x00')
208.100.26.229 - - [23/Dec/2017 08:38:25] "¤ÿSMB@@�PC NETWORK PROGRAM 1.0MICROSOFT NETWORKS 1.03MICROSOFT NETWORKS 3.0LANMAN1.0LM1.2X002SambaNT LANMAN 1.0NT LM 0.12" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:30] code 400, message Bad request version ('GetClassName\x00\x00\x00\x02\x04\x00\x00\x00\x00\x01\x00')
208.100.26.229 - - [23/Dec/2017 08:38:30] "
                                                              GetClassName" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:36] code 400, message Bad request version ('\x00\x01\x97|\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
208.100.26.229 - - [23/Dec/2017 08:38:36] "�(rþ� �|" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:41] code 400, message Bad HTTP/0.9 request type ('\x00')
208.100.26.229 - - [23/Dec/2017 08:38:41] "
                                                              " 400 -
208.100.26.229 - - [23/Dec/2017 08:38:41] code 400, message Bad request syntax ('HELP')
208.100.26.229 - - [23/Dec/2017 08:38:41] "HELP" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:45] code 400, message Bad request syntax ('\x16\x03\x00\x00S\x01\x00\x00O\x03\x00?G×÷º,îê²`~ó\x00ý\x82{¹Õ\x96Èw\x9bæÄÛ<=Ûoï\x10n\x00\x00(\x00\x16\x00\x13\x00')
208.100.26.229 - - [23/Dec/2017 08:38:45] "SO?G×÷º,îê²`~óý�{¹Õ�Èw�æÄÛ<=Ûoïn(" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:45] code 400, message Bad request version ('\x00/\x00')
208.100.26.229 - - [23/Dec/2017 08:38:45] "ieU§ärandom1random2random3random4
                                                                                               /" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:45] code 400, message Bad request syntax ('\x00\x00\x00qj\x81n0\x81k¡\x03\x02\x01\x05¢\x03\x02\x01')
208.100.26.229 - - [23/Dec/2017 08:38:45] "qj�n0�k¡¢" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:50] code 400, message Bad HTTP/0.9 request type ('l\x00')
208.100.26.229 - - [23/Dec/2017 08:38:50] "l
                                                               " 400 -
208.100.26.229 - - [23/Dec/2017 08:38:51] code 400, message Bad request syntax ('\x01default')
208.100.26.229 - - [23/Dec/2017 08:38:51] "default" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:51] code 400, message Bad request syntax ('0\x84\x00\x00\x00-\x02\x01\x07c\x84\x00\x00\x00$\x04\x00')
208.100.26.229 - - [23/Dec/2017 08:38:51] "0�-c�$" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:56] code 400, message Bad HTTP/0.9 request type ('0')
208.100.26.229 - - [23/Dec/2017 08:38:56] "0
                                                               `�" 400 -
208.100.26.229 - - [23/Dec/2017 08:38:56] code 400, message Bad request version ('SIP/2.0')
208.100.26.229 - - [23/Dec/2017 08:38:56] "OPTIONS sip:nm SIP/2.0" 400 -
208.100.26.229 - - [23/Dec/2017 08:39:02] code 400, message Bad request syntax ('TNMP\x04\x00\x00\x00TNME\x00\x00\x04\x00')
208.100.26.229 - - [23/Dec/2017 08:39:02] "TNMPTNME" 400 -
208.100.26.229 - - [23/Dec/2017 08:39:07] code 400, message Bad HTTP/0.9 request type ('\x03\x00\x00')
208.100.26.229 - - [23/Dec/2017 08:39:07] "
                                                              à" 400 -
208.100.26.229 - - [23/Dec/2017 08:39:12] code 400, message Bad request syntax ('DmdT\x00\x00\x00\x17\x00\x00\x00\x01\x00\x00\x00\x00\x11\x11\x00ÿ\x01ÿ\x13')
208.100.26.229 - - [23/Dec/2017 08:39:12] "DmdTÿÿ" 400 -
208.100.26.229 - - [23/Dec/2017 08:39:17] code 400, message Bad HTTP/0.9 request type (':\x00\x00\x00/\x00\x00\x00\x02\x00\x00@\x02\x0f\x00\x01\x00=\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/\x00\x00\x00\x00\x00\x00\x00\x00\x00@')
208.100.26.229 - - [23/Dec/2017 08:39:17] ":/@=/@" 400 -
208.100.26.229 - - [23/Dec/2017 08:39:22] code 400, message Bad request syntax ('\x01\x00\x00ýÎú\x0b°\xa0\x00\x00\x00MMS\x14\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x00\x00\x00\x01\x00\x03\x00ðððð\x0b\x00\x04\x00\x1c\x00\x03\x00N\x00S\x00P\x00l\x00a\x00y\x00e\x00r\x00/\x009\x00.\x000\x00.\x000\x00.\x002\x009\x008\x000\x00;\x00 \x00{\x000\x000\x000\x000\x00A\x00A\x000\x000\x00-\x000\x00A\x000\x000\x00-\x000\x000\x00a\x000\x00-\x00A\x00A\x000\x00A\x00-\x000\x000\x000\x000\x00A\x000\x00A\x00A\x000\x00A\x00A\x000\x00}\x00\x00\x00àmß_')
208.100.26.229 - - [23/Dec/2017 08:39:22] "ýÎú
                                                                 ° MMSðððð
                                                                          NSPlayer/9.0.0.2980; {0000AA00-0A00-00a0-AA0A-0000A0AA0AA0}àmß_" 400 -
208.100.26.229 - - [23/Dec/2017 08:40:08] code 400, message Bad HTTP/0.9 request type ('\x00Z\x00\x00\x01\x00\x00\x00\x016\x01,\x00\x00\x08\x00\x7fÿ\x7f\x08\x00\x00\x00\x01\x00')
208.100.26.229 - - [23/Dec/2017 08:40:08] "Z6 :4æ(CONNECT_DATA=(COMMAND=version))" 400 -

Could it be that nginx is not properly configured for long polling, or more generally for socket.io?

@miguelgrinberg
Owner

@Schnodderbalken the nginx log shows things that appear to be unrelated to HTTP/WebSocket. Any idea what those are? It seems clients for other protocols are connecting to nginx.

@miguelgrinberg
Owner

I'm going to close this, as I have no indication that this issue persists after other fixes related to leaking disconnected sockets.

@suever
Author

suever commented Jun 10, 2019

@miguelgrinberg Thanks! I'm running the latest versions and it seems like the error has gone away 🤞
