-
The error that you are getting is a SQLAlchemy error, not related to Flask-SocketIO. It means that your database connection pool is too small for the number of concurrent connections you need. You should review your usage of database sessions and make sure you do not hold on to sessions when you don't need them. If you still get errors after that, then you can increase the size of the SQLAlchemy pool.
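If a larger pool does turn out to be necessary, and assuming Flask-SQLAlchemy is managing the engine, a minimal sketch of how the pool limits can be raised looks like this (the numbers are illustrative placeholders, not recommendations):

```python
# Sketch: raising SQLAlchemy pool limits in a Flask app that uses Flask-SQLAlchemy.
# The values below are placeholders, not tuned recommendations.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql+psycopg2://user:pass@localhost/dbname"
app.config["SQLALCHEMY_ENGINE_OPTIONS"] = {
    "pool_size": 20,        # connections kept open in the pool (SQLAlchemy default is 5)
    "max_overflow": 30,     # extra connections allowed beyond pool_size (default is 10)
    "pool_timeout": 30,     # seconds to wait for a free connection before raising TimeoutError
    "pool_pre_ping": True,  # check connections are alive before handing them out
}
db = SQLAlchemy(app)
```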
-
I believe I have discovered the issue. Greenlets are non-blocking as long as gevent is left to handle the main event loop, hence the need to monkey patch. But libraries written in C, like psycopg2 (which we are using along with SQLAlchemy), cannot be monkey patched by gevent or eventlet. So if there is, for example, a long-running database query (like in my case) and gevent is in control of the event loop, it can switch to another greenlet that is ready to execute and resume the original greenlet once the slow operation completes. But since psycopg2 is not monkey patched, its operations are all synchronous, so no other greenlet can run until that long, slow query finishes. There is a 'psycogreen' package that apparently alleviates this issue, and it appears to have been resolved in psycopg3. I will investigate whether either of these solves my issue and report back.
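For reference, psycogreen exposes a patch function per async framework that makes psycopg2 cooperate with the event loop via its wait callback. A minimal sketch for gevent (the patch has to run before any database connections are created, ideally right after monkey patching):

```python
# Sketch: making psycopg2 cooperative under gevent with psycogreen.
# Run this as early as possible, before SQLAlchemy creates any connections.
from gevent import monkey
monkey.patch_all()

from psycogreen.gevent import patch_psycopg
patch_psycopg()  # registers a gevent-aware wait callback on psycopg2
```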
-
I'll try to keep this brief, but I have a lot of code examples to show. Please let me know if you need more context!
We recently updated our Flask server to handle WebSocket connections using Flask-SocketIO. There was a major performance hit (we use k6 to load test the prod server, and have CloudWatch logs in Elasticsearch for individual APIs), which made sense because the server can now only use a single worker (we used 4 previously).
To solve this, we adjusted the nginx configuration to load balance across multiple Socket.IO servers. This helped, and overall performance certainly improved. But certain pages take anywhere from 15 to 45 seconds to load. This happens on pages that hit a large number of APIs (50+), and I'm guessing it is related to this error in Sentry, which we never saw before adding Flask-SocketIO
Based on this error, it seems that when too many requests are made at once and the QueuePool limit is reached, the server hangs, causing these extremely slow load times. So increasing the pool size and max overflow should help to alleviate the problem.
Question: How can I optimize our Flask-SocketIO setup to eliminate the QueuePool overflow errors and regain our previous performance levels? Are there architectural changes or advanced techniques beyond increasing resource limits and adding server nodes that I should consider?
Here is the nginx conf
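(The actual conf isn't reproduced in this extract; the block below is a hypothetical sketch of what load balancing several Socket.IO servers behind nginx generally looks like. Socket.IO needs sticky sessions, so `ip_hash` and the WebSocket upgrade headers are the important parts; the upstream ports are made up.)

```nginx
# Hypothetical sketch: nginx load balancing three Socket.IO servers.
# ip_hash provides sticky sessions, which Socket.IO requires; ports are placeholders.
upstream socketio_nodes {
    ip_hash;
    server 127.0.0.1:5000;
    server 127.0.0.1:5001;
    server 127.0.0.1:5002;
}

server {
    listen 80;

    location /socket.io {
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_pass http://socketio_nodes;
    }

    location / {
        proxy_set_header Host $host;
        proxy_pass http://socketio_nodes;
    }
}
```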
Here is the server service
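(Likewise not the actual service definition; just a hypothetical sketch of the gunicorn invocation a Flask-SocketIO service typically wraps when using gevent: a single worker per process, with the bind port varying per instance. `module:app` and the port are placeholders.)

```bash
# Hypothetical sketch: one Flask-SocketIO process under gunicorn with a single gevent worker.
gunicorn --worker-class gevent --workers 1 --bind 127.0.0.1:5000 module:app
```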
And here in the deployment is where we spin up three servers
If it helps, here is the Flask-SocketIO related logic
The socket instance
In app.py
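(The original snippets aren't included in this extract. For a multi-node setup like the one described, the relevant piece is usually that the SocketIO instance is created with a shared message queue so the load-balanced servers can deliver events to each other's clients. A hypothetical sketch, assuming gevent and Redis:)

```python
# Hypothetical sketch of the Socket.IO instance for a multi-server deployment.
# The Redis URL is a placeholder; any message queue supported by Flask-SocketIO works.
from flask import Flask
from flask_socketio import SocketIO

socketio = SocketIO()

def create_app():
    app = Flask(__name__)
    # message_queue lets the load-balanced server processes share events;
    # async_mode="gevent" matches running under gunicorn's gevent worker.
    socketio.init_app(app, message_queue="redis://localhost:6379/0", async_mode="gevent")
    return app
```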
Let me know if any more context is needed (e.g. k6 summaries, gunicorn logs)