Broken gunicorn implementation #132

balboah · 2013-03-15T14:40:31Z

The commit 2e251ce seem to introduce problems for gunicorn.
If I run gunicorn with more than 1 worker, I get bad file descriptor sockets and everything seem to fall apart.

Tested with gunicorn 0.13 and 0.17.

Running the commit before that one seems to work fine.

The text was updated successfully, but these errors were encountered:

sontek · 2013-04-13T23:09:09Z

@balboah I had issues with that commit as well but I thought we had fixed it since then. I'll do some testing.

sontek · 2013-04-13T23:41:29Z

Yeah, multiple workers are broken.

sunbit · 2013-05-09T06:20:14Z

Any updates on this? If anyone has a clue of what modification could have caused this, i can take a look at it...

fcurella · 2013-05-21T16:37:23Z

just pinging to say I'm having the same issue here.

@sontek : which one are the commits that try to fix this issue?

bvallant · 2013-05-31T12:17:02Z

Seem to have the same problems as well... Websockets sometimes work/sometimes not... Generally spoken everything's quite buggy. I'd also like to investigate on this but don't know where to start?

Generally spoken I'd like to ask how do you generally recommend to run it in production?

stevesw · 2013-06-08T08:00:06Z

I think I have tracked down the issue to these two added lines in socketio.server.SocketIOServer.get_socket():

2e251cec socketio/server.py       (Alexandre Bourget       2012-11-15 17:29:02 -0500 131)         if sessid and not socket:
2e251cec socketio/server.py       (Alexandre Bourget       2012-11-15 17:29:02 -0500 132)             return None  # you ask for a session that doesn't exist!

To elaborate, it looks like each worker process is given its own SocketIOServer, and each SocketIOServer maintains its own dictionary (self.sockets) mapping sessids to Sockets.

During the connection process, two separate requests are made. First, a request is made to: /socket.io/1/ (which I guess is the handshake?), and then this is followed up by another call to something like /socket.io/1/websocket/419838463361 (which actually establishes the socket).

If one worker handles the first request, and then another worker handles the second request, the second request is going to fail because it won't be able to find the sessid saved by the first worker on its instance of SocketIOServer.

The reason it works sporadically is because sometimes you would coincidentally have the same worker handle both requests, so it will be able to find the sessid it saved. Of course, the more workers you have, the less likely that is to occur, which explains the behavior reported in issue #125.

Commenting out those 2 lines of code resolves the issue, but I'm not familiar enough with the project to know whether that's a suitable long term solution. (It will end up creating a duplicate Socket on the second worker. I haven't seen where they are being cleaned up but I guess it could cause a leak or be problematic in some other way?)

stevesw · 2013-06-08T08:33:17Z

After reviewing the code in socketio.handler.SocketIOHandler._do_handshake and the socket.io spec I don't see any reason that the handshake request needs to actually create a Socket, the only thing it uses it for is the socket.sessid which is a randomly generated number. It seems like you could just directly generate that random number for the handshake, and wait to create the Socket until the client makes the follow up 'transport connection' request.

I think that would resolve the connectivity issue for the websocket transport when running multiple workers, but as related issue #112 mentions, it will still be problematic for the polling based transports.

abourget · 2013-09-15T19:12:29Z

Please read invitation to Wednesday September 18th's sprint: https://groups.google.com/forum/#!topic/gevent-socketio/2OIRKA8M2uE

arnuschky · 2013-10-08T17:00:16Z

We are facing the same issue. Is there a solution yet to this problem? Has the workaround proven to be sustainable? I am unwilling just to comment some lines on the production machine...

abourget · 2013-10-08T21:14:23Z

Commenting that out will just mean you don't care if the socket was present before. If you don't save anything in the session.. then it might not change anything for your use case, but it breaks statefullness. It is quite tricky to handle socket.io requests on two workers. Some messages could v be pending on worker 1 while you connect to worker 2.. potentially relaunching background jobs, etc.

To keep statefullness, ideally, one client should always connect to the same worker. Even with front load balancers, the same path through reverse proxies should be kept for the duration of the socketio connection.

It involves devops stuff, amd I'm not sure exactly which part of that problen should be dealt with directly in gevent-socketio.

Any suggestions ?

arnuschky · 2013-10-10T16:05:55Z

Our application is not stateful, so no problems there. I installed the latest version of gevent-socketio and applied the patch. Unfortunately, the patch made no difference at all.

Our application (it's a game) is pretty simple: the website polls data on two different paths regularly (with different frequency). The server sends changes that occur on the server side of things to all connected clients. Up to that point, we did not notice any problems (but maybe they were already there).

We now added a simple chat functionality. The client sends a message to the server, which broadcasts it to all clients (sender included). The server receives the message from the client without problems, but the broadcast arrives at the clients only at about a third of the cases.

I am not even sure if our problem is related to the one outlined in this (and the other) issues, but we noticed that it works 100% when we reduce the gunicorn workers to 1.

@abourget Let me know if I can help/test in any way. I can't really suggest anything for the solution of this as we are only users and not at all familiar with the architecture of gevent.

NB: We were using the pip install up to now, which did not include the line in question. I installed the latest git, but I am not sure if this is advisable. Which version should we use on a production system?

abourget · 2013-10-10T16:33:23Z

About the chat issue:

see, having 2 or more workers, means 2 or more processes. Each process is independant, holds its own Python objects. The socket.io lib keeps a list of open sockets within it's process.. so if it happens 2 users are connected to this process, then "broadcasting" will send messages to those connected locally.. it doesn't know about users connected to another worker, nothing is shared between them.

In order to build something that would broadcast to all users available, you'd need an external system, like a Redis PubSub, or 0mq or RabbitMQ kind of thing.. and every worker would connect to that central element, subscribe, and publish. It would be the glue between the isolated workers.

Does that make sense ?

arnuschky · 2013-10-11T07:44:45Z

Ok, that makes a lot of sense. One question though: the sender doesn't see his own message usually. Does that just mean that he switched worker thread in the meantime?

Thanks for the replies and sorry that this went a bit OT.

abourget · 2013-10-11T13:16:42Z

Depends on which Mixin you're using. There are two broadcast functions in the BroadcastMixin.. one that echoes back to the user, one that doesn't.

arnuschky · 2013-10-13T22:15:24Z

Thanks for your comments. The nondeterministic behavior was definitely due to some instance mixup and a lack of basic understanding. We rewrote the code using redis pub/sub and everything works as expected!

abourget · 2013-10-15T15:32:19Z

Great to hear that :)

On Sun, Oct 13, 2013 at 6:15 PM, arnuschky [email protected] wrote:

Thanks for your comments. The nondeterministic behavior was definitely due
to some instance mixup and a lack of basic understanding. We rewrote the
code using redis pub/sub and everything works as expected!

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/132#issuecomment-26228437
.

jensenw · 2013-11-12T04:39:43Z

hv same problem here , already try using redis but still buggy , sometimes it can send message to browser but sometimes it can't . Any updates ?

ghost · 2013-12-02T05:54:45Z

Going back on the statelessness and sessions, am I correct to assume that the request will be sent only once for socket communication purposes? I am currently using pyramid with a session backed on redis. Is that be a good way to maintain a state – I am using the request.session dict for authentication purposes.

abourget · 2013-12-03T14:30:00Z

Request.session will not be present on other nodes/processes unless also
backed by something like redis. Beaker supports it so you could use it
there.

For each socket.io connection, there is only one pyramid request.. the
rest is out of bound from a pyramid pt of view.

Alexandre
On 2013-12-02 12:54 AM, "zman0225" [email protected] wrote:

Going back on the statelessness and sessions, am I correct to assume that
the request will be sent only once for socket communication purposes? I am
currently using pyramid with a session backed on redis. Is that be a good
way to maintain a state – I am using the request.session dict for
authentication purposes.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/132#issuecomment-29596153
.

jrasanen · 2014-02-21T09:46:42Z

I have the same problem, so is Beaker a decent solution?

fcurella mentioned this issue May 21, 2013

workers > 1 falls back to xhr-polling fcurella/django-push-demo#1

Closed

amleczko added a commit to getpenelope/gevent-socketio that referenced this issue Aug 27, 2013

workaround for abourget#132

6c80c17

arnuschky pushed a commit to arnuschky/gevent-socketio that referenced this issue Dec 9, 2013

Workaround for abourget#132

3cf4575

dustinmm80 mentioned this issue May 5, 2014

Working Behind an (NginX) Proxy? miguelgrinberg/Flask-SocketIO#22

Closed

nchudleigh mentioned this issue Jun 19, 2015

When I use more than 1 worker chat is incredibly unreliable. miguelgrinberg/Flask-SocketIO#122

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broken gunicorn implementation #132

Broken gunicorn implementation #132

balboah commented Mar 15, 2013

sontek commented Apr 13, 2013

sontek commented Apr 13, 2013

sunbit commented May 9, 2013

fcurella commented May 21, 2013

bvallant commented May 31, 2013

stevesw commented Jun 8, 2013

stevesw commented Jun 8, 2013

abourget commented Sep 15, 2013

arnuschky commented Oct 8, 2013

abourget commented Oct 8, 2013

arnuschky commented Oct 10, 2013

abourget commented Oct 10, 2013

arnuschky commented Oct 11, 2013

abourget commented Oct 11, 2013

arnuschky commented Oct 13, 2013

abourget commented Oct 15, 2013

jensenw commented Nov 12, 2013

ghost commented Dec 2, 2013

abourget commented Dec 3, 2013

jrasanen commented Feb 21, 2014

Broken gunicorn implementation #132

Broken gunicorn implementation #132

Comments

balboah commented Mar 15, 2013

sontek commented Apr 13, 2013

sontek commented Apr 13, 2013

sunbit commented May 9, 2013

fcurella commented May 21, 2013

bvallant commented May 31, 2013

stevesw commented Jun 8, 2013

stevesw commented Jun 8, 2013

abourget commented Sep 15, 2013

arnuschky commented Oct 8, 2013

abourget commented Oct 8, 2013

arnuschky commented Oct 10, 2013

abourget commented Oct 10, 2013

arnuschky commented Oct 11, 2013

abourget commented Oct 11, 2013

arnuschky commented Oct 13, 2013

abourget commented Oct 15, 2013

jensenw commented Nov 12, 2013

ghost commented Dec 2, 2013

abourget commented Dec 3, 2013

jrasanen commented Feb 21, 2014