Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncServer.handle_request throws unhandled KeyError exceptions #391

Closed
ishmelev23 opened this issue Dec 6, 2024 · 5 comments
Closed

AsyncServer.handle_request throws unhandled KeyError exceptions #391

ishmelev23 opened this issue Dec 6, 2024 · 5 comments
Labels

Comments

@ishmelev23
Copy link

I saw this issue in SocketIO issues section before, but by some reason it wasn't considered a bug.

Describe the bug
When Socket is closed, but hasn't been deleted from memory of BaseServer, it causes KeyError('Session is disconnected`) exceptions which are unhandled by EngineIO or SocketIO servers.
The exception is raised from this method:

if s.closed:
del self.sockets[sid]
raise KeyError('Session is disconnected')

when client continue to send polling GET requests and it ends up here (probably POST as well, but we didn't notice errors related to POST)
if sid not in self.sockets:
self._log_error_once('Invalid session ' + sid, 'bad-sid')
r = self._bad_request('Invalid session ' + sid)
else:
socket = self._get_socket(sid)
if self.transport(sid) != transport and \

In some cases it's expected behavior and handled properly, like

try:
socket = self._get_socket(sid)
except KeyError:
# the socket is not available
self.logger.warning('Cannot send to sid %s', sid)
return
await socket.send(pkt)

and
try:
socket = self._get_socket(sid)
except KeyError: # pragma: no cover
# the socket was already closed or gone
pass
else:
await socket.close()
if sid in self.sockets: # pragma: no cover
del self.sockets[sid]

Apparently, it's some kind of race condition when the socket has been closed, but wasn't deleted by service task

for s in self.sockets.copy().values():
if s.closed:
try:
del self.sockets[s.sid]
except KeyError:
# the socket could have also been removed by
# the _get_socket() method from another thread
pass
elif not s.closing:

The issue is that: the server shouldn't raise an unhandled exception when the client sends requests which are not consistent with the server's internal state.
Right now we handle it by ourselves, but I believe that exceptions related to server's internal state should be managed by server itself.

To Reproduce
Hard to say how to reproduce it in e2e manner, because it just happens in production, and it's flaky, I didn't find a reason, why the closed Socket stays in sockets dictionary of AsyncServer.

But it could be reproduced in unit tests:

  1. Start connection.
  2. Close connection from client side (so the socket will be closed on the server side).
  3. Immediately send polling request to the server.

Expected behavior
In that case, I expect that request will end up in branch and I get BadRequest error from server:

self._log_error_once('Invalid session ' + sid, 'bad-sid')
r = self._bad_request('Invalid session ' + sid)

Bot not in
else:
socket = self._get_socket(sid)
if self.transport(sid) != transport and \
transport != upgrade_header:
self._log_error_once(
'Invalid transport for session ' + sid,
'bad-transport')
r = self._bad_request('Invalid transport')
else:
try:
packets = await socket.handle_get_request(environ)
if isinstance(packets, list):
r = self._ok(packets, jsonp_index=jsonp_index)
else:
r = packets
except exceptions.EngineIOError:
if sid in self.sockets: # pragma: no cover
await self.disconnect(sid)
r = self._bad_request()
if sid in self.sockets and self.sockets[sid].closed:
del self.sockets[sid]

Logs
In logs, I see common pattern: socket either was closed by server, either client itself dropped connection, then this unhandled error happens, when client continue to send GET/POST requests (checked about 15 cases manually).
If my assumption isn't correct, and you need detailed logs, please tell me I will collect it for you.

Idea of bugfix
Seems like it would be just enough to rewrite this part for handling GET/POST:

if sid not in self.sockets:
self._log_error_once('Invalid session ' + sid, 'bad-sid')
r = self._bad_request('Invalid session ' + sid)
else:
socket = self._get_socket(sid)

Handling KeyError as missing socket, then if socket is empty, we will go to the logical branch with bad request, otherwise continue processing request.

Additional context
In our team, we use SocketIO along with FastAPI.
We accept requests from clients with different platforms IOS, Android, Web.
Caught it even on low traffic ~1 rps.
Version: 4.9.1 of EngineIO, but as I see, the same code is till present in main branch.

@miguelgrinberg
Copy link
Owner

I have not removed these exceptions because when these conditions occur it is a clear indication that you have a problem that you need to address. You are implying that this is an internal server issue, but it isn't. This error means that your clients are missing the deadlines to respond to PING packets, or that your server is so busy that it is unable to process the PONG responses from the clients in a timely manner (as I explained many many times in the issues this is discussed). Either way if you see these often then you have a problem that needs to be looked into, and reducing the visibility of these errors may make you less aware of the problem.

But I guess I could be convinced to downgrade this to an error log that does not include the stack trace.

@ishmelev23
Copy link
Author

ishmelev23 commented Dec 6, 2024

But I guess I could be convinced to downgrade this to an error log that does not include the stack trace.

Yes, in general, that's what I'm talking about.

I have not removed these exceptions because when these conditions occur it is a clear indication that you have a problem that you need to address.

In the expected behavior I mentioned, that this issue could be handled by this branch of code

self._log_error_once('Invalid session ' + sid, 'bad-sid')
r = self._bad_request('Invalid session ' + sid)

Here we log the exception and respond with bad request to the client.
This way we keep the issue visible for the client and keep it visible for the developer.
So we don't remove this error, but provide formatting and handling for this error on EngineIO layer.

Is there a specific reason that we need to crash the server by unhandled exception without responding to the client with structured response?

@ishmelev23
Copy link
Author

ishmelev23 commented Dec 6, 2024

Also, the reason, why I'm assuming that this is a bug, because if some time passes, the sid will be deleted from sockets dictionary (because will be garbage collected by service task).
What makes all these requests end up in the branch:

self._log_error_once('Invalid session ' + sid, 'bad-sid')
r = self._bad_request('Invalid session ' + sid)

So the situation is still the same: e.g. client disconnects from server and continue to send requests in closed session.
But reaction is different based on the fact: did service task managed to delete closed socket from sockets dict or didn't.
In the first case, we will properly respond with bad request and log this issue with logger.
In the second case, an unhandled exception will be raised by method handle_request

So it looks like, it is just a race condition, isn't?

@miguelgrinberg
Copy link
Owner

miguelgrinberg commented Dec 6, 2024

You are assuming that the "Invalid session" error has the same purpose, but it is there to cover mainly other problems, such as the server receiving an invalid sid that never existed, or a sid that is owned by a different worker. In these cases the server does not have enough information to provide a better error message.

But when the server receives a sid that was recently closed it can give you a slightly more informative message. You are correct that eventually the server will forget about this sid and if the client continues to send requests they'll be logged with the invalid message, but clients are unlikely to insist anyway, so logging the first (and possibly only) error as a disconnected session makes sense to me, even if it isn't always possible due to timing.

@ishmelev23
Copy link
Author

ishmelev23 commented Dec 6, 2024

Got your idea, thanks.

logging the first (and possibly only) error as a disconnected session makes sense to me, even if it isn't always possible due to timing.

This sounds great. If you need any assistance with pull request, please let me know.

Generally, I think it will be enough, because most probably most of developers set up their monitoring systems and track error rate.
So the spike or high amount of bad request responses will totally attract attention of dev team, then they could check logs and error monitoring system for deeper investigation of this issue.
Crashing the app or responding with 500 error is unnecessary IMHO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants