AsyncServer.handle_request throws unhandled KeyError exceptions #391

ishmelev23 · 2024-12-06T13:12:04Z

I saw this issue in SocketIO issues section before, but by some reason it wasn't considered a bug.

Describe the bug
When Socket is closed, but hasn't been deleted from memory of BaseServer, it causes KeyError('Session is disconnected`) exceptions which are unhandled by EngineIO or SocketIO servers.
The exception is raised from this method:

python-engineio/src/engineio/base_server.py

Lines 227 to 229 in a6e5d92

    
           if s.closed: 
        
               del self.sockets[sid] 
        
               raise KeyError('Session is disconnected')

when client continue to send polling GET requests and it ends up here (probably POST as well, but we didn't notice errors related to POST)

python-engineio/src/engineio/async_server.py

Lines 286 to 291 in a6e5d92

    
           if sid not in self.sockets: 
        
               self._log_error_once('Invalid session ' + sid, 'bad-sid') 
        
               r = self._bad_request('Invalid session ' + sid) 
        
           else: 
        
               socket = self._get_socket(sid) 
        
               if self.transport(sid) != transport and \

In some cases it's expected behavior and handled properly, like

python-engineio/src/engineio/async_server.py

Lines 114 to 120 in a6e5d92

    
           try: 
        
               socket = self._get_socket(sid) 
        
           except KeyError: 
        
               # the socket is not available 
        
               self.logger.warning('Cannot send to sid %s', sid) 
        
               return 
        
           await socket.send(pkt)

and

python-engineio/src/engineio/async_server.py

Lines 189 to 197 in a6e5d92

    
           try: 
        
               socket = self._get_socket(sid) 
        
           except KeyError:  # pragma: no cover 
        
               # the socket was already closed or gone 
        
               pass 
        
           else: 
        
               await socket.close() 
        
               if sid in self.sockets:  # pragma: no cover 
        
                   del self.sockets[sid]

Apparently, it's some kind of race condition when the socket has been closed, but wasn't deleted by service task

python-engineio/src/engineio/async_server.py

Lines 548 to 556 in a6e5d92

    
           for s in self.sockets.copy().values(): 
        
               if s.closed: 
        
                   try: 
        
                       del self.sockets[s.sid] 
        
                   except KeyError: 
        
                       # the socket could have also been removed by 
        
                       # the _get_socket() method from another thread 
        
                       pass 
        
               elif not s.closing:

The issue is that: the server shouldn't raise an unhandled exception when the client sends requests which are not consistent with the server's internal state.
Right now we handle it by ourselves, but I believe that exceptions related to server's internal state should be managed by server itself.

To Reproduce
Hard to say how to reproduce it in e2e manner, because it just happens in production, and it's flaky, I didn't find a reason, why the closed Socket stays in sockets dictionary of AsyncServer.

But it could be reproduced in unit tests:

Start connection.
Close connection from client side (so the socket will be closed on the server side).
Immediately send polling request to the server.

Expected behavior
In that case, I expect that request will end up in branch and I get BadRequest error from server:

python-engineio/src/engineio/async_server.py

Lines 287 to 288 in a6e5d92

    
           self._log_error_once('Invalid session ' + sid, 'bad-sid') 
        
           r = self._bad_request('Invalid session ' + sid)

Bot not in

python-engineio/src/engineio/async_server.py

Lines 289 to 309 in a6e5d92

    
           else: 
        
               socket = self._get_socket(sid) 
        
               if self.transport(sid) != transport and \ 
        
                       transport != upgrade_header: 
        
                   self._log_error_once( 
        
                       'Invalid transport for session ' + sid, 
        
                       'bad-transport') 
        
                   r = self._bad_request('Invalid transport') 
        
               else: 
        
                   try: 
        
                       packets = await socket.handle_get_request(environ) 
        
                       if isinstance(packets, list): 
        
                           r = self._ok(packets, jsonp_index=jsonp_index) 
        
                       else: 
        
                           r = packets 
        
                   except exceptions.EngineIOError: 
        
                       if sid in self.sockets:  # pragma: no cover 
        
                           await self.disconnect(sid) 
        
                       r = self._bad_request() 
        
                   if sid in self.sockets and self.sockets[sid].closed: 
        
                       del self.sockets[sid]

Logs
In logs, I see common pattern: socket either was closed by server, either client itself dropped connection, then this unhandled error happens, when client continue to send GET/POST requests (checked about 15 cases manually).
If my assumption isn't correct, and you need detailed logs, please tell me I will collect it for you.

Idea of bugfix
Seems like it would be just enough to rewrite this part for handling GET/POST:

python-engineio/src/engineio/async_server.py

Lines 286 to 290 in a6e5d92

    
           if sid not in self.sockets: 
        
               self._log_error_once('Invalid session ' + sid, 'bad-sid') 
        
               r = self._bad_request('Invalid session ' + sid) 
        
           else: 
        
               socket = self._get_socket(sid)

Handling KeyError as missing socket, then if socket is empty, we will go to the logical branch with bad request, otherwise continue processing request.

Additional context
In our team, we use SocketIO along with FastAPI.
We accept requests from clients with different platforms IOS, Android, Web.
Caught it even on low traffic ~1 rps.
Version: 4.9.1 of EngineIO, but as I see, the same code is till present in main branch.

miguelgrinberg · 2024-12-06T15:46:02Z

I have not removed these exceptions because when these conditions occur it is a clear indication that you have a problem that you need to address. You are implying that this is an internal server issue, but it isn't. This error means that your clients are missing the deadlines to respond to PING packets, or that your server is so busy that it is unable to process the PONG responses from the clients in a timely manner (as I explained many many times in the issues this is discussed). Either way if you see these often then you have a problem that needs to be looked into, and reducing the visibility of these errors may make you less aware of the problem.

But I guess I could be convinced to downgrade this to an error log that does not include the stack trace.

ishmelev23 · 2024-12-06T16:06:32Z

But I guess I could be convinced to downgrade this to an error log that does not include the stack trace.

Yes, in general, that's what I'm talking about.

I have not removed these exceptions because when these conditions occur it is a clear indication that you have a problem that you need to address.

In the expected behavior I mentioned, that this issue could be handled by this branch of code

python-engineio/src/engineio/async_server.py

Lines 287 to 288 in a6e5d92

    
           self._log_error_once('Invalid session ' + sid, 'bad-sid') 
        
           r = self._bad_request('Invalid session ' + sid)

Here we log the exception and respond with bad request to the client.
This way we keep the issue visible for the client and keep it visible for the developer.
So we don't remove this error, but provide formatting and handling for this error on EngineIO layer.

Is there a specific reason that we need to crash the server by unhandled exception without responding to the client with structured response?

ishmelev23 · 2024-12-06T16:17:49Z

Also, the reason, why I'm assuming that this is a bug, because if some time passes, the sid will be deleted from sockets dictionary (because will be garbage collected by service task).
What makes all these requests end up in the branch:

python-engineio/src/engineio/async_server.py

Lines 287 to 288 in a6e5d92

    
           self._log_error_once('Invalid session ' + sid, 'bad-sid') 
        
           r = self._bad_request('Invalid session ' + sid)

So the situation is still the same: e.g. client disconnects from server and continue to send requests in closed session.
But reaction is different based on the fact: did service task managed to delete closed socket from sockets dict or didn't.
In the first case, we will properly respond with bad request and log this issue with logger.
In the second case, an unhandled exception will be raised by method handle_request

So it looks like, it is just a race condition, isn't?

miguelgrinberg · 2024-12-06T16:32:16Z

You are assuming that the "Invalid session" error has the same purpose, but it is there to cover mainly other problems, such as the server receiving an invalid sid that never existed, or a sid that is owned by a different worker. In these cases the server does not have enough information to provide a better error message.

But when the server receives a sid that was recently closed it can give you a slightly more informative message. You are correct that eventually the server will forget about this sid and if the client continues to send requests they'll be logged with the invalid message, but clients are unlikely to insist anyway, so logging the first (and possibly only) error as a disconnected session makes sense to me, even if it isn't always possible due to timing.

ishmelev23 · 2024-12-06T17:02:19Z

Got your idea, thanks.

logging the first (and possibly only) error as a disconnected session makes sense to me, even if it isn't always possible due to timing.

This sounds great. If you need any assistance with pull request, please let me know.

Generally, I think it will be enough, because most probably most of developers set up their monitoring systems and track error rate.
So the spike or high amount of bad request responses will totally attract attention of dev team, then they could check logs and error monitoring system for deeper investigation of this issue.
Crashing the app or responding with 500 error is unnecessary IMHO.

miguelgrinberg added the question label Dec 6, 2024

miguelgrinberg closed this as completed in 55a9e46 Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AsyncServer.handle_request throws unhandled KeyError exceptions #391

AsyncServer.handle_request throws unhandled KeyError exceptions #391

ishmelev23 commented Dec 6, 2024

miguelgrinberg commented Dec 6, 2024

ishmelev23 commented Dec 6, 2024 •

edited

Loading

ishmelev23 commented Dec 6, 2024 •

edited

Loading

miguelgrinberg commented Dec 6, 2024 •

edited

Loading

ishmelev23 commented Dec 6, 2024 •

edited

Loading

AsyncServer.handle_request throws unhandled KeyError exceptions #391

AsyncServer.handle_request throws unhandled KeyError exceptions #391

Comments

ishmelev23 commented Dec 6, 2024

miguelgrinberg commented Dec 6, 2024

ishmelev23 commented Dec 6, 2024 • edited Loading

ishmelev23 commented Dec 6, 2024 • edited Loading

miguelgrinberg commented Dec 6, 2024 • edited Loading

ishmelev23 commented Dec 6, 2024 • edited Loading

ishmelev23 commented Dec 6, 2024 •

edited

Loading

ishmelev23 commented Dec 6, 2024 •

edited

Loading

miguelgrinberg commented Dec 6, 2024 •

edited

Loading

ishmelev23 commented Dec 6, 2024 •

edited

Loading