channel#queue can return wrong queue after Timeout::Error #558
If there was a channel operation timeout, we should clear all pending continuations/responses and register no queues. That's it. A PR would be welcome.
What complicates things is that a response might arrive a moment after a timeout was registered by the channel. In that case the response should be thrown away, but it's not easy to tell which request it originated from. For server-named queues, for example, the requested name is always an empty string while the response carries a generated name. Clearing the continuation response before starting a new continuation might be sufficient to avoid confusing behavior in that scenario.
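A minimal standalone sketch of that idea, using a plain Ruby Queue to stand in for the channel's continuation queue (names and structure here are illustrative, not Bunny's actual internals):

```ruby
require "timeout"

# Illustrative only: a toy "channel" whose synchronous operation drains any
# stale replies left over from a previously timed-out call before waiting
# for a fresh one. Names here are made up; this is not Bunny's internal API.
class ToyChannel
  def initialize
    @continuations = Queue.new
  end

  # Simulates the broker replying late (e.g. a DeclareOk arriving after the
  # caller has already raised Timeout::Error).
  def deliver_reply(reply)
    @continuations << reply
  end

  def synchronous_op(timeout_sec)
    # Drop anything left over from an earlier, timed-out operation so that the
    # next reply we pop really belongs to this request.
    @continuations.clear
    Timeout.timeout(timeout_sec) { @continuations.pop }
  end
end
```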
Yeah, I feel like there are probably many scenarios where the reader loop may get a response from another message after a Timeout. Rather than trying to ensure the continuation queues are always correct after various timeouts, I think a higher-level sanity check that the queue.name returned from channel#queue matches the requested name (when the name isn't empty, i.e. not server-named) would be safer. An exception from channel#queue is better than returning the wrong queue.
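In caller code, such a check might look roughly like this (a sketch, not part of Bunny itself; `channel` is assumed to be an open Bunny::Channel):

```ruby
# Caller-side guard: for client-named queues, verify that the object returned
# by channel#queue is really the queue we asked for. Failing loudly here beats
# silently publishing to the wrong queue.
name = "billing.events"
q = channel.queue(name, durable: true)
unless q.name == name
  raise "channel#queue returned #{q.name.inspect}, expected #{name.inspect}"
end
```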
Sorry, but I'm not interested in sanity checks and workarounds. Those can be introduced in your own code. Bunny channels are not supposed to be shared between threads, so it should be possible to clear continuations on timeouts.
Curiously, the Java client now uses a variation of the suggested "sanity check" mechanism: all responses are checked for "compatibility" with the outstanding continuation and the incompatible ones are ignored. So maybe doing that is not such a bad idea after all.
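In rough terms, that filtering amounts to something like the following (illustrative Ruby only, not the actual Java or Bunny implementation):

```ruby
# Remember what kind of reply the outstanding request expects and silently
# drop anything that doesn't match. A real implementation would likely compare
# more than the class (e.g. request-specific fields), since a stale DeclareOk
# has the right class but belongs to a different request.
class ReplyFilter
  def initialize(expected_class)
    @expected_class = expected_class
  end

  def accept?(reply)
    reply.is_a?(@expected_class)
  end
end
```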
@sbonebrake can you please give #566 a try? It seems to do something sensible in the provided repro scenario, but with timeouts, delays and such it's not always obvious what that looks like :)
I just encountered this issue in 2.9 and upgraded to 2.12 to see if I could still reproduce it, and sadly I can, in a setup with a shared connection and a shared channel. So this new logic may be useful in some cases, but (understandably) does not protect you in a situation where, for example, you have:
What happens in this case is similar to what @sbonebrake described, except it all starts with a […]. This occurs in multiple workers until at some point the continuation timeout is reached and they all start throwing Timeouts, which is initially confusing.

The simplest solution for me was to stop sharing the channel across threads, as it's clearly recommended against in Bunny's concurrency documentation, but it does seem like something people may not realize unless they happen upon that article. It took me a few days to question why we were sharing the channel at all; I first thought it was a misguided attempt to be thrifty, but with 30 concurrent workers in production, channel usage can become fairly taxing on a RabbitMQ cluster, which some people definitely may want to avoid.

I don't know if there's a simple way to detect that a channel is shared between more than a single thread and alert users that it's a bad idea, or whether @michaelklishin you'd prefer to follow the example of the Java client and try to handle such out-of-order situations. I'd love to help with repro and implementation if I can.
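For what it's worth, the usual way to avoid the sharing is to give each worker thread its own channel off a single shared connection. A sketch (assumes a RabbitMQ node reachable on localhost with default credentials):

```ruby
require "bunny"

connection = Bunny.new
connection.start

# One channel per worker thread: the connection can be shared, channels cannot.
workers = 4.times.map do |i|
  Thread.new do
    channel = connection.create_channel
    queue   = channel.queue("jobs", durable: true)
    queue.publish("work item from worker #{i}")
    channel.close
  end
end

workers.each(&:join)
connection.close
```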
There is a fundamental problem with sharing a channel. Some protocol methods are synchronous (they expect a response), and a channel cannot have two or more pending operations (the Bunny API encourages method chaining) while still having a clear way of figuring out what the user would expect. Heck, most users have no idea how some concurrent scenarios should work; they want the maintainer to pull a magic rabbit (no pun intended) out of the hat. What the Java client does, and what this client does for […]

Don't share channels between threads. If you don't read the docs and run into issues that have been documented for years, you can only blame yourself.
Reproduced with Bunny 2.11.0 and RabbitMQ 3.7.6
channel#queue will return the wrong queue if called after a Timeout::Error occurred during a previous #queue call.
We hit this issue when we started getting messages enqueued in the wrong queues. In our logs we found a Timeout::Error with this call stack:
Our code does not dispose of the Session when this occurs. In this case, the next call to Bunny is another channel#queue for a different queue. During this second call, the wait_on_continuations call returns the DeclareOk from the first call. This causes the first queue to be incorrectly returned from channel#queue, even though the channel's @queues will contain the correct mapping.
Order of events:
Because of steps 11 and 12, subsequent calls to channel.queue("Bob") will be correct. Because of step 9, the first call to any queue will return the previously declared queue. In our case we declare around 30 queues in a row, and the first enqueue to each of them ended up publishing to the wrong queue.
There is a reproduction of this here: https://github.com/bmorton/bunny_repro
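A condensed sketch of the scenario (hypothetical code along the same lines as that repository; the option value below is only an example):

```ruby
require "bunny"

# The continuation timeout is deliberately tiny (milliseconds) to make the
# first declaration give up early; actually hitting the bug also requires the
# broker's DeclareOk to arrive late (e.g. via injected network latency), which
# plain client code cannot force on its own.
conn = Bunny.new(continuation_timeout: 100)
conn.start
ch = conn.create_channel

begin
  ch.queue("alice")
rescue Timeout::Error
  # The late DeclareOk for "alice" is still sitting in the channel's
  # continuation queue at this point.
end

bob = ch.queue("bob")
# On affected versions this can print "alice": the stale DeclareOk from the
# first call is consumed by the second one.
puts bob.name
```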
Debug Logs:
The Bunny ChangeLog says automatic recovery isn't started on Timeout::Error because "It will be started in the network activity loop anyway". However, the reader loop will never see a Timeout::Error if the continuation_timeout (default 15s) is less than read_timeout and the DeclareOk arrives after continuation_timeout but before read_timeout.
https://github.com/ruby-amqp/bunny/blob/master/ChangeLog.md#time-bound-continuations
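To make the timing relationship concrete, these are the two knobs involved (option names as used by Bunny, to the best of my understanding; the values are just an example of the problematic ordering):

```ruby
require "bunny"

# continuation_timeout (milliseconds) bounds how long a synchronous call such
# as channel#queue waits for its reply; read_timeout (seconds) bounds socket
# reads in the reader loop. With 15s < 30s, a DeclareOk arriving at, say,
# t = 20s comes after the caller has already raised Timeout::Error but is a
# perfectly normal frame to the reader loop, so recovery is never triggered there.
conn = Bunny.new(continuation_timeout: 15_000,
                 read_timeout: 30)
conn.start
```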
There are several approaches to fixing this, but the simplest may be to register the queue from any unexpected DeclareOk and try again.
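As a caller-side approximation of that idea (a hypothetical helper, not the change proposed for Bunny's internals):

```ruby
# If a stale DeclareOk causes channel#queue to hand back a queue whose name
# doesn't match the one requested, declare again. Only meaningful for
# client-named queues, since server-named queues are requested with "".
def declare_queue_with_retry(channel, name, opts = {}, attempts = 2)
  attempts.times do
    q = channel.queue(name, opts)
    return q if name.empty? || q.name == name
  end
  raise "could not declare #{name.inspect}: kept receiving stale replies"
end

# Usage (channel is assumed to be an open Bunny::Channel):
queue = declare_queue_with_retry(channel, "billing.events", { durable: true })
```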