Channel exceptions and consumer cancellations cause the transport to stop processing messages #843
Comments
Thanks @chrisvanderpennen. We'll take a look and see what we can do about this. Did you encounter this issue in a running system?
We encountered it in a production system yesterday. We were fortunate enough to have a Kubernetes autoscaler watching queue depth, so we only had a mildly degraded system, but the only thing that tipped us off to there being a problem at all was that our telemetry alarm for queues with no consumers kept tripping when the autoscaler terminated the extra instances, despite all replicas showing as healthy. I have no doubt there's a bug in one of our handlers that triggered it, which we'll investigate and resolve separately, but because we don't have access to the consumer from our code, we can't build an interim detection or recovery mechanism without forking the entire library.
Thanks a lot for bringing this up!
We're actively working on a fix for this in #894. Our research suggests that the problematic timeout was introduced in RabbitMQ 3.8.15 with a default value of 15 minutes, and that in 3.8.17 the default was changed to 30 minutes. Can you confirm that you are on 3.8.15 or higher?
Thanks for the update! We're running 3.8.16 in production. I should probably log a job to update...
@chrisvanderpennen we've tweaked the issue description to add some more details |
@chrisvanderpennen @lukasweber We've released NServiceBus.RabbitMQ 6.1.1, so give it a try and let us know if you're still running into problems. |
We were able to push the fix to our production environment last week and it works perfectly and as expected! This gives us some time to figure out the root cause of the issue on our side. Thanks a lot :)
I know this issue is resolved, but it seems that while NSB will now attempt to reconnect, the message Recoverability mechanism doesn't get applied. As a short example: if I have a handler that does an infinite loop or sleep (basically never returns), can I short-circuit that message in the pipeline somehow so it doesn't keep getting rehandled on reconnection?
Hi @NArnott. For the scenario you are describing, where the handlers are truly in an infinite loop or otherwise never complete, that's going to be a problem regardless of which transport you're using. Recoverability would never be able to save you from that. Any endpoint with that problem would consume messages up to the concurrency limit and then stop attempting to process any additional messages. However, if you have handlers that consistently take longer to execute than the configured acknowledgement timeout, but do eventually complete, then that is a scenario that the current transport design does not handle properly. I've opened #927 to track this problem.
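For reference, the concurrency limit mentioned above is configured on the endpoint. A minimal sketch, assuming NServiceBus 7; the endpoint name and the limit of 4 are illustrative:

```csharp
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales"); // endpoint name chosen for illustration

// Every message whose handler never completes permanently occupies one of these
// processing slots; once all slots are stuck, the endpoint stops pulling new messages.
endpointConfiguration.LimitMessageProcessingConcurrencyTo(4);

var endpointInstance = await Endpoint.Start(endpointConfiguration);
```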
Maybe this is something that can be addressed in the next NSB major version with cancellation tokens? Since the handler currently has no way of knowing what that timeout would be, it would be nice to have that token to monitor so we can cancel any long-running process if we know the transport has dropped us.
@NArnott you are correct, and this is something that is already included in the upcoming v8. Is that what you are looking for?
@andreasohlund Yes, I am aware CTs are coming in the next major release. I just wanted to see if they'd also trigger for this RabbitMQ use case, when it drops the connection due to taking too long. |
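For context, here is a rough sketch of what using that cancellation support inside a handler could look like, assuming the v8 API exposes the token on IMessageHandlerContext (check the v8 documentation for the final shape); the message type and the 45-minute delay are illustrative stand-ins:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;

public class LongRunningCommand : ICommand { }

public class LongRunningHandler : IHandleMessages<LongRunningCommand>
{
    public async Task Handle(LongRunningCommand message, IMessageHandlerContext context)
    {
        // Flow the context's cancellation token into long-running work so the handler
        // can bail out when the endpoint is stopping or the transport has given up on it.
        await DoExpensiveWork(context.CancellationToken);
    }

    static Task DoExpensiveWork(CancellationToken cancellationToken) =>
        Task.Delay(TimeSpan.FromMinutes(45), cancellationToken); // stand-in for real work
}
```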
Symptoms
The endpoint stops processing messages with a single error message logged and doesn't resume processing until it is restarted.
Who's affected
All users of the transport are affected by the issue; however, the consumer acknowledgement timeout introduced in RabbitMQ 3.8.15 has increased the likelihood of experiencing the problem. Any handler running longer than the default timeout value (15 minutes in 3.8.15, changed to 30 minutes in 3.8.17) will cause the problem to occur.
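Where handlers legitimately need more time, the broker-side timeout can also be raised. A sketch of the relevant setting, assuming a RabbitMQ version that exposes consumer_timeout in rabbitmq.conf (some 3.8.x releases may require advanced.config instead); the 60-minute value is only an example:

```ini
# rabbitmq.conf: raise the delivery acknowledgement timeout to 60 minutes (value is in milliseconds)
consumer_timeout = 3600000
```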
Root cause
The message pump was handling only connection shutdown events and didn't detect channel or consumer failures, causing the consumer to stop receiving new messages from the broker.
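As an illustration of that gap (not the actual NServiceBus.RabbitMQ fix), a message pump built on RabbitMQ.Client needs to watch channel and consumer events in addition to connection shutdown; the RestartConsumer hook and the queue name below are hypothetical:

```csharp
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class MessagePumpSketch
{
    readonly IConnection connection; // assumed to come from a ConnectionFactory with DispatchConsumersAsync = true
    IModel channel;
    AsyncEventingBasicConsumer consumer;

    public MessagePumpSketch(IConnection connection) => this.connection = connection;

    public void Start()
    {
        channel = connection.CreateModel();
        consumer = new AsyncEventingBasicConsumer(channel);
        consumer.Received += OnMessage;

        // Connection-level shutdown was already handled; channel and consumer
        // failures were the blind spot described in the root cause above.
        channel.ModelShutdown += (sender, args) =>
            RestartConsumer($"Channel shut down: {args.ReplyText}");

        consumer.ConsumerCancelled += (sender, args) =>
        {
            // Raised when the broker cancels the consumer, e.g. after the
            // delivery acknowledgement timeout expires.
            RestartConsumer("Consumer cancelled by the broker");
            return Task.CompletedTask;
        };

        channel.BasicConsume(queue: "service-queue", autoAck: false, consumer: consumer);
    }

    Task OnMessage(object sender, BasicDeliverEventArgs args) => Task.CompletedTask;

    void RestartConsumer(string reason)
    {
        // Hypothetical recovery hook: log the reason, dispose the channel, and
        // recreate the channel and consumer so message processing resumes.
    }
}
```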
Backported to
Original issue content
From the RabbitMQ docs:
While it is unexpected that a handler could take longer than 30 minutes to complete, should a service instance trigger this condition, it will continue running with no indication that there is a problem until it is manually restarted, but it will not receive any further messages from RabbitMQ.
Consumer timeouts are signalled on AsyncEventingBasicConsumer via the ConsumerCancelled event, which is currently not subscribed in MessagePump.
I've posted a gist with a reproduction. To run, assuming Docker for Windows:
Running
docker logs -f rabbitmq
in a second terminal will eventually print a consumer timeout error. The management UI will show 0 consumers for the service queue, while the service continues to run as if nothing is wrong.