Channel exceptions and consumer cancellations cause the transport to stop processing messages #843
Comments
Thanks @chrisvanderpennen. We'll take a look and see what we can do about this. Did you encounter this issue in a running system?
We encountered it in a production system yesterday. We were fortunate enough to have a Kubernetes autoscaler watching queue depth, so we only had a mildly degraded system, but the only thing that tipped us off to there being a problem at all was that our telemetry alarm for queues with no consumers kept tripping when the autoscaler terminated the extra instances, despite all replicas showing as healthy. I have no doubt there's a bug in one of our handlers that triggered it, which we'll investigate and resolve separately, but because we don't have access to the consumer from our code, we can't build an interim detection or recovery mechanism without forking the entire library.
Thanks a lot for bringing this up!
We're actively working on a fix for this in #894. Our research suggests that the problematic timeout was introduced in RabbitMQ 3.8.15 with a default value of 15 minutes, and that in 3.8.17 the default was changed to 30 minutes. Can you confirm that you are on 3.8.15 or higher?
Thanks for the update! We're running 3.8.16 in production. I should probably log a job to update...
@chrisvanderpennen we've tweaked the issue description to add some more details |
@chrisvanderpennen @lukasweber We've released NServiceBus.RabbitMQ 6.1.1, so give it a try and let us know if you're still running into problems. |
We were able to push the fix to our production environment last week and it works perfectly and as expected! This gives us some time to figure out the root cause of the issue on our side. Thanks a lot :)
I know this issue is resolved, but it seems that while NSB will now attempt to reconnect, the message Recoverability mechanism doesn't get applied. As a short example: if I have a handler that does an infinite loop or sleep (basically never returns), can I short-circuit that message in the pipeline somehow so it doesn't keep getting rehandled on reconnection?
Hi @NArnott. For the scenario you are describing, where the handlers are truly in an infinite loop or otherwise never complete, that's going to be a problem regardless of which transport you're using. Recoverability would never be able to save you from that. Any endpoint with that problem would consume messages up to the concurrency limit and then stop attempting to process any additional messages. However, if you have handlers that consistently take longer to execute than the configured acknowledgement timeout, but do eventually complete, then that is a scenario that the current transport design does not handle properly. I've opened #927 to track this problem.
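For reference, the concurrency limit mentioned above is configured on the endpoint. A minimal sketch, assuming NServiceBus 7; the endpoint name and the limit of 4 are illustrative:

```csharp
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales"); // endpoint name chosen for illustration

// Every message whose handler never completes permanently occupies one of these
// processing slots; once all slots are stuck, the endpoint stops pulling new messages.
endpointConfiguration.LimitMessageProcessingConcurrencyTo(4);

var endpointInstance = await Endpoint.Start(endpointConfiguration);
```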
Maybe this is something that can be addressed in the next NSB major version with cancellation tokens? Since the handler currently has no way of knowing what that timeout would be, it would be nice to have that token to monitor so we can cancel any long-running process if we know the transport has dropped us.
@NArnott you are correct, and this is something that is already included in the upcoming v8. Is that what you are looking for?
@andreasohlund Yes, I am aware CTs are coming in the next major release. I just wanted to see if they'd also trigger for this RabbitMQ use case, when it drops the connection due to taking too long. |
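For context, here is a rough sketch of what using that cancellation support inside a handler could look like, assuming the v8 API exposes the token on IMessageHandlerContext (check the v8 documentation for the final shape); the message type and the 45-minute delay are illustrative stand-ins:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;

public class LongRunningCommand : ICommand { }

public class LongRunningHandler : IHandleMessages<LongRunningCommand>
{
    public async Task Handle(LongRunningCommand message, IMessageHandlerContext context)
    {
        // Flow the context's cancellation token into long-running work so the handler
        // can bail out when the endpoint is stopping or the transport has given up on it.
        await DoExpensiveWork(context.CancellationToken);
    }

    static Task DoExpensiveWork(CancellationToken cancellationToken) =>
        Task.Delay(TimeSpan.FromMinutes(45), cancellationToken); // stand-in for real work
}
```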
Symptoms
The endpoint stops processing messages with a single error message logged and doesn't resume processing until it is restarted.
Who's affected
All users of the transport are affected by the issue; however, the consumer acknowledgement timeout introduced in RabbitMQ 3.8.15 has increased the likelihood of experiencing the problem. Any handler running longer than the default timeout value (15 minutes in 3.8.15, changed to 30 minutes in 3.8.17) will cause the problem to occur.
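Where handlers legitimately need more time, the broker-side timeout can also be raised. A sketch of the relevant setting, assuming a RabbitMQ version that exposes consumer_timeout in rabbitmq.conf (some 3.8.x releases may require advanced.config instead); the 60-minute value is only an example:

```ini
# rabbitmq.conf: raise the delivery acknowledgement timeout to 60 minutes (value is in milliseconds)
consumer_timeout = 3600000
```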
Root cause
The message pump was handling only connection shutdown events and didn't detect channel or consumer failures, causing the consumer to stop receiving new messages from the broker.
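As an illustration of that gap (not the actual NServiceBus.RabbitMQ fix), a message pump built on RabbitMQ.Client needs to watch channel and consumer events in addition to connection shutdown; the RestartConsumer hook and the queue name below are hypothetical:

```csharp
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class MessagePumpSketch
{
    readonly IConnection connection; // assumed to come from a ConnectionFactory with DispatchConsumersAsync = true
    IModel channel;
    AsyncEventingBasicConsumer consumer;

    public MessagePumpSketch(IConnection connection) => this.connection = connection;

    public void Start()
    {
        channel = connection.CreateModel();
        consumer = new AsyncEventingBasicConsumer(channel);
        consumer.Received += OnMessage;

        // Connection-level shutdown was already handled; channel and consumer
        // failures were the blind spot described in the root cause above.
        channel.ModelShutdown += (sender, args) =>
            RestartConsumer($"Channel shut down: {args.ReplyText}");

        consumer.ConsumerCancelled += (sender, args) =>
        {
            // Raised when the broker cancels the consumer, e.g. after the
            // delivery acknowledgement timeout expires.
            RestartConsumer("Consumer cancelled by the broker");
            return Task.CompletedTask;
        };

        channel.BasicConsume(queue: "service-queue", autoAck: false, consumer: consumer);
    }

    Task OnMessage(object sender, BasicDeliverEventArgs args) => Task.CompletedTask;

    void RestartConsumer(string reason)
    {
        // Hypothetical recovery hook: log the reason, dispose the channel, and
        // recreate the channel and consumer so message processing resumes.
    }
}
```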
Backported to
Original issue content
From the RabbitMQ docs:
While it is unexpected that a handler could take longer than 30 minutes to complete, should a service instance trigger this condition, it will continue running with no indication that there is a problem until it is manually restarted, but it will not receive any further messages from RabbitMQ.
Consumer timeouts are signalled on AsyncEventingBasicConsumer via the ConsumerCancelled event, which is currently not subscribed in MessagePump.
I've posted a gist with a reproduction. To run, assuming Docker for Windows:
Running
docker logs -f rabbitmq
in a second terminal will eventually print a consumer timeout error. The management UI will show 0 consumers for the service queue, while the service continues to run as if nothing is wrong.