Nodes getting killed by OOM after upgrade to 3.13.7 #12401

m0n5t3r · 2024-09-27T19:11:15Z

Community Support Policy

I have read RabbitMQ's Community Support Policy

RabbitMQ version used

3.13.7

How is RabbitMQ deployed?

Debian package

Steps to reproduce the behavior in question

Hello all

After a recent update to 3.13.7 (from 3.10.something, going through 3.11 because I guess feature flags are a thing now...), one of our nodes started getting killed by OOM on every deploy. The setup: 3 node cluster, all queues mirrored, AWS c6.xlarge instances. One queue has about 70k deferred messages from celery (which means the consumer holds them in memory without acknowledging them until the specified time). Every deploy necessarily restarts the celery consumers and the application producers, about 300 connections in total. The nodes are accessed via DNS round robin.

After some further flailing, which involved bumping up the instance size and adding IOPS to the storage, it now looks like two of the instances explode on every deploy. Incidentally, the ones that blow up now are the ones that aren't primaries of that one queue with the deferred messages: they go from 500-ish MB to $machine_ram in seconds and get killed by OOM.

We're going to switch to quorum queues eventually, but 1) we just found out they exist, and 2) the version of celery we use doesn't support them. Apart from going back to 3.10 (which is non-trivial on a production cluster), what options do we have?

I did find a discussion detailing a similar behavior, but it was rabbitmq 3.6 and he was told to upgrade...

michaelklishin (Maintainer) · 2024-09-27T21:11:14Z

3.13.x is out of community support.

The memory usage profile of your node is entirely dependent on the workload. Modern quorum queues have a certain footprint which is stable and streams have a very minimal footprint.

Classic queues v2 (CQv2) behave like lazy queues, meaning they keep data in memory for only a short period.

Relevant doc guides:

And sorry to respond this way but quorum queues have been around since 2018.

michaelklishin (Maintainer) · 2024-09-27T21:11:44Z

@m0n5t3r start with collecting metrics and switching to CQv2 on RabbitMQ 4.0.2.
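As a rough illustration of the "collecting metrics" step, here is a minimal Python sketch that summarises per-node memory from a response shaped like the management HTTP API's `GET /api/nodes` payload. The field names `name`, `mem_used`, `mem_limit` and `mem_alarm` are assumed to be present in that payload; the sample data and the `memory_report` helper are hypothetical, and fetching the JSON (with authentication) is left out.

```python
import json


def memory_report(nodes):
    """Summarise memory use per node from a parsed /api/nodes-style list."""
    report = []
    for node in nodes:
        used = node["mem_used"]
        limit = node["mem_limit"]
        report.append({
            "name": node["name"],
            "used_mib": round(used / 2**20, 1),
            # Fraction of the configured memory limit currently in use.
            "ratio": round(used / limit, 3) if limit else None,
            "alarm": bool(node.get("mem_alarm", False)),
        })
    return report


# Hypothetical sample, not real cluster output.
sample = json.loads("""[
  {"name": "rabbit@node1", "mem_used": 524288000,
   "mem_limit": 3355443200, "mem_alarm": false}
]""")

for row in memory_report(sample):
    print(row)
```

Polling this during a deploy and logging the rows would show which node's usage climbs and how quickly, which is the kind of data a follow-up report needs.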

michaelklishin (Maintainer) · 2024-09-27T21:23:05Z

…and Memory Footprint with Classic Queues (v2 specifically).
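For readers wondering what "switching to CQv2" might look like on a release that still honours the `queue-version` policy key, the sketch below builds (but does not send) a management HTTP API request that sets it. The base URL, the policy name `cqv2`, the pattern `^celery` and the helper `cqv2_policy_request` are all hypothetical, and authentication is omitted. Only one policy applies to a queue at a time, so on a cluster that already mirrors queues via a policy, the `queue-version` key would presumably need to be merged into that existing policy's definition rather than added as a separate policy.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def cqv2_policy_request(base_url, vhost, name, pattern):
    """Build a PUT /api/policies/{vhost}/{name} request selecting CQv2."""
    body = {
        "pattern": pattern,
        "definition": {"queue-version": 2},
        "apply-to": "queues",
        "priority": 0,
    }
    # The vhost must be percent-encoded, so "/" becomes "%2F".
    url = (f"{base_url}/api/policies/"
           f"{quote(vhost, safe='')}/{quote(name, safe='')}")
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values; the request is only constructed, never sent.
req = cqv2_policy_request("http://localhost:15672", "/", "cqv2", "^celery")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```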

mkuratczyk (Maintainer) · 2024-09-30T05:41:14Z

I'd add that you might be interested in setting mirroring_sync_max_throughput, see this PR for details: #3925.
