
Prometheus: expose a metric for network partition observed on the node #2508

Closed
Artimi opened this issue Nov 1, 2019 · 17 comments · Fixed by #9465

@Artimi

Artimi commented Nov 1, 2019

Hi,
I'm happy that RabbitMQ got its own Prometheus exporter. 👍 also for the prepared Grafana dashboard, it looks great.
I was using the deadtrickster prometheus rabbitmq exporter before. It allowed me to monitor network partitions with its rabbitmq_node_up metric (https://github.com/deadtrickster/prometheus_rabbitmq_exporter#nodes), because it shows whether one node is connected to another. I have three nodes in my cluster for now, and thus I can write an alert:

sum(rabbitmq_node_up{app="rabbitmq"}) != 9

to check that each RabbitMQ node is connected to the other two; when it is not, I'm alerted. Would it be possible to add something similar to this plugin?
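For reference, a query like the one above could be wired into a Prometheus alerting rule along these lines. This is only a sketch based on the deadtrickster exporter's metric; the group name, alert name, and labels are illustrative, not part of any shipped configuration:

```yaml
groups:
  - name: rabbitmq-cluster
    rules:
      - alert: RabbitMQNodeDisconnected
        # 3 nodes each reporting visibility of 3 nodes => 9 samples when healthy.
        expr: sum(rabbitmq_node_up{app="rabbitmq"}) != 9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "At least one RabbitMQ node cannot see all of its peers"
```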

@michaelklishin
Member

There is a separate Erlang distribution exporter and Grafana dashboard. It provides a lot more information than a boolean or an integer gauge.

@michaelklishin
Member

Whether a node is "connected" is not very clear-cut. Network links can slow down, become permanently saturated, and so on. These new dashboards are significantly more descriptive and advanced:

Erlang Distribution Links

Erlang Distribution Traffic

@gerhard can explain how to provision them, but I believe the above panels demonstrate that this is a much better approach.

@gerhard
Contributor

gerhard commented Nov 1, 2019

While showing network partitions from RabbitMQ's perspective would be helpful, it is not as helpful as understanding what happens at the Erlang Distribution level. This is the layer within the Erlang VM (RabbitMQ's runtime), on which all RabbitMQ communication depends. For example, links between RabbitMQ nodes are bi-directional, may have TLS enabled, and may be switching between different states. The Erlang-Distribution dashboard captures all this information. Without understanding what is happening in this context, a RabbitMQ partition is worth knowing about, for sure, but it doesn't contain sufficient detail to understand why it happened.

I can appreciate that the Erlang-Distribution dashboard contains more information than you would expect, and I can see how being able to see network partitions on the RabbitMQ-Overview dashboard would be helpful. I will re-open this issue since I would like to add this on RabbitMQ-Overview, and most likely link to Erlang-Distribution from RabbitMQ-Overview for those that want to dig deeper.

While the Erlang-Distribution dashboard is not yet uploaded to the RabbitMQ Grafana org, you can get an import-friendly version by running the following make target in the root of this repo, i.e.:

cd github.com/rabbitmq/rabbitmq-prometheus
make Erlang-Distribution.json 
{
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "6.0.0"
    },
...
  "timezone": "",
  "title": "Erlang-Distribution",
  "uid": "d-SFCCmZz",
  "version": 1,
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ]
}

There will be a recording of Observe & Understand RabbitMQ from RabbitMQ Summit on YouTube later this month, which will have more detail on this dashboard, as well as others. There will also be a RabbitMQ webinar on the 12th of December on the same topic. If you want to join, follow our Twitter, where we will post more info.

@gerhard gerhard reopened this Nov 1, 2019
@Artimi
Author

Artimi commented Nov 4, 2019

Thank you @michaelklishin for pointing me to the Erlang - Distribution dashboard. And thank you @gerhard for taking care of this issue. I was working only with the RabbitMQ - Overview dashboard and wasn't aware of Erlang - Distribution. I took the dashboard from https://github.com/rabbitmq/rabbitmq-prometheus/tree/master/docker/grafana/dashboards and it seems really informative. I will prepare my alerts according to the queries on the dashboard.

@michaelklishin
Member

@gerhard @Artimi when should we consider this issue to be resolved?

@gerhard
Contributor

gerhard commented Nov 5, 2019

We can close this issue when we have exposed a network partition metric from RabbitMQ's perspective, and when this metric is displayed on the RabbitMQ-Overview dashboard. RabbitMQ can still consider itself partitioned while everything looks healthy from the Erlang Distribution perspective. To account for this scenario, operators will need to take both perspectives into account, meaning both metrics, to be certain that it's a genuine network partition and not something that can be resolved by restarting the partitioned RabbitMQ node so that it can re-join the cluster and resume service.

@sventschui

Disclaimer: I'm totally new to Erlang and RabbitMQ, so sorry for the noob questions right away.

I might be able to invest some hours into a PR to implement this. From browsing the code, I suspect one approach would be to call rabbit_mnesia:partitions/0 and expose the length of the partitions list as rabbitmq_partitions_count. Does that sound reasonable?

@michaelklishin
Member

michaelklishin commented Apr 11, 2020 via email

@gerhard gerhard transferred this issue from rabbitmq/rabbitmq-prometheus Nov 13, 2020
@deadtrickster deadtrickster self-assigned this Aug 18, 2021
@ggustafsson

Is someone looking at this? We have been bitten by this a few times at work now, so I would really like to see this implemented. I wanted to fix it myself, but after looking at the code I quickly realized there is just no way I could do this in a reasonable time, because I have no prior knowledge of Erlang. Any help would be greatly appreciated!

@michaelklishin
Member

Those who need this metric are welcome to look into it. Since this is a piece of node-local state, it can be a list of nodes the reporting node sees as disconnected. Visualising and alerting would be easier if this list were simply checked for emptiness, but for extra context it should be reported as a list of [observed as] unavailable peers.

@lukebakken lukebakken self-assigned this Jun 2, 2023
@michaelklishin michaelklishin changed the title Expose metric for a network partition Prometheus: expose a metric for network partition observed on the node Jul 5, 2023
@frittentheke

I just had a RabbitMQ node of a three-node cluster split off into a partition. The management interface reported:

Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. There is a risk of losing data. Please read https://www.rabbitmq.com/partitions.html

as did rabbitmqctl cluster_status:

Network Partitions
Node rabbit@node1 cannot communicate with rabbit@node2, rabbit@node3

I have all dashboards available but was unable to spot any indication of this partitioning.
When searching through this repo to find the proper metric to be alerted on, I found this issue.

Am I reading this correctly, that there is currently no metric available that would have allowed me to be alerted on such a massive disruption in the RabbitMQ clustering?

@gomoripeti
Contributor

I am happy to have a look.
If I see it correctly, Prometheus does not support lists of strings as values.
Would the following format be acceptable:
rabbitmq_partitioned{from="rabbit@<host>"} 1 for each element in the partitions list?

If there is no netsplit, no such metric is returned.
An alarm could be configured for something like sum(rabbitmq_partitioned) > 0
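To illustrate, here is a minimal Python sketch of the exposition format proposed above, assuming a node-local list of the peers this node is partitioned from. The function name is illustrative; the real implementation would live in the plugin's Erlang collector:

```python
def render_partitioned_metric(partitions):
    """Render the proposed rabbitmq_partitioned metric in Prometheus text
    exposition format: one sample per peer the node is partitioned from,
    and no samples at all when the node observes no partition."""
    if not partitions:
        return ""
    lines = ["# TYPE rabbitmq_partitioned gauge"]
    for peer in partitions:
        lines.append('rabbitmq_partitioned{from="%s"} 1' % peer)
    return "\n".join(lines)

print(render_partitioned_metric(["rabbit@node2", "rabbit@node3"]))
```

With this shape, sum(rabbitmq_partitioned) > 0 fires on any node that reports at least one partitioned-from peer, and an empty scrape means no netsplit was observed.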

@ggustafsson

I am happy to have a look. If I see it correctly, Prometheus does not support lists of strings as values. Would the following format be acceptable: rabbitmq_partitioned{from="rabbit@<host>"} 1 for each element in the partitions list?

If there is no netsplit, no such metric is returned. An alarm could be configured for something like sum(rabbitmq_partitioned) > 0

That looks perfect to me! Definitely something that should be added to the Grafana dashboard afterwards.

@michaelklishin
Member

#9465 introduces a very similar metric that should be enough, and will be forward compatible with 3.13 and 4.0: the number of unreachable cluster peers.

E.g. with three nodes running in a cluster of five, or with two nodes disconnected from their peers, you'd get:

# TYPE rabbitmq_unreachable_cluster_peers_count gauge
# HELP rabbitmq_unreachable_cluster_peers_count Number of peers in the cluster the current node cannot reach.
rabbitmq_unreachable_cluster_peers_count 2
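An alert on this metric could then look like the following sketch. The group name, alert name, threshold, and labels are illustrative, not a shipped configuration:

```yaml
groups:
  - name: rabbitmq-partitions
    rules:
      - alert: RabbitMQUnreachableClusterPeers
        # Per-node metric: fires for any node that cannot reach one or more peers.
        expr: rabbitmq_unreachable_cluster_peers_count > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} cannot reach {{ $value }} cluster peer(s)"
```

Because the metric is node-local, the rule must be evaluated against every node's scrape, not aggregated away across the cluster.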

@truong-hua

truong-hua commented Feb 29, 2024

@michaelklishin the unreachable peer count is not the same as a network partition. I have a cluster which is under network partitioning, with a warning that it may cause data loss, but rabbitmq_unreachable_cluster_peers_count is always zero because, in the cluster of 3, only the connection between two of the nodes is broken. I'm using 3.12.9

@michaelklishin
Member

It is sufficient to determine whether or not there is a partition. Obviously this is a per-node metric that must be monitored on all nodes.

There isn't a magical metric, not derived from the unreachable peer count, that would work better. Ask yourself how you would determine that a node has lost its connection to any peers. Voilà, that's exactly what we offer as a per-node metric.

@rabbitmq rabbitmq locked and limited conversation to collaborators Feb 29, 2024
@michaelklishin
Member

michaelklishin commented Feb 29, 2024

Our team has no plans of going back to the way partitions were reported in the "pre-Prometheus" era.

This metric is per node and not per cluster, but so are all Prometheus metrics in general, for a fairly obvious reason. These metrics allow you to detect a partition for alerting or Grafana visualization purposes. Time to move on from trying to convince the core team that somehow the old metric was superior. It was not, and it does not fit the Prometheus scraping approach, where each node only reports its own metrics and other tools such as Grafana can aggregate them to produce a cluster-wide view.
