
Prometheus: expose a metric for network partition observed on the node #2508

Closed
Artimi opened this issue Nov 1, 2019 · 17 comments · Fixed by #9465

@Artimi

Artimi commented Nov 1, 2019

Hi,
I'm happy that RabbitMQ got its own Prometheus exporter. 👍 also for the prepared Grafana dashboard, it looks great.
I was using the deadtrickster prometheus rabbitmq exporter before. It allowed me to monitor network partitions with its rabbitmq_node_up metric (https://github.com/deadtrickster/prometheus_rabbitmq_exporter#nodes), because it shows whether one node is connected to another. I have three nodes in my cluster for now, and thus I can write an alert:

sum(rabbitmq_node_up{app="rabbitmq"}) != 9

to check that each RabbitMQ node is connected to the other two; when it is not, I'm alerted. Would it be possible to add something similar to this plugin?
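For reference, a query like the one above could be wired into a Prometheus alerting rule along these lines. This is only a sketch based on the deadtrickster exporter's metric; the group name, alert name, and labels are illustrative, not part of any shipped configuration:

```yaml
groups:
  - name: rabbitmq-cluster
    rules:
      - alert: RabbitMQNodeDisconnected
        # 3 nodes each reporting visibility of 3 nodes => 9 samples when healthy.
        expr: sum(rabbitmq_node_up{app="rabbitmq"}) != 9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "At least one RabbitMQ node cannot see all of its peers"
```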

@michaelklishin
Member

There is a separate Erlang distribution exporter and Grafana dashboard. It provides a lot more information than a boolean or an integer gauge.

@michaelklishin
Member

Whether a node is "connected" is not very clear-cut. Network links can slow down, become permanently saturated, and so on. These new dashboards are significantly more descriptive and advanced:

Erlang Distribution Links

Erlang Distribution Traffic

@gerhard can explain how to provision them, but I believe the above panels demonstrate that this is a much better approach.

@gerhard
Contributor

gerhard commented Nov 1, 2019

While showing network partitions from RabbitMQ's perspective would be helpful, it is not as helpful as understanding what happens at the Erlang Distribution level. This is the layer within the Erlang VM (RabbitMQ's runtime), on which all RabbitMQ communication depends. For example, links between RabbitMQ nodes are bi-directional, may have TLS enabled, and may be switching between different states. The Erlang-Distribution dashboard captures all this information. Without understanding what is happening in this context, a RabbitMQ partition is worth knowing about, for sure, but it doesn't contain sufficient detail to understand why it happened.

I can appreciate that the Erlang-Distribution dashboard contains more information than you would expect, and I can see how being able to see network partitions on the RabbitMQ-Overview dashboard would be helpful. I will re-open this issue since I would like to add this on RabbitMQ-Overview, and most likely link to Erlang-Distribution from RabbitMQ-Overview for those that want to dig deeper.

While the Erlang-Distribution dashboard is not yet uploaded to the RabbitMQ Grafana org, you can get an import-friendly version by running the following make target in the root of this repo, i.e.:

cd github.com/rabbitmq/rabbitmq-prometheus
make Erlang-Distribution.json 
{
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "6.0.0"
    },
...
  "timezone": "",
  "title": "Erlang-Distribution",
  "uid": "d-SFCCmZz",
  "version": 1,
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ]
}

There will be a recording of Observe & Understand RabbitMQ from RabbitMQ Summit on YouTube later this month, which will have more detail on this dashboard, as well as others. There will also be a RabbitMQ webinar on the 12th of December on the same topic. If you want to join, follow our Twitter, where we will post more info.

@gerhard gerhard reopened this Nov 1, 2019
@Artimi
Author

Artimi commented Nov 4, 2019

Thank you @michaelklishin for pointing me to the Erlang - Distribution dashboard. And thank you @gerhard for taking care of this issue. I was working only with the RabbitMQ - Overview dashboard and wasn't aware of Erlang - Distribution. I took the dashboard from https://github.com/rabbitmq/rabbitmq-prometheus/tree/master/docker/grafana/dashboards and it seems really informative. I will prepare my alerts according to the queries on the dashboard.

@michaelklishin
Member

@gerhard @Artimi when should we consider this issue to be resolved?

@gerhard
Contributor

gerhard commented Nov 5, 2019

We can close this issue when we have exposed a network partition metric from RabbitMQ's perspective, and when this metric is displayed on the RabbitMQ-Overview dashboard. RabbitMQ can still consider itself partitioned while everything looks healthy from the Erlang Distribution perspective. To account for this scenario, operators will need to take both perspectives into account, meaning both metrics, to be certain that it's a genuine network partition and not something that can be resolved by restarting the partitioned RabbitMQ node so that it can re-join the cluster and resume service.

@sventschui

Disclaimer: I'm totally new to Erlang and RabbitMQ, so sorry for the noob questions right away.

I might be able to invest some hours into a PR to implement this. From browsing the code, I suspect one approach would be to call rabbit_mnesia:partitions/0 and expose the length of the partitions list as rabbitmq_partitions_count. Does that sound reasonable?

@michaelklishin
Member

michaelklishin commented Apr 11, 2020 via email

@gerhard gerhard transferred this issue from rabbitmq/rabbitmq-prometheus Nov 13, 2020
@deadtrickster deadtrickster self-assigned this Aug 18, 2021
@ggustafsson

Is someone looking at this? We have been bitten by this a few times at work now, so I would really like to see this implemented. I wanted to fix it myself, but after looking at the code I quickly realized there is just no way I could do this in a reasonable time, because I have no prior knowledge of Erlang. Any help would be greatly appreciated!

@michaelklishin
Member

Those who need this metric are welcome to look into it. Since this is a piece of node-local state, it can be a list of nodes the reporting node sees as disconnected. Visualising and alerting would be easier if this list were simply checked for emptiness, but for extra context it should be reported as a list of [observed as] unavailable peers.

@lukebakken lukebakken self-assigned this Jun 2, 2023
@michaelklishin michaelklishin changed the title Expose metric for a network partition Prometheus: expose a metric for network partition observed on the node Jul 5, 2023
@frittentheke

I just had a RabbitMQ node of a three-node cluster split off into a partition. The management interface reported:

Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. There is a risk of losing data. Please read https://www.rabbitmq.com/partitions.html

as did rabbitmqctl cluster_status:

Network Partitions
Node rabbit@node1 cannot communicate with rabbit@node2, rabbit@node3

I have all dashboards available but was unable to spot any indication of this partitioning.
When searching through this repo to find the proper metric to be alerted on, I found this issue.

Am I reading this correctly, that there is currently no metric available that would have allowed me to be alerted on such a massive disruption in the RabbitMQ clustering?

@gomoripeti
Contributor

I am happy to have a look.
If I see it correctly, Prometheus does not support lists of strings as values.
Would the following format be acceptable:
rabbitmq_partitioned{from="rabbit@<host>"} 1 for each element in the partitions list?

If there is no netsplit, no such metric is returned.
An alarm could be configured for something like sum(rabbitmq_partitioned) > 0
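To illustrate, here is a minimal Python sketch of the exposition format proposed above, assuming a node-local list of the peers this node is partitioned from. The function name is illustrative; the real implementation would live in the plugin's Erlang collector:

```python
def render_partitioned_metric(partitions):
    """Render the proposed rabbitmq_partitioned metric in Prometheus text
    exposition format: one sample per peer the node is partitioned from,
    and no samples at all when the node observes no partition."""
    if not partitions:
        return ""
    lines = ["# TYPE rabbitmq_partitioned gauge"]
    for peer in partitions:
        lines.append('rabbitmq_partitioned{from="%s"} 1' % peer)
    return "\n".join(lines)

print(render_partitioned_metric(["rabbit@node2", "rabbit@node3"]))
```

With this shape, sum(rabbitmq_partitioned) > 0 fires on any node that reports at least one partitioned-from peer, and an empty scrape means no netsplit was observed.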

@ggustafsson

I am happy to have a look. If I see it correctly, Prometheus does not support lists of strings as values. Would the following format be acceptable: rabbitmq_partitioned{from="rabbit@<host>"} 1 for each element in the partitions list?

If there is no netsplit, no such metric is returned. An alarm could be configured for something like sum(rabbitmq_partitioned) > 0

That looks perfect to me! Definitely something that should be added to the Grafana dashboard afterwards.

@michaelklishin
Member

#9465 introduces a very similar metric that should be enough, and will be forward compatible with 3.13 and 4.0: the number of unreachable cluster peers.

E.g. with three nodes running in a cluster of five, or with two nodes disconnected from their peers, you'd get:

# TYPE rabbitmq_unreachable_cluster_peers_count gauge
# HELP rabbitmq_unreachable_cluster_peers_count Number of peers in the cluster the current node cannot reach.
rabbitmq_unreachable_cluster_peers_count 2
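An alert on this metric could then look like the following sketch. The group name, alert name, threshold, and labels are illustrative, not a shipped configuration:

```yaml
groups:
  - name: rabbitmq-partitions
    rules:
      - alert: RabbitMQUnreachableClusterPeers
        # Per-node metric: fires for any node that cannot reach one or more peers.
        expr: rabbitmq_unreachable_cluster_peers_count > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} cannot reach {{ $value }} cluster peer(s)"
```

Because the metric is node-local, the rule must be evaluated against every node's scrape, not aggregated away across the cluster.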

@truong-hua

truong-hua commented Feb 29, 2024

@michaelklishin the unreachable peer count is not the same as a network partition. I have a cluster which is under network partitioning, with a warning that it may cause data loss, but rabbitmq_unreachable_cluster_peers_count is always zero because, in the cluster of 3, only the connection between two of the nodes is broken. I'm using 3.12.9

@michaelklishin
Member

It is sufficient to determine whether or not there is a partition. Obviously this is a per-node metric that must be monitored on all nodes.

There isn't a magical metric, not derived from the unreachable peer count, that would work better. Ask yourself how you would determine that a node has lost its connection to any peers. Voilà, that's exactly what we offer as a per-node metric.

@rabbitmq rabbitmq locked and limited conversation to collaborators Feb 29, 2024
@michaelklishin
Member

michaelklishin commented Feb 29, 2024

Our team has no plans of going back to the way partitions were reported in the "pre-Prometheus" era.

This metric is per node and not per cluster, but so are all Prometheus metrics in general, for a fairly obvious reason. These metrics allow you to detect a partition for alerting or Grafana visualization purposes. Time to move on from trying to convince the core team that somehow the old metric was superior. It was not, and it does not fit the Prometheus scraping approach, where each node only reports its own metrics and other tools such as Grafana can aggregate them to produce a cluster-wide view.
