Prometheus: expose a metric for network partition observed on the node #2508
There is a separate Erlang distribution exporter and Grafana dashboard. It provides a lot more information than a boolean or an integer gauge.
Whether a node is "connected" is not very clear-cut: network links can slow down, become permanently saturated, and so on. These new dashboards are significantly more descriptive and advanced. @gerhard can explain how to provision them, but I believe the above panels demonstrate that this is a much better approach.
While showing network partitions from RabbitMQ's perspective would be helpful, it is not as helpful as understanding what happens at the Erlang distribution level. This is the layer within the Erlang VM (RabbitMQ's runtime) on which all RabbitMQ communication depends. For example, links between RabbitMQ nodes are bi-directional, may have TLS enabled, and may be switching between different states. The Erlang-Distribution dashboard captures all of this information. Without understanding what is happening in this context, a RabbitMQ partition is worth knowing about, for sure, but it doesn't contain sufficient detail to understand why it happened. I can appreciate that the Erlang-Distribution dashboard contains more information than you would expect, and I can see how being able to see network partitions on the RabbitMQ-Overview dashboard would be helpful. I will re-open this issue since I would like to add this to RabbitMQ-Overview, and most likely link to Erlang-Distribution from RabbitMQ-Overview for those who want to dig deeper.

While the Erlang-Distribution dashboard is not uploaded to the RabbitMQ Grafana org yet, you can get an import-friendly version by running the following:

```sh
cd github.com/rabbitmq/rabbitmq-prometheus
make Erlang-Distribution.json
```

```json
{
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "6.0.0"
    },
  ...
  "timezone": "",
  "title": "Erlang-Distribution",
  "uid": "d-SFCCmZz",
  "version": 1,
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ]
}
```

There will be a recording of Observe & Understand RabbitMQ from RabbitMQ Summit on YouTube later this month, which will have more detail on this dashboard as well as others. There will also be a RabbitMQ webinar on the 12th of December on the same topic. If you want to join, follow our Twitter, where we will post more info.
Thank you @michaelklishin for pointing me to the Erlang distribution exporter and dashboard.
We can close this issue when we have exposed a network partition metric from RabbitMQ's perspective, and when this metric is displayed on the RabbitMQ-Overview dashboard. RabbitMQ can still consider itself partitioned while everything looks healthy from the Erlang distribution perspective. To account for this scenario, operators will need to take both perspectives (meaning both metrics) into account to be certain that it is a genuine network partition, and not something that can be resolved by restarting the partitioned RabbitMQ node so that it can re-join the cluster and resume service.
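As a sketch of what "taking both perspectives into account" could look like in alerting rules, assuming a hypothetical RabbitMQ-side partition gauge (rabbitmq_partitions_count, proposed later in this thread) and the Erlang distribution collector's per-peer link state gauge (erlang_vm_dist_node_state, where 3 means the connection is up; both names are assumptions here, not confirmed APIs):

```yaml
groups:
  - name: rabbitmq-partition-checks
    rules:
      # RabbitMQ itself reports being partitioned from at least one peer.
      # rabbitmq_partitions_count is a hypothetical metric.
      - alert: RabbitMQReportsPartition
        expr: rabbitmq_partitions_count > 0
        for: 5m
        labels:
          severity: critical
      # RabbitMQ reports a partition while every distribution link on the
      # node looks up (state 3): restarting the node so it re-joins the
      # cluster is then the likely fix, rather than the network itself.
      - alert: RabbitMQPartitionDespiteHealthyDistLinks
        expr: >
          rabbitmq_partitions_count > 0
          and on (instance)
          min by (instance) (erlang_vm_dist_node_state) == 3
        for: 5m
        labels:
          severity: warning
```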
Disclaimer: I'm totally new to Erlang and RabbitMQ, so sorry for the noob questions right away. I might be able to invest some hours into a PR to implement this. From browsing the code I suspect one approach would be to call rabbit_mnesia:partitions/0 and expose the length of the partition list as rabbitmq_partitions_count. Does that sound reasonable?
Thank you for considering a contribution. Yes, that's the source of information we should use. Whether we expose just a count or more details is up for debate. Having a metric that describes what nodes the emitting one is partitioned from can be useful in my opinion.
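A hypothetical sketch of that more detailed shape, in Prometheus exposition format (metric and label names invented here purely for illustration):

```
# One series per peer this node currently considers itself partitioned from
rabbitmq_cluster_partitioned_from{peer="rabbit@node2"} 1
rabbitmq_cluster_partitioned_from{peer="rabbit@node3"} 1
```

A plain count, as proposed above, would then simply be the sum of these series.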
Is someone looking at this? We have been bitten by this a few times at work now, so I would really like to see this implemented. I wanted to fix this myself, but after having looked at the code I quickly realized that there is just no way I could do this in a reasonable time because I have no prior knowledge of Erlang. Any help would be greatly appreciated!
Those who need this metric are welcome to look into it. Since this is a piece of node-local state, it can be a list of nodes the reporting node sees as disconnected. Visualising and alerting on this would be easier if the list were only checked for emptiness, but for extra context it should be reported as a list of [observed as] unavailable peers.
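Assuming a hypothetical per-peer gauge such as rabbitmq_cluster_partitioned_from{peer="..."} (the name is an illustration, not an existing metric), both uses fall out of one short PromQL query each:

```promql
# Emptiness check: fires when the list of unavailable peers is non-empty
sum by (instance) (rabbitmq_cluster_partitioned_from) > 0

# Extra context: which peers each node observes as unavailable
rabbitmq_cluster_partitioned_from == 1
```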
I just had a RabbitMQ node of a three-node cluster split off into a partition. The management interface reported the partition, as did rabbitmqctl cluster_status. I have all dashboards available but was unable to spot any indication of this partitioning. Am I reading this correctly, that there is currently no metric available that would have allowed me to be alerted on such a massive disruption in the RabbitMQ clustering?
I am happy to have a look. If there is no netsplit, no such metric is returned.
That looks perfect to me! Definitely something that should be added to the Grafana dashboard afterwards.
#9465 introduces a very similar metric that should be enough, and will be forward compatible with 3.13 and 4.0: the number of unreachable cluster peers. E.g. with three nodes running in a cluster of five, or with two nodes disconnected from their peers, you'd get a value of 2.
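A sketch of what a scrape could return in that state (the metric name is the one from #9465; the exact exposition details are assumed, and the value matches the scenarios above, each of which leaves two peers unreachable):

```
rabbitmq_unreachable_cluster_peers_count 2
```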
@michaelklishin the unreachable peer count is not the same as a network partition metric. I have a cluster which is in a network partition, with a warning that it may cause data loss, but rabbitmq_unreachable_cluster_peers_count is always zero, because in a cluster of 3, only the connection between 2 of the nodes is broken. I'm using 3.12.9.
It is sufficient to determine whether or not there is a partition. Obviously this is a per-node metric that must be monitored on all nodes. There isn't a magical metric, not derived from the unreachable peer count, that would work better. Ask yourself how you would determine whether a node has lost its connection to any peers. Voilà, that's exactly what we offer as a per-node metric.
Our team has no plans of going back to the way partitions were reported in the "pre-Prometheus" era. This metric is per node and not per cluster, but so are all Prometheus metrics in general, for a fairly obvious reason. These metrics allow you to detect a partition for alerting or Grafana visualization purposes. Time to move on from trying to convince the core team that somehow the old metric was superior. It was not, and it does not fit the Prometheus scraping approach, where each node reports only its own metrics and other tools such as Grafana aggregate them to produce a cluster-wide view.
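Concretely, the per-node metric aggregates into a cluster-wide check in one PromQL expression (treat the label setup as an assumption about your scrape config):

```promql
# Something is wrong cluster-wide as soon as any node
# sees at least one unreachable peer.
max(rabbitmq_unreachable_cluster_peers_count) > 0
```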
Hi,
I'm happy that RabbitMQ got its own Prometheus exporter. 👍 also for the prepared Grafana dashboards; they look great.
I was using the deadtrickster Prometheus RabbitMQ exporter before. It allowed me to monitor network partitions with its rabbitmq_node_up metric (https://github.com/deadtrickster/prometheus_rabbitmq_exporter#nodes), which shows whether one node is connected to another. I have three nodes in my cluster for now, so I can write an alert to check that each RabbitMQ node is connected to the other two; when it is not, I'm alerted. Would it be possible to add something similar to this plugin?
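For reference, the kind of alert described here might have looked roughly like this with that exporter (the exact label set of its rabbitmq_node_up metric is assumed, so treat this as a sketch):

```yaml
groups:
  - name: rabbitmq-node-connectivity
    rules:
      # With a three-node cluster, each node should report all three
      # cluster members as up; anything less suggests a lost connection.
      - alert: RabbitMQNodeSeesPeerDown
        expr: sum by (instance) (rabbitmq_node_up) < 3
        for: 2m
        labels:
          severity: critical
```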