Scenario: Two Alertmanager clusters (A, B) are running in the same Kubernetes cluster.
If Alertmanager cluster A is scaled down by one instance with IP address X, and shortly afterwards Alertmanager cluster B is scaled up by one instance that receives the recycled IP address X, the two clusters will merge into one big cluster.
We are hitting this problem in the Prometheus Operator end-to-end test suite (prometheus-operator/prometheus-operator#2544) on a single node (small CIDR space) Kubernetes cluster.
The probability of this happening in production systems is questionable. Most setups probably include only a single Alertmanager cluster. In addition, CIDR ranges might be a lot bigger, and IP address recycling might not happen as frequently.
This could be prevented via a unique identifier per Alertmanager cluster, disallowing instances with different identifiers from joining. In addition, #1819, which introduces mutual TLS, could stop accidental cluster merges as long as trust chains are scoped per cluster.
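To make the identifier idea concrete, here is a minimal sketch of a join check. It is an illustration only and does not reflect Alertmanager's actual gossip implementation; the `Peer` class, the `cluster_id` field, and `handle_join` are hypothetical names, and the IP addresses are made up.

```python
from dataclasses import dataclass, field


class ClusterMismatchError(Exception):
    """Raised when a joining peer presents a different cluster identifier."""


@dataclass
class Peer:
    # Hypothetical peer model, not one of Alertmanager's real types.
    address: str
    cluster_id: str
    members: set = field(default_factory=set)

    def __post_init__(self):
        # Every peer starts out knowing only itself.
        self.members.add(self.address)

    def handle_join(self, other: "Peer") -> None:
        # Compare identifiers before any membership state is exchanged.
        if other.cluster_id != self.cluster_id:
            raise ClusterMismatchError(
                f"peer {other.address} belongs to cluster {other.cluster_id!r}, "
                f"expected {self.cluster_id!r}"
            )
        # Same cluster: both sides end up with the union of known members.
        merged = self.members | other.members
        self.members = set(merged)
        other.members = set(merged)


# Cluster A still remembers 10.0.0.7 as a peer address, but the address was
# recycled and now belongs to a new instance of cluster B.
a1 = Peer("10.0.0.5", cluster_id="cluster-a")
b_new = Peer("10.0.0.7", cluster_id="cluster-b")

try:
    a1.handle_join(b_new)
    clusters_merged = True
except ClusterMismatchError:
    clusters_merged = False
```

With a check like this, the recycled address alone is no longer enough to join: the instance from cluster B is rejected, and the member lists of A and B stay separate.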
The purpose of this issue is to document the failure for the future and give anyone hitting the same issue a central place to discuss how to proceed.
While a TLS chain of trust could solve this as a side effect, I don’t think it is the appropriate solution. As you proposed, a separate mechanism sounds reasonable.
As for the probability, it is actually not all that low. I recall Kubernetes IP recycling causing various problems across the board.