k8s peer discovery v2 #13050

Open · mkuratczyk wants to merge 3 commits into main from k8s-peer-discovery-simplified
Conversation

@mkuratczyk (Contributor) commented on Jan 10, 2025

Proposed Changes

A completely different implementation of the peer discovery mechanism for Kubernetes. Rather than querying the Kubernetes API, it assumes the deployment uses a StatefulSet (as it should) and just checks the local node name. Then:

  • if the node name's suffix is -0, form a new cluster
  • in any other case, try to join the -0 node and keep trying forever; only the -0 node can form a new cluster (a sketch follows this list)
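
As a rough illustration, here is a minimal Erlang sketch of that decision, assuming the conventional StatefulSet node name format `rabbit@<statefulset>-<ordinal>.<service>...`. The module and function names are hypothetical, not the PR's actual code:

```erlang
-module(k8s_seed_sketch).
-export([role/1]).

%% Returns 'seed' for the pod with ordinal 0, or {join, SeedNode} otherwise.
%% Assumes a node name like 'rabbit@myrabbit-server-1.myrabbit-nodes.ns'.
role(Node) when is_atom(Node) ->
    [Name, Host] = string:split(atom_to_list(Node), "@"),
    %% Split the host into the pod name and the rest of the FQDN (if any).
    [PodName | HostRest] = string:split(Host, "."),
    Parts = string:split(PodName, "-", all),
    case lists:last(Parts) of
        "0" ->
            %% Only this pod may form a new cluster.
            seed;
        _Ordinal ->
            %% Rebuild the seed node name by replacing the ordinal with 0.
            Prefix = lists:flatten(lists:join("-", lists:droplast(Parts))),
            SeedHost = lists:flatten(lists:join(".", [Prefix ++ "-0" | HostRest])),
            {join, list_to_atom(Name ++ "@" ++ SeedHost)}
    end.
```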

While it works differently internally, it's completely backwards compatible: existing configuration options are accepted but ignored. The Cluster Operator can deploy clusters with these changes and everything works with no changes on the Operator side.

Benefits of this approach:

  • no queries to the Kubernetes API during boot - the decision is based purely on the local node name
  • only the -0 node can ever form a new cluster, which should prevent race conditions and incorrectly formed clusters

Drawbacks:

  • the existing plugin works perfectly fine for the vast majority of users, so introducing significant changes carries some risk

TODO:

  • discuss this approach
  • consider adding some configurability just in case (allow overriding the node name / seed node name?)
  • test in different Kubernetes environments

@mkuratczyk force-pushed the k8s-peer-discovery-simplified branch from 7aadc44 to 54eecc1 on January 10, 2025
@mkuratczyk force-pushed the k8s-peer-discovery-simplified branch 2 times, most recently from 872519e to d153aaf on January 20, 2025
Resolved review threads on:
  • deps/rabbit/src/rabbit_peer_discovery.erl
  • deps/rabbitmq_peer_discovery_k8s/README.md (3 threads)
@mkuratczyk force-pushed the k8s-peer-discovery-simplified branch from d153aaf to 5a8865b on January 22, 2025
This came up when testing whether all nodes would wait when node 0 could not start. To test this, I was killing node 0 every few seconds, so it would attempt to start and then get killed. This led to other nodes sometimes discovering it and attempting to sync, but then crashing with a function_clause error when it went away.
Rather than querying the Kubernetes API, just check the local node name
and try to connect to the pod with ID `0` (`-0` suffix). Only the pod
with ID 0 can form a new cluster - all other pods will wait forever.
This should prevent any race conditions and incorrectly formed clusters.
@mkuratczyk force-pushed the k8s-peer-discovery-simplified branch from 5a8865b to b8d3186 on January 22, 2025
@mkuratczyk marked this pull request as ready for review on January 22, 2025
@ansd (Member) commented on Jan 23, 2025

I think the general idea is good. It's simple and works with both parallel and sequential startup of nodes.

I'll leave a couple of comments/thoughts/questions here:

  1. Yet another benefit (not listed above) is that the nodes do not need permission to query the K8s Endpoints object, which simplifies things further in security-restrictive environments.
  2. The ordinal is hardcoded to 0 here. StatefulSets can be configured to use a different .spec.ordinals.start. Should this configurability be ignored or addressed in this PR? (A sketch of one option follows this list.)
  3. I haven't fully studied the code path, but what happens if the seed node 0 is down? For initial cluster formation, it's fine to make the whole deployment fail, but what happens if node 0 goes down in my 3-node prod cluster, I can't get it back up, I panic, and want to scale out the cluster to a 4th node? The 4th node won't be able to contact node 0. Does the 4th node then contact node 1 or 2? Does the boot process completely fail? Or does it succeed with the node unclustered, so that I can run join_cluster manually?
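
A hedged sketch of how point 2 might be addressed: read the seed ordinal from a configurable application environment key instead of hardcoding 0. The `seed_ordinal` key is hypothetical; this PR does not currently expose such a setting:

```erlang
%% Hypothetical setting mirroring a StatefulSet's .spec.ordinals.start;
%% defaults to 0, matching the behavior described in this PR.
seed_ordinal() ->
    application:get_env(rabbitmq_peer_discovery_k8s, seed_ordinal, 0).

%% A pod may form a new cluster only if its ordinal matches the seed ordinal.
is_seed(PodOrdinal) when is_integer(PodOrdinal) ->
    PodOrdinal =:= seed_ordinal().
```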

@mkuratczyk (Contributor, Author) commented

I didn't know Kubernetes had added that configurability. I was thinking about adding some configurability just in case there were some crazy configurations where the defaults didn't work, but this makes such configurability a must (why would anyone set a custom ordinal start though? why?! ;) )

There's currently no fallback - the node will keep trying to contact server-0 forever. I haven't checked whether a manual join_cluster operation can resolve a cluster in such a state; I will. Adding a fallback to another node would go against the whole idea of how this approach works, so I'd rather not do it. However, the aforementioned configurability could perhaps be used to make the 4th node contact a node other than -0.

@dumbbell (Member) commented

In David’s example, node 4 could join node 1 after checking that node 1 is already clustered with node 0.
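
A rough sketch of that check, assuming `rabbit_nodes:list_members/0` (available in recent RabbitMQ versions) returns the cluster members as seen by the remote node; `can_join_via` is a hypothetical helper, not code from this PR:

```erlang
%% Only treat Candidate as a valid join target if the seed node already
%% belongs to the cluster that Candidate is a member of.
can_join_via(Candidate, SeedNode) ->
    case rpc:call(Candidate, rabbit_nodes, list_members, []) of
        Members when is_list(Members) ->
            lists:member(SeedNode, Members);
        {badrpc, _Reason} ->
            %% Candidate is unreachable or errored; do not join via it.
            false
    end.
```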
