
Out-of-date connection strings when running three_data_hall #1958

Closed
simenl opened this issue Mar 6, 2024 · 3 comments · Fixed by #1978
Labels: enhancement (New feature or request)

Comments

@simenl (Collaborator) commented Mar 6, 2024

What happened?

When multiple FoundationDBCluster Kubernetes resources manage the same cluster, as in a three_data_hall setup, a change to the connection string does not propagate to all FoundationDBCluster resources, leaving the connection string in the resource status and in the associated ConfigMap (cluster-file) out of date.
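For context, a three_data_hall setup uses several FoundationDBCluster resources (typically one per data hall) that all point at the same underlying cluster. The following is a hand-written sketch loosely based on the operator's three_data_hall example; the name, version, zone, and seed connection string are all illustrative:

```yaml
# One of three FoundationDBCluster resources managing the same cluster.
# All names and values below are illustrative, not taken from this report.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster-az1
spec:
  version: 7.1.26                 # illustrative FDB version
  dataHall: az1                   # the data hall this resource manages
  processGroupIDPrefix: az1       # keeps process group IDs unique across resources
  databaseConfiguration:
    redundancy_mode: three_data_hall
  # The second and third resources join the existing cluster through the
  # connection string copied from the first resource's status:
  seedConnectionString: sample_cluster:abcd1234@10.0.0.1:4501
```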

FoundationDB clients use the cluster file (connection string) mounted from the ConfigMap. When the connection string is stale, `status json` reports `client.database_status.healthy: false` with the message "Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally."

The issue resolves whenever the reconciliation loop of the out-of-date FoundationDBCluster runs again, but that seems to require an external trigger, so there is no bound on how long the resources stay out of date.

What did you expect to happen?

An update to the connection string should eventually (within a few minutes) propagate to all FoundationDBCluster resources managing the cluster and to their associated ConfigMaps.

How can we reproduce it (as minimally and precisely as possible)?

1. Create a cluster with multiple FoundationDBCluster resources, e.g. by following the three_data_hall example.
2. Wait for it to reconcile.
3. Apply a dummy change to one of the FoundationDBCluster resources that triggers a change of coordinators, e.g. updating the node selector (see the sketch below). The connection strings will now be out of sync between the FoundationDBCluster resources (and their ConfigMaps).
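For illustration, a dummy change of this kind could be a patch like the following. This is a hand-written, hypothetical snippet; the label key and value are placeholders, and any spec change that forces coordinator pods to be replaced should work:

```yaml
# Hypothetical patch for exactly one of the FoundationDBCluster resources.
# Changing the node selector forces pod replacement, which can trigger a
# coordinator change and therefore a new connection string.
spec:
  processes:
    general:
      podTemplate:
        spec:
          nodeSelector:
            example.com/dummy: "coordinator-change-trigger"  # placeholder label
```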

Anything else we need to know?

No response

FDB Kubernetes operator

v1.33.0

Kubernetes version

1.26

Cloud provider

Azure, GCP

simenl added the bug label on Mar 6, 2024
@johscheuer (Member) commented Mar 7, 2024

That issue probably happens for clients outside of the FDB Pods? I believe the issue is that the client cannot write to the mounted cluster file from the ConfigMap. It's better to copy the cluster file to a location the client can write to, e.g. with an init container (a sketch follows below). If the client can write to the cluster file, it will automatically update it when the coordinators change.

So the issue is not necessarily that the connection strings differ for some time; rather, the clients are not able to update their cluster file.
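For illustration, a client pod following this advice might look like the sketch below. This is an assumption-laden, hand-written example: the ConfigMap name, mount paths, and images are placeholders; the `cluster-file` key is the one the operator writes, as mentioned above.

```yaml
# Sketch of a client pod: copy the read-only cluster file from the ConfigMap
# into a writable emptyDir so the client library can rewrite it when the
# coordinators change. Names, paths, and images are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: fdb-client
spec:
  initContainers:
    - name: copy-cluster-file
      image: busybox                      # any image with a shell works
      command: ["sh", "-c", "cp /mnt/config-map/fdb.cluster /mnt/fdb/fdb.cluster"]
      volumeMounts:
        - name: config-map
          mountPath: /mnt/config-map
        - name: cluster-file
          mountPath: /mnt/fdb
  containers:
    - name: app
      image: my-app:latest                # placeholder application image
      env:
        - name: FDB_CLUSTER_FILE          # standard env var read by FDB clients
          value: /mnt/fdb/fdb.cluster
      volumeMounts:
        - name: cluster-file
          mountPath: /mnt/fdb
  volumes:
    - name: config-map
      configMap:
        name: sample-cluster-config       # assumed name of the operator's ConfigMap
        items:
          - key: cluster-file
            path: fdb.cluster
    - name: cluster-file
      emptyDir: {}
```

With this layout the client owns a writable copy of the cluster file and can keep it current on its own, independent of how quickly the ConfigMap is reconciled.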

@simenl (Collaborator, Author) commented Mar 7, 2024

> That issue probably happens for clients outside of the FDB Pods? I believe the issue is that the client cannot write to the mounted cluster file from the ConfigMap.

Correct. We mount the cluster file from the ConfigMap, which has worked excellently with a single FoundationDBCluster resource. Using an init container to copy the cluster file would fix the issue for clients that have already connected to the cluster.

> So the issue is not necessarily that the connection strings differ for some time; rather, the clients are not able to update their cluster file.

If the cluster file in the ConfigMap is out of date, there might be a risk that new pods, which copy the cluster file from the ConfigMap, get an invalid or unusable connection string. I cannot tell whether this is improbable or impossible.

@johscheuer (Member)

> Correct. We mount the cluster file from the ConfigMap, which has worked excellently with a single FoundationDBCluster resource. Using an init container to copy the cluster file would fix the issue for clients that have already connected to the cluster.
>
> If the cluster file in the ConfigMap is out of date, there might be a risk that new pods, which copy the cluster file from the ConfigMap, get an invalid or unusable connection string. I cannot tell whether this is improbable or impossible.

As long as at least one of the old coordinators is reachable, they can still connect to the cluster, so the risk should be minimal.

I'll add a different label to this issue, as this is a deficit in the current design rather than a bug.

I'm going to update the documentation for clients (not sure if we have something like that already in place) with the hint that the cluster file should be moved to a location where the application can write.
