
Out-of-date connection strings when running three_data_hall #1958

Closed
simenl opened this issue Mar 6, 2024 · 3 comments · Fixed by #1978
Labels: enhancement (New feature or request)

Comments

@simenl (Collaborator) commented Mar 6, 2024

What happened?

When multiple FoundationDBCluster Kubernetes resources manage the same cluster, as in a three_data_hall setup, a change to the connection string does not propagate to all FoundationDBCluster resources, leaving the connection string in the resource status and in the associated ConfigMap (cluster-file) out of date.
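For context, a three_data_hall setup uses several FoundationDBCluster resources (typically one per data hall) that all point at the same underlying cluster. The following is a hand-written sketch loosely based on the operator's three_data_hall example; the name, version, zone, and seed connection string are all illustrative:

```yaml
# One of three FoundationDBCluster resources managing the same cluster.
# All names and values below are illustrative, not taken from this report.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: sample-cluster-az1
spec:
  version: 7.1.26                 # illustrative FDB version
  dataHall: az1                   # the data hall this resource manages
  processGroupIDPrefix: az1       # keeps process group IDs unique across resources
  databaseConfiguration:
    redundancy_mode: three_data_hall
  # The second and third resources join the existing cluster through the
  # connection string copied from the first resource's status:
  seedConnectionString: sample_cluster:abcd1234@10.0.0.1:4501
```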

FoundationDB clients use the cluster file (connection string) mounted from the ConfigMap. When the connection string is stale, `status json` reports `client.database_status.healthy: false` with the message "Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally."

The issue resolves whenever the reconciliation loop of the out-of-date FoundationDBCluster runs again, but that seems to require an external trigger, so there is no bound on how long the resources stay out of date.

What did you expect to happen?

An update to the connection string should eventually (within a few minutes) propagate to all FoundationDBCluster resources managing the cluster and to their associated ConfigMaps.

How can we reproduce it (as minimally and precisely as possible)?

1. Create a cluster with multiple FoundationDBCluster resources, e.g. by following the three_data_hall example.
2. Wait for it to reconcile.
3. Apply a dummy change to one of the FoundationDBCluster resources that triggers a change of coordinators, e.g. updating the node selector (see the sketch below). The connection strings will now be out of sync between the FoundationDBCluster resources (and their ConfigMaps).
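For illustration, a dummy change of this kind could be a patch like the following. This is a hand-written, hypothetical snippet; the label key and value are placeholders, and any spec change that forces coordinator pods to be replaced should work:

```yaml
# Hypothetical patch for exactly one of the FoundationDBCluster resources.
# Changing the node selector forces pod replacement, which can trigger a
# coordinator change and therefore a new connection string.
spec:
  processes:
    general:
      podTemplate:
        spec:
          nodeSelector:
            example.com/dummy: "coordinator-change-trigger"  # placeholder label
```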

Anything else we need to know?

No response

FDB Kubernetes operator

v1.33.0

Kubernetes version

1.26

Cloud provider

Azure, GCP

simenl added the bug label on Mar 6, 2024
@johscheuer (Member) commented Mar 7, 2024

That issue probably happens for clients outside of the FDB Pods? I believe the issue is that the client cannot write to the mounted cluster file from the ConfigMap. It's better to copy the cluster file to a location the client can write to, e.g. with an init container (a sketch follows below). If the client can write to the cluster file, it will automatically update it when the coordinators change.

So the issue is not necessarily that the connection strings differ for some time; rather, the clients are not able to update their cluster file.
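For illustration, a client pod following this advice might look like the sketch below. This is an assumption-laden, hand-written example: the ConfigMap name, mount paths, and images are placeholders; the `cluster-file` key is the one the operator writes, as mentioned above.

```yaml
# Sketch of a client pod: copy the read-only cluster file from the ConfigMap
# into a writable emptyDir so the client library can rewrite it when the
# coordinators change. Names, paths, and images are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: fdb-client
spec:
  initContainers:
    - name: copy-cluster-file
      image: busybox                      # any image with a shell works
      command: ["sh", "-c", "cp /mnt/config-map/fdb.cluster /mnt/fdb/fdb.cluster"]
      volumeMounts:
        - name: config-map
          mountPath: /mnt/config-map
        - name: cluster-file
          mountPath: /mnt/fdb
  containers:
    - name: app
      image: my-app:latest                # placeholder application image
      env:
        - name: FDB_CLUSTER_FILE          # standard env var read by FDB clients
          value: /mnt/fdb/fdb.cluster
      volumeMounts:
        - name: cluster-file
          mountPath: /mnt/fdb
  volumes:
    - name: config-map
      configMap:
        name: sample-cluster-config       # assumed name of the operator's ConfigMap
        items:
          - key: cluster-file
            path: fdb.cluster
    - name: cluster-file
      emptyDir: {}
```

With this layout the client owns a writable copy of the cluster file and can keep it current on its own, independent of how quickly the ConfigMap is reconciled.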

@simenl (Collaborator, Author) commented Mar 7, 2024

> That issue probably happens for clients outside of the FDB Pods? I believe the issue is that the client cannot write to the mounted cluster file from the ConfigMap.

Correct. We mount the cluster file from the ConfigMap, which has worked excellently with a single FoundationDBCluster resource. Using an init container to copy the cluster file would fix the issue for clients that have already connected to the cluster.

> So the issue is not necessarily that the connection strings differ for some time; rather, the clients are not able to update their cluster file.

If the cluster file in the ConfigMap is out of date, there might be a risk that new pods, which copy the cluster file from the ConfigMap, get an invalid or unusable connection string. I cannot tell whether this is improbable or impossible.

@johscheuer (Member)

> Correct. We mount the cluster file from the ConfigMap, which has worked excellently with a single FoundationDBCluster resource. Using an init container to copy the cluster file would fix the issue for clients that have already connected to the cluster.
>
> If the cluster file in the ConfigMap is out of date, there might be a risk that new pods, which copy the cluster file from the ConfigMap, get an invalid or unusable connection string. I cannot tell whether this is improbable or impossible.

As long as at least one of the old coordinators is reachable, they can still connect to the cluster, so the risk should be minimal.

I'll add a different label to this issue, as this is a deficit in the current design rather than a bug.

I'm going to update the documentation for clients (not sure if we have something like that already in place) with the hint that the cluster file should be moved to a location where the application can write.
