
three_data_hall inconsistent connection string problem #2171

Open
gm42 opened this issue Nov 18, 2024 · 20 comments
Labels
question Further information is requested

Comments

@gm42
Contributor

gm42 commented Nov 18, 2024

What happened?

I have found a quirky issue when using three_data_hall: the Kubernetes FoundationDBCluster status contains an outdated connection string which does not correspond to what the cluster is currently using:

A:  connectionString:     mydb:[email protected]:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501
A:  seedConnectionString: <none>

B:  seedConnectionString: mydb:[email protected]:4501,10.20.22.192:4501,10.20.67.101:4501,10.10.210.33:4501,10.10.241.115:4501,10.10.268.53:4501,10.10.277.35:4501,10.10.285.107:4501,10.20.220.3:4501
B:  connectionString: mydb:[email protected]:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.220.14:4501,10.10.265.72:4501,10.20.200.1:4501,10.20.217.63:4501,10.20.243.153:4501

C:  seedConnectionString: mydb:[email protected]:4501,10.20.22.192:4501,10.20.67.101:4501,10.10.210.33:4501,10.10.241.115:4501,10.10.268.53:4501,10.10.277.35:4501,10.10.285.107:4501,10.20.220.3:4501
C:  connectionString: mydb:[email protected]:4501,10.20.22.84:4501,10.20.60.215:4501,10.20.67.98:4501,10.20.81.100:4501,10.10.265.72:4501,10.20.217.63:4501,10.20.240.122:4501,10.20.243.153:4501

When checking directly via get \xff\xff/connection_string, the correct connection string for the cluster appears to be the one with generation ID 9pION4FdMvW53gB4ikdjo61cs7HQKtK3.
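
For reference, the mismatch can be seen by comparing the two along these lines (the resource and pod names here are placeholders):

# what the operator has stored in the FoundationDBCluster status
$ kubectl get fdb mydb-a -o jsonpath='{.status.connectionString}'

# what the database is actually using, read from inside any fdbserver pod
$ kubectl exec mydb-a-storage-1 -- fdbcli --exec 'get \xff\xff/connection_string'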

What did you expect to happen?

The operator-maintained connection string field should always match get \xff\xff/connection_string.

How can we reproduce it (as minimally and precisely as possible)?

  1. trigger pod rotation of some pods in data hall A, for example the coordinators, by changing something in the podTemplate for coordinators (see the sketch after this list)
  2. repeat for halls B and C until the issue manifests
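
One hedged way to do step 1 (the cluster resource name is a placeholder; the field path assumes the general process class):

$ kubectl edit fdb mydb-a
# then change something under spec.processes.general.podTemplate, for example an
# environment variable on an init container, so the operator replaces the affected pods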

Anything else we need to know?

Related: #1958

By my analysis the client issues are symptoms and not the cause: there will always be a client with an incorrect connection string if the operator-provided ConfigMaps do not all contain the same connection string across the three halls.

FDB Kubernetes operator

v1.47.0

Kubernetes version

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4-eks-a737599

Cloud provider

@gm42 gm42 added the bug Something isn't working label Nov 18, 2024
@johscheuer
Member

This is a known issue when multiple operator instances are used (like in the multi-region setup or in three-data-hall). I assume you used the split image as this is the default. In the newer unified image we added a propagation mechanism to reduce the window during which an outdated connection string is served. The TLDR here is that the fdb-kubernetes-monitor updates an annotation when the local cluster file changes (which is the case when the connection string is updated). The annotation change will then trigger a new reconciliation loop and the operator will update the FoundationDBCluster resource and the ConfigMap.

I changed the label to question, as this is fixed in the unified image and we have no plans to fix it for the split image.

@johscheuer johscheuer added question Further information is requested and removed bug Something isn't working labels Nov 19, 2024
@gm42
Contributor Author

gm42 commented Nov 19, 2024

Thanks for the explanation; is the unified image considered stable by now?

In the newer unified image we added a propagation mechanism to reduce the window during which an outdated connection string is served.

I am indeed using the default split image; however, the issue never resolves by itself (it's not a matter of waiting longer; it persists for days) unless I manually fix it or trigger further rotations.

@johscheuer
Member

Thanks for the explanation; is the unified image considered stable by now?

It's considered stable and can be used in production, but please test it first in your dev/test environment. If anything doesn't work as expected please create a GitHub issue for it.

I am indeed using the default split image; however, the issue never resolves by itself (it's not a matter of waiting longer; it persists for days) unless I manually fix it or trigger further rotations.

Interesting, I would have thought that after some time (I think the default reconciliation period is 10h) the connection string would be updated by the operator. Has the operator reconciled in that time? I would expect it didn't execute another reconciliation loop in this case.

@gm42
Contributor Author

gm42 commented Nov 20, 2024

Has the operator reconciled in that time? I would expect it didn't execute another reconciliation loop in this case.

IIRC the operator reports in its logs that nothing to reconcile was found, so it does not take any action; I'll reproduce the issue and paste the relevant logs here, in case they are of interest for a related bug. I'll then keep this test cluster running, in case you want me to check something else while it's affected.

@johscheuer
Member

The operator should notice that the connection strings are not matching (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/cluster_controller.go#L580-L586) and update the connection string in the FoundationDBCluster resource; the updateStatus reconciler will then update the FoundationDBCluster resource status (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/update_status.go#L254-L263).

@gm42
Contributor Author

gm42 commented Nov 20, 2024

I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.

@gm42
Contributor Author

gm42 commented Nov 20, 2024

I think the operator might have detected the problem but refrained from making any change because the cluster was unhealthy, which in turn was caused by clients using the incorrect connection string (depending on which of the 3 ConfigMaps they are using); I'll report back as soon as I can confirm this.

@gm42
Contributor Author

gm42 commented Nov 20, 2024

While I wait to reproduce this issue anew, I have been looking at the logs from the last time it happened. Both theories I mentioned above seem unfounded; does this perhaps shed some light?

[screenshot of operator log lines]

e.g. is this evidence that the operator is updating connection strings in some sort of loop?

@johscheuer
Member

I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.

The cached status shouldn't be an issue because the status is only cached for one reconciliation.

While I wait to reproduce this issue anew, I have been looking at the logs from the last time it happened. Both theories I mentioned above seem unfounded; does this perhaps shed some light?

Based on those logs it seems like the different operator instances have changed the coordinators? It might be worth checking for this log line during the same timeframe: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/change_coordinators.go#L83.

@gm42
Contributor Author

gm42 commented Nov 29, 2024

Thanks for your reply! I am attaching here the logs of a recent reproduction of this issue, in case they can shed some light.

Notes:

  1. I attached logs starting right when the cluster becomes unhealthy until ~5 min before I manually fixed it
  2. just before this time window I made a change to all 3 data hall cluster definitions (e.g. changing an environment variable on an init container by editing the 3 FoundationDBCluster objects); the operator then proceeded to rotate coordinator, storage, log and stateless pods; this operation is not in these logs, let me know if it is of interest and I can upload them
  3. this is operator v1.47.0 with some log lines added/removed (no functional changes)
  4. there is no Changing coordinators log line in this time window nor in the previous 7 days, so it looks like the operator is not changing coordinators
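
For reference, I checked for that log line roughly like this (the operator deployment name is a placeholder):

$ kubectl logs deploy/fdb-kubernetes-operator --since=168h | grep 'Changing coordinators'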

operator-logs-3dh-cs-issue.json

@gm42
Contributor Author

gm42 commented Dec 9, 2024

Yesterday I found another instance of this problem, but it was slightly different:

  • all 3 connection strings in the Kubernetes FoundationDBCluster status were the same
  • they were all different from the actual one used by the cluster (get \xff\xff/connection_string)

@johscheuer let me know if I should do some other test or provide more information

@johscheuer
Member

I attached logs starting right when the cluster becomes unhealthy until ~5 min before I manually fixed it

What do you mean by unhealthy? The cluster was not reachable?

there is no Changing coordinators log line in this time window nor in the previous 7 days, so it looks like the operator is not changing coordinators

Based on the "Chose connection option" it seems like all the operator instances used the same connection string, so I'm not totally sure what exactly I should be looking for, assuming it's the same issue where the different FoundationDBCluster resources show a different connection string for some time.

@johscheuer
Member

Yesterday I found another instance of this problem, but it was slightly different:

  • all 3 connection strings in the Kubernetes FoundationDBCluster status were the same
  • they were all different from the actual one used by the cluster (get \xff\xff/connection_string)

@johscheuer let me know if I should do some other test or provide more information

It would be helpful to get the operator logs, to see if the operator was running the reconcile loop. Was there a Changing coordinators message or any error messages in the logs? Did the issue eventually fix itself or did you make any manual changes?

@gm42
Contributor Author

gm42 commented Dec 16, 2024

What do you mean by unhealthy? The cluster was not reachable?

If you fetch the status in JSON format, it's a boolean field under .client.database_status.healthy; it goes to false when clients have ConfigMap issues, for example. The database is perfectly available (that's the other boolean under .client.database_status).
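
For example, checked from any pod that has fdbcli (the pod name is a placeholder); the output looks roughly like this when the database is available but not healthy:

$ kubectl exec mydb-a-storage-1 -- fdbcli --exec 'status json' | jq '.client.database_status'
{
  "available": true,
  "healthy": false
}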

Based on the "Chose connection option" it seems like all the operator instances used the same connection string, so I'm not totally sure what exactly I should be looking for, assuming it's the same issue where the different FoundationDBCluster resources show a different connection string for some time.

Indeed, they are using the same connection string in that Chose connection option log line. The issue is visible to me via kubectl get fdb, because status.connectionString differs from fdbcli's get \xff\xff/connection_string; thus clients which use the operator's ConfigMap as their cluster file will either fail to connect or be reported in the cluster status as a client connection issue.

Perhaps the problem is that the operator somehow fails to update .status.connectionString and the ConfigMaps?

It would be helpful to get the operator logs, to see if the operator was running the reconcile loop. Was there a Changing coordinators message or any error messages in the logs?

Let me try finding also those logs.

Did the issue eventually fix itself or did you make any manual changes?

It was manually fixed (roughly as in the sketch below):

  1. set skip: true for each hall cluster
  2. set the correct connection string and seed connection string in the status subresource
  3. set skip back to false

If left alone, the issue does not fix itself (even after many hours).
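
A sketch of that procedure with kubectl (cluster names are placeholders, the connection string is shortened, --subresource=status needs a reasonably recent kubectl, and I've left out the seed connection string field here):

# 1. pause reconciliation for each data hall cluster
$ kubectl patch fdb mydb-a --type=merge -p '{"spec":{"skip":true}}'   # repeat for mydb-b, mydb-c

# 2. write the connection string the database actually uses into the status subresource
$ kubectl patch fdb mydb-a --subresource=status --type=merge \
    -p '{"status":{"connectionString":"mydb:9pION4FdMvW53gB4ikdjo61cs7HQKtK3@10.20.22.84:4501,..."}}'

# 3. resume reconciliation
$ kubectl patch fdb mydb-a --type=merge -p '{"spec":{"skip":false}}'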

@gm42
Contributor Author

gm42 commented Dec 16, 2024

These are the logs from that other instance. I can see a single failure to update the status in there. Perhaps that's the problem: if it fails that one time, is it never retried?

8thDec2024.json

@johscheuer
Member

Indeed, they are using the same connection string in that Chose connection option log line. The issue is visible to me via kubectl get fdb, because status.connectionString differs from fdbcli's get \xff\xff/connection_string; thus clients which use the operator's ConfigMap as their cluster file will either fail to connect or be reported in the cluster status as a client connection issue.

Where are you running that fdbcli command? \xff\xff/connection_string will show the current content of the connection string (not necessarily the connection string from the DB).

The operator should be logging a message when the connection string from the DB is different from the one stored in the status: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/cluster_controller.go#L580-L586
and it will log when the ConfigMap is updated: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/update_config_map.go#L72-L80.

In general it's recommended to not use the cluster file from the ConfigMap directly but rather copy it to a writable file to allow the clients to update their cluster file directly when coordinators are changed.

@gm42
Contributor Author

gm42 commented Dec 16, 2024

Where are you running that fdbcli command? \xff\xff/connection_string will show the current content of the connection string (not necessarily the connection string from the DB).

Thanks, will take note of this. I have been using a random pod to check this.

The operator should be logging a message when the connection string from the DB is different from the one stored in the status: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/cluster_controller.go#L580-L586
and it will log when the ConfigMap is updated: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/update_config_map.go#L72-L80.

What about that update_status.go failure in the logs? Is it OK to fail there, i.e. will something else later update the status anyway?

In general it's recommended to not use the cluster file from the ConfigMap directly but rather copy it to a writable file to allow the clients to update their cluster file directly when coordinators are changed.

Thanks, I think all clients are using OpenWithConnectionString except a couple, will fix them too.

@gm42
Contributor Author

gm42 commented Dec 18, 2024

@johscheuer I have one question about the ConfigMap usage: if a client connects via OpenWithConnectionString using the cluster file content from the operator-provided ConfigMap, should the client re-establish the connection when that ConfigMap changes?

@johscheuer
Member

Where are you running that fdbcli command? \xff\xff/connection_string will show the current content of the connection string (not necessarily the connection string from the DB).

Thanks, will take note of this. I have been using a random pod to check this.

The operator should be logging a message when the connection string from the DB is different from the one stored in the status: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/cluster_controller.go#L580-L586
and it will log when the ConfigMap is updated: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/update_config_map.go#L72-L80.

What about that update_status.go failure in the logs? Is it OK to fail there, i.e. will something else later update the status anyway?

In the next reconciliation run the operator should update the ConfigMap.

In general it's recommended to not use the cluster file from the ConfigMap directly but rather copy it to a writable file to allow the clients to update their cluster file directly when coordinators are changed.

Thanks, I think all clients are using OpenWithConnectionString except a couple, will fix them too.

From the FDB bindings:

// OpenWithConnectionString returns a database handle to the FoundationDB cluster identified
// by the provided connection string. This method can be useful for scenarios where you want to connect
// to the database only for a short time e.g. to test different connection strings.
// Caller must call Close() to release resources.

https://github.com/apple/foundationdb/blob/main/bindings/go/src/fdb/fdb.go#L408-L413; based on your description it rather sounds like you have long-running clients. But I would need to look inside the code to answer your question properly (I'll try to find some time for this). The current recommendation would be to copy the cluster file with an init container to a place that is writable, e.g.: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/samples/client.yaml#L42-L59.

@gm42
Contributor Author

gm42 commented Jan 6, 2025

Edit: @johscheuer my tab wasn't refreshed and I didn't notice your reply, thanks!

I would need to look inside the code to answer your question properly (I'll try to find some time for this). The current recommendation would be to copy the cluster file with an init container to a place that is writable, e.g.: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/samples/client.yaml#L42-L59.

Got it, so to write down the flow:

  1. the client starts up and copies the cluster file from the mounted ConfigMap to a writable location
  2. from this moment forward, even if the operator makes changes to the ConfigMap, the client can continue operating without having to watch it for changes

(I think this also means that it's not necessary to watch the original ConfigMap when using OpenWithConnectionString; a minimal sketch of the flow is below)
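
A minimal sketch of that flow, e.g. as an entrypoint wrapper (mount paths and the client binary name are placeholders):

#!/bin/sh
# copy the read-only, ConfigMap-mounted cluster file to a writable location
cp /var/dynamic-conf/fdb.cluster /var/fdb-writable/fdb.cluster
export FDB_CLUSTER_FILE=/var/fdb-writable/fdb.cluster
# the FDB client library can now rewrite this file itself when coordinators change,
# without the process having to watch the ConfigMap for updates
exec /usr/local/bin/my-client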
