three_data_hall inconsistent connection string problem #2171
Comments
This is a known issue when multiple operator instances are used (as in the multi-region setup or in three-data-hall). I assume you used the split image, as this is the default. In the newer unified image we added a propagation mechanism to reduce the time during which a connection string stays outdated. The TL;DR is that the fdb-kubernetes-monitor updates an annotation when the local cluster file changes (which is the case when the connection string is updated). The annotation change then triggers a new reconciliation loop and the operator will update the I change the label to
Thanks for the explanation; is the
I am indeed using the default
It's considered stable and can be used in production, but please test it first in your dev/test environment. If anything doesn't work as expected please create a GitHub issue for it.
Interesting, I would have thought that after some time (I think the default reconciliation period is 10h) the connection string would be updated by the operator. Has the operator reconciled in that time? I would expect that it didn't execute another reconciliation loop in this case.
IIRC the operator reports in its logs that it found nothing to reconcile, so it does not take any action; I'll reproduce the issue and paste the relevant logs here, in case they are of interest for a related bug. I'll then keep this test cluster running, in case you want me to check something else while it's affected.
The operator should notice that the connection strings are not matching: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.47.0/controllers/cluster_controller.go#L580-L586 and update the connection string in the
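For readers following along, the mismatch check described above can be sketched outside the operator. This is a minimal illustration, not the operator's code: it assumes the standard cluster file layout `description:id@coordinator,coordinator,...`, and the helper names `parse_connection_string` and `connection_strings_match` are hypothetical. Treating the coordinator list as order-insensitive is a choice made for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionString:
    description: str
    generation_id: str
    coordinators: frozenset


def parse_connection_string(raw: str) -> ConnectionString:
    """Parse 'description:id@coordinator1,coordinator2,...' (hypothetical helper)."""
    prefix, at, coordinators = raw.strip().partition("@")
    description, colon, generation_id = prefix.partition(":")
    if not at or not colon or not coordinators:
        raise ValueError(f"malformed connection string: {raw!r}")
    return ConnectionString(
        description, generation_id, frozenset(coordinators.split(","))
    )


def connection_strings_match(status_value: str, database_value: str) -> bool:
    """Compare the value stored in the status with the one read from the database.

    Assumption of this sketch: coordinator order does not matter.
    """
    return parse_connection_string(status_value) == parse_connection_string(
        database_value
    )
```

A differing generation ID, as reported in this issue, makes `connection_strings_match` return False even when the coordinator addresses are identical.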
I am indeed running that version of the operator; perhaps there is a cached status issue at play? I'll come back here once I reproduce it.
I think the operator might have detected the problem but refrained from making any change because the cluster was unhealthy, which in turn was caused by clients using the incorrect connection string (depending on which of the 3 configmaps they are using); I will report back as soon as I can confirm this.
The cached status shouldn't be an issue because the status is only cached for one reconciliation.
Based on those logs it seems like the different operator instances have changed the coordinators? It might be worth checking for those logs: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/change_coordinators.go#L83 during the same timeframe.
Thanks for your reply! I am attaching here the logs of a recent reproduction of this issue, in case they can shed some light. Notes:
Yesterday I found another instance of this problem, but it was slightly different:
@johscheuer let me know if I should run some other test or provide more information
What do you mean by
Based on the
It would be helpful to get the operator logs, to see if the operator was running the reconcile loop. Was there a
If you fetch the status in JSON format, it's a boolean field under
Indeed, they are using the same connection string in that log line. Perhaps the problem is that somehow the operator fails updating
Let me also try to find those logs.
It was manually fixed:
If left alone, the issue does not fix itself (even after many hours).
These are the logs from that other instance. I can see a single failure to update the status in there. Perhaps that's the problem: if it fails that one time, it is never retried?
Where are you running that
The operator should be logging a message when the connection string from the DB is different from the one stored in the status: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/cluster_controller.go#L580-L586 In general it's recommended not to use the cluster file from the ConfigMap directly but rather to copy it to a writable file, so that clients can update their cluster file directly when coordinators are changed.
Thanks, will take note of this. I have been using a random pod to check this.
What about that
Thanks, I think all clients are using
@johscheuer I have one question about the configmap usage: if client connects via
In the next reconciliation run the operator should update the
From the FDB bindings:
https://github.com/apple/foundationdb/blob/main/bindings/go/src/fdb/fdb.go#L408-L413 and based on your description it rather seems that you have long-running clients. But I would need to look inside the code to answer your question properly (I'll try to find some time for this). The current recommendation would be to copy the cluster file with an init container to a place that is writable, e.g.: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/samples/client.yaml#L42-L59.
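The copy step recommended above could look roughly like the following when done from a client's own startup code instead of an init container. This is a sketch under assumptions: the function name `prepare_writable_cluster_file` and the example paths are hypothetical, and only the `FDB_CLUSTER_FILE` environment variable is a documented FoundationDB client setting. An existing writable copy is left untouched, on the assumption that a client may already have rewritten it after a coordinator change.

```python
import os
import shutil


def prepare_writable_cluster_file(source: str, target_dir: str,
                                  name: str = "fdb.cluster") -> str:
    """Copy the read-only cluster file into a writable directory.

    Returns the path of the writable copy. If a copy already exists it is
    kept as-is, because it may be newer than the mounted ConfigMap content.
    """
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, name)
    if not os.path.exists(target):
        # copyfile copies content only, so the copy is not read-only
        # just because the mounted source file is.
        shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    mounted = os.environ.get("MOUNTED_CLUSTER_FILE", "/mnt/config/cluster-file")
    if os.path.exists(mounted):
        writable = prepare_writable_cluster_file(mounted, "/tmp/fdb-conf")
        # Point FDB clients started from this process at the writable copy.
        os.environ["FDB_CLUSTER_FILE"] = writable
```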
Edit: @johscheuer my tab wasn't refreshed and I didn't notice your reply, thanks!
Got it, so to write down the flow:
(I think this also means that it's not necessary to watch the original configmap when using
What happened?
I have found a quirky issue when using three_data_hall: the k8s FoundationDB cluster status contains an outdated connection string which does not correspond to what the cluster is currently using:
When checking directly via get \xff\xff/connection_string, the correct connection string for the cluster appears to be the one with generation ID 9pION4FdMvW53gB4ikdjo61cs7HQKtK3.
What did you expect to happen?
The operator-maintained connection string field should always match get \xff\xff/connection_string.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
Related: #1958
By my analysis the client issues are symptoms and not the cause: there is always going to be a client with an incorrect connection string if the operator-provided configmap does not contain the same connection string for all halls.
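The per-hall consistency condition stated above can be checked with a small script once the three connection strings and the live one have been collected by hand. The sketch below is illustrative only: the function name `find_stale_halls`, the hall names, and the sample values are all hypothetical, and how the strings are fetched (for example from each hall's configmap) is left out on purpose.

```python
def find_stale_halls(live: str, per_hall: dict) -> list:
    """Return the sorted names of halls whose stored connection string
    differs from the live one (surrounding whitespace is ignored)."""
    expected = live.strip()
    return sorted(
        hall for hall, value in per_hall.items() if value.strip() != expected
    )


if __name__ == "__main__":
    # Hypothetical values; the live string would come from
    # `get \xff\xff/connection_string`.
    live = "cluster:newid@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500"
    halls = {
        "hall-a": live,
        "hall-b": "cluster:oldid@10.0.0.1:4500,10.0.0.2:4500,10.0.0.4:4500",
        "hall-c": live,
    }
    for hall in find_stale_halls(live, halls):
        print(f"{hall} has an outdated connection string")
```

Any non-empty result means that clients reading from the listed halls will start with an outdated connection string.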
FDB Kubernetes operator
v1.47.0
Kubernetes version
Cloud provider