consul-replicate syncs empty data during raft leader election in the master DC #82
I managed to reproduce this on a test cluster as follows: 3 machines in dc1, with consul-replicate running on a consul server in dc2, replicating 5 different prefixes from dc1 to dc2.
The pub prefix has ~120k records in it. On a server in dc1, every 10 seconds I "touch" 9 keys and erase 1 random key.
In a separate console I am sending SIGKILL to all consul servers in dc1, which are immediately started again by the "perp" supervisor.
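For anyone trying to reproduce the churn side of this, a rough sketch of the touch-and-delete loop using the official Go API client (github.com/hashicorp/consul/api) could look like the following. The key names under pub/ are assumptions for illustration, not the exact script used here.

```go
package churn

import (
	"fmt"
	"log"
	"math/rand"
	"time"

	"github.com/hashicorp/consul/api"
)

// Churn rewrites 9 keys under pub/ and deletes 1 random key every 10 seconds,
// so the prefix keeps changing while the dc1 servers are killed and restarted.
func Churn() {
	client, err := api.NewClient(api.DefaultConfig()) // local agent in dc1
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	for {
		// "Touch" 9 keys by rewriting their values.
		for i := 0; i < 9; i++ {
			key := fmt.Sprintf("pub/key-%d", i)
			if _, err := kv.Put(&api.KVPair{Key: key, Value: []byte(time.Now().String())}, nil); err != nil {
				log.Printf("put %s: %v", key, err)
			}
		}

		// Erase 1 random key under the same prefix.
		victim := fmt.Sprintf("pub/key-%d", 10+rand.Intn(100000))
		if _, err := kv.Delete(victim, nil); err != nil {
			log.Printf("delete %s: %v", victim, err)
		}

		time.Sleep(10 * time.Second)
	}
}
```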
After several passes I see how all keys under the pub/ prefix get deleted:
The interesting thing is that the data from the other prefixes is received and shown as already replicated, and thus not marked for deletion:
However, it appears that consul-replicate is receiving blank data from the "pub" prefix watch and concluding that all records were erased from the master dc; as a result, they are removed from the follower dc as well. I am not sure whether this is a problem with consul-replicate not waiting long enough for the master dc cluster to reach a consistent state and thus triggering/receiving incomplete keyprefix watch data, or a problem with consul itself, which under certain unknown conditions sends blank data in response to keyprefix watches. I am still not sure how blank data for our pub prefix got past this check, as it appears to be per-prefix:
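A minimal sketch of the kind of per-prefix guard being described, purely as an illustration and not consul-replicate's actual code; the function name and the shape of the inputs are assumptions:

```go
package guard

import "github.com/hashicorp/consul/api"

// shouldApplyDeletes is a simplified stand-in for a per-prefix sanity check:
// if the source datacenter suddenly reports zero keys under a prefix that
// previously held data, treat the result as suspect instead of deleting
// everything in the follower dc.
func shouldApplyDeletes(source api.KVPairs, previouslySeen int) bool {
	if len(source) == 0 && previouslySeen > 0 {
		// A whole prefix vanishing at once is more likely a blank read
		// during a leader election than a genuine mass delete.
		return false
	}
	return true
}
```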
Any assistance will be highly appreciated.
Still not sure if this is a consul issue or a consul-replicate issue. I checked the consul changelog and found the following: do you think it might be related? I am currently in the process of upgrading my test cluster to the latest consul version so I can see whether upgrading makes this issue go away. Any assistance will be highly appreciated.
Quick update on the issue: this case has proven to be present even with the latest available consul version, 1.0.6, and raft version 1. Steps to reproduce:
safels and safetree behave exactly like the native ls and tree, with one exception: they will *refuse* to render the template if the KV prefix query returns blank/empty data. This is especially useful for rendering mission-critical files that do not tolerate ls/tree KV queries returning blank data. safels and safetree work in stale mode just like their ancestors, but we get extra safety on top. The safels and safetree commands were born as an attempt to mitigate the issues described here: hashicorp#1131 hashicorp/consul#3975 hashicorp/consul-replicate#82
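safels and safetree come from a consul-template fork, so the following is only a rough Go illustration of the guard they describe: refuse to rewrite a rendered file when the KV prefix query comes back empty. The function name, render callback, and file mode are assumptions.

```go
package saferender

import (
	"fmt"
	"os"

	"github.com/hashicorp/consul/api"
)

// renderIfNonEmpty rewrites outPath only when the prefix query returns data,
// mirroring the "refuse to render on blank data" idea described above.
func renderIfNonEmpty(kv *api.KV, prefix, outPath string, render func(api.KVPairs) []byte) error {
	pairs, _, err := kv.List(prefix, &api.QueryOptions{AllowStale: true})
	if err != nil {
		return err
	}
	if len(pairs) == 0 {
		// Mission-critical files should not be clobbered by a blank read.
		return fmt.Errorf("prefix %q returned no data; keeping existing %s", prefix, outPath)
	}
	return os.WriteFile(outPath, render(pairs), 0o644)
}
```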
Might avoid doing hashicorp/consul-template#1132. And might fix the following bugs:
* hashicorp/consul-replicate#82
* hashicorp#3975
* hashicorp/consul-template#1131
…#4554) Ensure that DB is properly initialized when performing stale queries. Addresses:
- hashicorp/consul-replicate#82
- #3975
- hashicorp/consul-template#1131
Fixed by hashicorp/consul#4554
@pierresouchay We are facing a similar issue with consul 1.4.0. After the upgrade, consul-replicate started deleting the keys once in a while. We run consul-replicate alongside one of the consul servers. We're trying to reproduce the issue as of now.
@vaLski, just wanted to say 'thanks' for this thoroughly documented issue. It really helped us out.
@arecker Have you experienced the same issue again? It is supposed to be fixed in hashicorp/consul#4554. Since we upgraded our consul servers to the 1.2.3 release, which carries the mentioned patch, the problem went away.
@nitsh Did you find the reason for the problem and a solution for it? Kindly share further details if so. If you are still facing this issue, kindly run consul-replicate in trace mode and attach log snippets where we can see tracing information for the errors, debugging info while deleting data, etc. In my personal case I was seeing some 5xx class errors in the consul-replicate logs during the master-dc leader election, and shortly afterwards it deleted XXXXXXX keys, as consul-replicate was configured with the max_stale option.
@arecker Have you experienced the same issue again? It is supposed to
be fixed in hashicorp/consul#4554. Since we upgraded our consul
servers to 1.2.3 release which carry the mentioned patch, the problem
went away.
We did, but we are still running an older version of consul. We plan to
upgrade to 1.2.3 soon so we don't run into it again.
--
Alex Recker
[email protected]
Hello.
We ran into a very strange issue with consul-replicate, three times.
During an outage and raft leader election in the main data center, consul-replicate in the follower data centers synchronized blank or incomplete data from the main dc.
As a result, a huge amount of data was erased from the prefixes we were replicating in the follower data centers. We experienced this with a prefix holding ~150000 subkeys/records.
UPDATE: My latest tests show that this issue can NOT be recreated if I configure consul-replicate without max_stale.
I suspect there might be something like a race condition, where the consul-replicate prefix watch is attached to a main dc follower that has not yet fully caught up with the master and whose KV db was empty, resulting in a KV prefix watch event firing with blank data. In this case I am not really sure how it got past the following check:
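Independently of whatever check consul-replicate performs internally, a stale read can be sanity-checked client-side before its result is trusted. The sketch below uses the official Go API client; the 5-second threshold is arbitrary, and this is only an illustration of how an empty stale result from a lagging or freshly restarted follower could be detected, not what consul-replicate actually does.

```go
package stalecheck

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

// listPrefixStale performs a stale read (roughly what enabling max_stale
// allows) and refuses to trust a suddenly-empty answer from a server that
// either does not know a leader or has not heard from one recently.
func listPrefixStale(kv *api.KV, prefix string) (api.KVPairs, error) {
	pairs, meta, err := kv.List(prefix, &api.QueryOptions{
		Datacenter: "dc1", // the "master" dc being replicated from
		AllowStale: true,
	})
	if err != nil {
		return nil, err
	}
	if len(pairs) == 0 && (!meta.KnownLeader || meta.LastContact > 5*time.Second) {
		return nil, fmt.Errorf("ignoring empty stale result (known leader: %v, last contact: %s)",
			meta.KnownLeader, meta.LastContact)
	}
	return pairs, nil
}
```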
Here is a quick timeline of what happened, plus logs.
This was the last successful replication prior to the main dc outage:
The main data center goes down and consul-replicate in the follower data center starts returning errors. Some of the logs are skipped as they are identical:
A retry was successful and some data was replicated:
The main data center goes down again:
The very next log line shows consul-replicate synchronizing empty data from the main data center, resulting in all keys under the pub prefix being deleted:
Also, after the outage in the main data center was resolved, it refused to synchronize proper data from the main dc KV store again, due to the following code:
To force re-synchronization I had to recursively erase all entries under the following prefix:
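For completeness, that "force re-sync" workaround can be expressed with the Go API client's DeleteTree. The statusPrefix argument is a placeholder, since the actual prefix is not quoted above; use whatever status path your consul-replicate instance is configured with.

```go
package forcesync

import "github.com/hashicorp/consul/api"

// forceResync recursively deletes the replicator's bookkeeping entries in the
// local datacenter so that consul-replicate replays everything from the
// source dc on its next pass. statusPrefix is a placeholder for whatever
// status path your consul-replicate instance is configured to use.
func forceResync(client *api.Client, statusPrefix string) error {
	_, err := client.KV().DeleteTree(statusPrefix, nil)
	return err
}
```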
I am not sure whether adjusting the following options could have prevented this (syncing blank data) from happening. I am still unsure how to read those timers exactly. Do they mean "force replication every N seconds", or am I getting this wrong? How are they related to the cluster's consistent state?
I am aware that the consul version we are using is a bit old, but I suspect that the issue is more likely related to consul-replicate.