
consul-replicate sync empty data during raft leader election in the master DC #82

Closed
vaLski opened this issue Mar 15, 2018 · 8 comments

vaLski commented Mar 15, 2018

Hello.

We ran into a very strange issue with consul-replicate - three times now.

consul-replicate v0.3.1 (172f34b) - configured with **max_stale = "5m"**
consul v0.7.5

During an outage and raft leader election in the main data center, consul-replicate in the follower data centers synchronized blank or incomplete data from the main dc.

As a result, a huge amount of data was erased from the prefixes we were replicating in the follower data centers. We experienced this with a prefix holding ~150,000 subkeys/records.

UPDATE: My latest tests show that this issue can NOT be recreated if I configure consul-replicate without max_stale.

I suspect there might be something like a race condition, where the consul-replicate prefix watch is attached to a main dc follower that has not yet caught up with the master and whose kv db is still empty, resulting in a kv prefix watch event firing with blank data. In that case I am not really sure how it got past the following check:

// Get the prefix data
view, ok := r.get(prefix)
if !ok {
    log.Printf("[INFO] (runner) no data for %q", prefix.Dependency)
    doneCh <- struct{}{}
    return
}
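
For illustration, here is a minimal sketch (not actual consul-replicate code; the guard and its names such as pairs are hypothetical) of the kind of extra check I would have expected to stop an empty watch result from wiping a previously populated prefix:

// Hypothetical guard, not present in consul-replicate: if the source prefix
// comes back empty but we have successfully replicated data for it before,
// treat the result as suspect and skip this pass instead of deleting everything.
if len(pairs) == 0 && status.LastReplicated > 0 {
    log.Printf("[WARN] (runner) empty result for %q but LastReplicated=%d, skipping pass",
        prefix.Dependency, status.LastReplicated)
    doneCh <- struct{}{}
    return
}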

Here is a quick timeline of what happened, plus logs.

This was the last successful replication prior to the main dc outage:

20180313T084044.363696: 2018/03/13 08:40:44.363681 [INFO] (runner) replicated 0 updates, 23 deletes

The main data center goes down and consul-replicate in the follower data center starts to return errors. Some logs are skipped as they are identical.

20180313T084122.659085: 2018/03/13 08:41:22.659073 [WARN] (view) kv.list(pub@main-dc): Unexpected response code: 500 (retry attempt 1 after "10s")

The retry was successful and some data was replicated:

20180313T084251.126752: 2018/03/13 08:42:51.126734 [INFO] (runner) quiescence timers starting
20180313T084251.276830: 2018/03/13 08:42:51.276821 [INFO] (runner) quiescence minTimer fired
20180313T084251.276842: 2018/03/13 08:42:51.276827 [INFO] (runner) running
20180313T084252.059906: 2018/03/13 08:42:52.059891 [INFO] (runner) replicated 49 updates, 53 deletes
20180313T084259.630092: 2018/03/13 08:42:59.630080 [INFO] (runner) quiescence timers starting
20180313T084259.780182: 2018/03/13 08:42:59.780168 [INFO] (runner) quiescence minTimer fired
20180313T084259.780193: 2018/03/13 08:42:59.780179 [INFO] (runner) running

The main data center goes down again:

20180313T084409.232366: 2018/03/13 08:44:09.232265 [WARN] (view) kv.list(pub@main-dc): Unexpected response code: 500 (retry attempt 4 after "1m20s")
20180313T085144.446896: 2018/03/13 08:51:44.446804 [WARN] (view) kv.list(pub@main-us): Unexpected response code: 500 (retry attempt 6 after "5m20s")

The very next log line shows consul-replicate synchronizing empty data from the main data center, resulting in all keys from the pub prefix being deleted:

20180313T090040.647493: 2018/03/13 09:00:40.647411 [INFO] (runner) replicated 0 updates, 127719 deletes

Also, after the outage in the main data center was resolved, it refused to synchronize proper data from the main dc KV store again, due to the following code:

// Ignore if the modify index is old
if pair.ModifyIndex <= status.LastReplicated {
    log.Printf("[DEBUG] (runner) skipping because %q is already "+
        "replicated", key)
    continue
}

To force re-synchronization I had to recursively erase all entries under the following prefix:

consul kv delete -recurse service/consul-replicate/statuses
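
For anyone who ends up in the same state: before wiping the statuses you can inspect them first and confirm the stored indexes are ahead of the restored keys (just standard consul kv usage against the same path):

# The stored LastReplicated index will be higher than the ModifyIndex
# of the keys that reappeared after the outage.
consul kv get -recurse service/consul-replicate/statuses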

Not sure if adjusting the following options could have prevented this (syncing blank data) from happening. I am still unsure how to read those timers exactly. Do they mean "force replication every N seconds", or am I getting this wrong? How are they related to the cluster reaching a consistent state?

# This is the quiescence timers; it defines the minimum and maximum amount of
# time to wait for the cluster to reach a consistent state before replicating.
# This is useful to enable in systems that have a lot of flapping, because it
# will reduce the number of times a replication occurs.
wait {
  min = "5s"
  max = "10s"
}
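
Based on the tests mentioned in the update above, the only change that reliably avoided the blank sync on our side was dropping max_stale entirely. A sketch of the relevant part of our config, everything else unchanged:

# max_stale removed: reads for the source prefixes go to the dc1 leader
# instead of a possibly lagging or empty follower.
# max_stale = "5m"

wait {
  min = "5s"
  max = "10s"
}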

I am aware that the consul version we are using is a bit old, but I suspect the issue is more likely related to consul-replicate.


vaLski commented Mar 19, 2018

I managed to reproduce this on a test cluster as follows:

3 machines in dc1
3 machines in dc2

consul-replicate running on a consul server in dc2 is replicating 5 different prefixes from dc1 to dc2:

source = "pub@dc1"
source = "shared@dc1"
source = "priv@dc1"
source = "consul-template@dc1"
source = "event/input@dc1"

The pub prefix has ~120k records in it.

On a server in dc1, every 10 seconds I "touch" 9 keys and erase 1 random key:

while true; do
  cnt=0
  # Pick 10 random keys from the prefix; re-put the first 9 and delete the 10th.
  consul kv get -recurse pub | shuf | head -n 10 | sed 's/:/ /' | while read key value; do
    if [[ ${cnt} == 9 ]]; then
      consul kv delete "${key}"
    else
      consul kv put "${key}" "${value}"
    fi
    cnt=$((cnt+1))
  done
  sleep 10
done

In a separate console I am sending SIGKILL to all consul servers in dc1, which are immediately started again by the "perp" supervisor:

while true; do
  # SIGKILL consul on each dc1 server; perp restarts it immediately.
  for i in sof1 sof2 sof3; do
    ssh ${i} 'perpctl k consul' &
  done
  sleep 30
done

After several passes I see all keys under the pub/ prefix being deleted:

2018-03-19 12:12:55.432404 2018/03/19 12:12:55.432397 [INFO] (runner) replicated 0 updates, 127739 deletes

The interesting thing is that the data from the other prefixes is received and shown as already replicated, and is thus not marked for deletion:

2018-03-19 12:08:33.180186 2018/03/19 12:08:33.180182 [DEBUG] (runner) skipping because "event/input/cc77a17c-5e48-4614-995f-470dd5f4ed0b" is already replicated

However, it appears that consul-replicate is receiving blank data from the "pub" prefix watch and concluding that all records were erased from the master dc. As a result they are removed from the follower dc as well.

I am not sure whether the problem is consul-replicate not waiting long enough for the master dc cluster to reach a consistent state, triggering/receiving incomplete keyprefix watch data, or whether it is a problem with consul itself, which under certain unknown conditions sends blank data in response to keyprefix watches.
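
One way to check the consul side directly would be to issue a stale recurse read against an individual dc1 server right after it restarts and see whether it answers the list with an empty body (the hostname below is just a placeholder):

# Stale read of the prefix against a single dc1 server; an empty response here
# would mean that server itself is answering the keyprefix list with no data.
curl -s "http://dc1-server-1:8500/v1/kv/pub?recurse&stale"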

I am still not sure how blank data for our pub prefix got past this check, as it appears to be per-prefix:

// Get the last status
status, err := r.tStatus(prefix)
if err != nil {
    errCh <- fmt.Errorf("failed to read replication status: %s", err)
    return
}

Any assistance will be highly appreciated.


vaLski commented Mar 20, 2018

Still not sure if this is a consul issue or a consul-replicate issue.

I checked the consul changelog and found the following:

hashicorp/consul#2644

Do you think it might be related?

Currently I am in the process of upgrading my test cluster to the latest consul version so I can see whether upgrading makes this issue go away.

Any assistance will be highly appreciated.


vaLski commented Mar 20, 2018

Quick update on the issue.

This issue has been shown to be present even with the latest available consul version, 1.0.6, and raft version 1.

Steps to reproduce

  • create 3 servers in dc1
  • create 3 servers in dc2
  • start consul-replicate replicating a selected prefix from dc1 to dc2
  • push ~130,000-150,000 kv records into the prefix
  • start randomly reading 7 and erasing 3 records from the prefix every 10 seconds
  • at the same time, forcefully restart consul on all 3 nodes in the master dc1 every 30 seconds
  • test started 1521549485
  • data loss in DC2 at 1521553967
    2018-03-20 13:52:44.040452 2018/03/20 13:52:44.040433 [INFO] (runner) replicated 0 updates, 127118 deletes

vaLski added a commit to vaLski/consul-template that referenced this issue Aug 15, 2018
safels and safetree behave exactly like the native ls and tree with one exception: they will *refuse* to render the template if the KV prefix query returns blank/empty data.

This is especially useful for rendering mission-critical files that cannot tolerate ls/tree KV queries returning blank data.

safels and safetree work in stale mode just like their ancestors, but with extra safety on top.

The safels and safetree commands were born as an attempt to mitigate the issues described here:

  hashicorp#1131
  hashicorp/consul#3975
  hashicorp/consul-replicate#82
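
For reference, template usage would look roughly like the sketch below (the function name spelling follows the commit message above and may differ in a released consul-template):

{{ range safetree "pub" }}
{{ .Key }}: {{ .Value }}
{{ end }}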
freddygv pushed a commit to hashicorp/consul that referenced this issue Aug 23, 2018
@pierresouchay

Fixed by hashicorp/consul#4554


nitsh commented Jan 3, 2019

@pierresouchay We are facing a similar issue with consul 1.4.0. After the upgrade, consul-replicate started deleting keys once in a while. We run consul-replicate alongside one of the consul servers. We are trying to reproduce the issue as of now.


arecker commented Jan 3, 2019

@vaLski, just wanted to say 'thanks' for this thoroughly documented issue. It really helped us out.


vaLski commented Jan 4, 2019

@arecker Have you experienced the same issue again? It is supposed to be fixed in hashicorp/consul#4554. Since we upgraded our consul servers to the 1.2.3 release, which carries the mentioned patch, the problem went away. Thanks to @pierresouchay for the fix and for the assistance provided while tracking this down.

@nitsh Did you find the reason for the problem and a solution for it? Kindly share further details if so. If you are still facing this issue, run consul-replicate in trace mode and attach log snippets where we can see the errors and the debug output around the deletions. In my case I was seeing some 5xx class errors in the consul-replicate logs during the master-dc leader election, and shortly afterwards it deleted XXXXXXX keys, since consul-replicate was configured with the stale flag set and was syncing data from a raft follower whose raft log (data) was empty. That is supposed to be fixed already. However, since then consul-replicate in our setup is configured to NEVER use the stale flag, but rather to always query the raft leader in the parent DC.
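
For example, something along these lines should produce usable logs (a sketch only; the flag names are what I remember from the consul-replicate README, so double-check them, including whether your build accepts a trace level):

# Verbose logging, and max-stale set to 0 so list queries are not served
# from a possibly-empty follower (flag names are assumptions, verify locally).
consul-replicate \
  -log-level debug \
  -max-stale 0s \
  -prefix "pub@dc1"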


arecker commented Jan 4, 2019 via email

eikenb pushed a commit to hashicorp/consul-template that referenced this issue Sep 10, 2019
vaLski closed this as completed Feb 26, 2020