
Consul Upgrade with Replicate Results in Missing KVs #8351

Closed

alkalinecoffee opened this issue Jul 21, 2020 · 1 comment
Labels
needs-investigation The issue described is detailed and complex.

alkalinecoffee commented Jul 21, 2020

Overview of the Issue

We have three datacenters running 1.7.2:

us-west-2 (primary DC)
us-east-1 (runs consul-replicate)
us-east-2 (runs consul-replicate)

We were hoping to upgrade to 1.7.3 to avoid the bug described at #7396.

We use consul-replicate v0.4.0 (886abcc) to replicate a subset of keys from us-west-2 into the other datacenters (i.e. consul-replicate -prefix "apps@us-west-2").
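
For completeness, the replicate process in each destination DC is started along these lines (a sketch; the exact flags in our unit files may differ, and -log-level debug is shown only because the syslog output below is at debug level):

consul-replicate -prefix "apps@us-west-2" -log-level debug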

  1. We created a new 3-node stack in us-east-1 running 1.7.3 and joined them to the existing 3-node 1.7.2 cluster (a quick way to check which versions are live in the DC is sketched after this list)
  2. Once the new nodes joined the cluster, we noticed that any services we run that use consul-template began to fail with invalid configurations (null/missing KVs, etc)
  3. We then immediately deactivated the new stack, reverting the cluster back to 1.7.2
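
For anyone trying to reproduce this, the mixed-version window can be watched with the standard Consul CLI (nothing here is specific to our setup):

consul members                    # the Build column shows each agent's Consul version
consul operator raft list-peers   # shows which servers are currently raft voters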

Upon investigation, we noticed that the /apps folder in us-east-1 no longer appeared in the UI, yet consul-replicate was logging the following lines to syslog:

2020/07/21 15:24:37.393702 [DEBUG] (runner) skipping because "apps/monitor/build/healthcheck-path" is already replicated
2020/07/21 15:24:37.393717 [DEBUG] (runner) skipping because "apps/monitor/build/https-enabled" is already replicated
2020/07/21 15:24:37.393731 [DEBUG] (runner) skipping because "apps/monitor/build/metrics-path" is already replicated

To clear this odd state out, we tried deleting the key path in us-east-1:

consul kv delete /apps
Success! Deleted key: apps

We restarted consul-replicate, but the same "already replicated" messages appeared in the logs. We ended up re-importing the KVs from a backup file, which got us back to a healthy state.
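
For anyone who ends up in the same state: the restore was a plain KV export/import; a minimal sketch, assuming the backup was taken earlier with consul kv export (the prefix and filename are illustrative):

consul kv export apps/ > apps-backup.json    # taken earlier against a healthy DC
consul kv import @apps-backup.json           # re-import into the affected DC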

Key Takeaways

  • The issue initially appeared when the new 1.7.3 cluster connected to the 1.7.2 cluster
  • Only KVs that are pulled into the cluster via consul-replicate appeared to be missing in the UI -- all other DC-local KVs remained untouched
    • It's our understanding that consul-replicate does not perform any deletes, only inserts/updates
  • After shutting off the 1.7.3 stack and fully reverting back to 1.7.2, we continued to have problems with missing KVs
  • KVs appeared missing from the cluster, yet consul-replicate showed that they existed (even through multiple restarts)
  • Deleting the KV with consul kv delete /apps and restarting consul-replicate failed to trigger a replication
  • ACLs are not enabled in the affected cluster
  • Importing KVs from a backup file was the only way we could recover these ghosted KVs
    • consul-replicate worked as expected once the missing KVs were imported
  • Our templates were being rendered with empty/null key values and services were reloaded with bad configs
    • We often use the key function in consul-template to block rendering when the desired key does not exist, which should have prevented this (see the template sketch after this list)
    • But it appears that consul-template saw that the keys somehow existed with null/empty values and rendered them out anyway
  • Possible related issues?
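
For reference, the blocking pattern we rely on looks roughly like this (the key paths are the ones from the logs above; the surrounding template is illustrative, not our production config):

# sketch of a rendered config: key blocks rendering until the path exists; keyOrDefault falls back instead
healthcheck_path = "{{ key "apps/monitor/build/healthcheck-path" }}"
https_enabled    = "{{ keyOrDefault "apps/monitor/build/https-enabled" "false" }}"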

Open Questions

  • Are there any known issues with KV sync during an upgrade that may trigger this broken state?
  • Is there some type of -force option for consul-replicate to trigger a write no matter what?

Operating system and Environment details

Amazon Linux 1, EC2

@jsosulska jsosulska added the needs-investigation The issue described is detailed and complex. label Jul 21, 2020
@ChipV223 (Contributor) commented:

Hi @alkalinecoffee !

I just tried to repro this issue on my end and after the upgrade of the first DC, the Consul Replicate process running in DC2 was still operational. Here is how I set up my repro:

  • 6 VMs using Vagrant (3 VMs for DC1 and 3 VMs for DC2)
  • Both DCs running Consul v1.9.5+ent
  • Created a few KVs in DC1, wan-federated the DCs, and started the Consul Replicate process in DC2 to pull in the DC1 KVs
  • Following the upgrade procedure, I upgraded DC1 to v1.10.0+ent beta1 (leaving the leader server for last)
  • Once I confirmed that DC1 was operational post-upgrade, I checked the Consul Replicate process in DC2 and did not see the skipping because "apps/monitor/build/healthcheck-path" is already replicated messages that you mentioned earlier. I also checked the Consul UI on one of the DC2 nodes and saw that the folder I created for the KVs was still there and intact. I even created more KVs in DC1 and saw them replicated in DC2 (a CLI spot-check for this is sketched after this list)
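
A CLI spot-check along these lines can confirm the same thing from a DC2 node (the prefix, key name, and datacenter name are illustrative):

consul kv get -recurse apps/                       # list what consul-replicate has pulled into DC2
consul kv put -datacenter=dc1 apps/smoke-test ok   # write a new key in DC1, then re-run the command above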

I'll close this for now since it's been a while since the last response and I've not been able to repro this behavior. But do feel free to drop a comment if you are still seeing this, and I can reopen and look into it further.
