Overview of the Issue
We have three datacenters running 1.7.2. We were hoping to upgrade to 1.7.3 to avoid the bug described at #7396.
We use consul-replicate v0.4.0 (886abcc) to replicate a subset of keys from us-west-2 into the other datacenters (i.e. consul-replicate -prefix "apps@us-west-2").
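For context, this is roughly how the replicator is invoked in each destination datacenter; only the -prefix flag we actually use is shown, and the agent address and other settings (which are environment specific) are omitted:
# One consul-replicate process runs in each destination datacenter against its local agent.
# The prefix@datacenter form pulls keys under apps/ from us-west-2 into the local datacenter.
consul-replicate -prefix "apps@us-west-2"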
We created a new 3-node stack in us-east-1 running 1.7.3 and joined the new nodes to the existing 3-node 1.7.2 cluster
Once the new nodes joined the cluster, we noticed that any services we run that use consul-template began to fail with invalid configurations (null/missing KVs, etc.)
We then immediately deactivated the new stack, reverting the cluster back to 1.7.2
Upon investigation, we noticed that the /apps folder in us-east-1 no longer appeared in the UI, yet consul-replicate was logging the following lines to syslog:
2020/07/21 15:24:37.393702 [DEBUG] (runner) skipping because "apps/monitor/build/healthcheck-path" is already replicated
2020/07/21 15:24:37.393717 [DEBUG] (runner) skipping because "apps/monitor/build/https-enabled" is already replicated
2020/07/21 15:24:37.393731 [DEBUG] (runner) skipping because "apps/monitor/build/metrics-path" is already replicated
To clear this odd state, we tried deleting the key path in us-east-1:
consul kv delete /apps
Success! Deleted key: apps
We restarted consul-replicate again, but the same already replicated messages appeared in the logs. We ended up re-importing the KVs from a backup file, which got us back to a healthy state.
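The re-import itself was the standard KV export/import round trip, roughly as sketched below (the file name and -datacenter value are illustrative; the export had been taken while the apps/ prefix was still intact):
# Backup taken earlier from a healthy copy of the data
consul kv export apps > apps-backup.json
# Re-import the backup into the affected datacenter
consul kv import -datacenter=us-east-1 @apps-backup.json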
Key Takeaways
The issue initially appeared when the new 1.7.3 cluster connected to the 1.7.2 cluster
Only KVs that are pulled into the cluster via consul-replicate appeared to be missing in the UI -- all other DC-local KVs remained untouched
It's our understanding that consul-replicate does not perform any deletes--only inserts/updates
After shutting off the 1.7.3 stack and fully reverting back to 1.7.2, we continued to have problems with missing KVs
KVs appeared missing from the cluster, yet consul-replicate showed that they existed (even through multiple restarts)
Deleting the KV with consul kv delete /apps and restarting consul-replicate failed to trigger a replication
ACLs are not enabled in the affected cluster
Importing KVs from a backup file was the only way we could recover these ghosted KVs
consul-replicate worked as expected once the missing KVs were imported
Our templates were being rendered with empty/null key values and services were reloaded with bad configs
We often use the #key function in consul-template to block when the desired key does not exist, which should have prevented this (a minimal template sketch follows this list)
But it appears that consul-template saw that keys somehow existed with null/empty values and rendered them out anyway
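To illustrate the pattern (key paths taken from the replication logs above; the real templates are larger and the /metrics fallback is made up for the example), here is a minimal sketch of the consul-template usage we mean -- key blocks rendering until the path exists, while keyOrDefault is shown only as a contrast because it silently falls back to a default:
# Blocks template rendering until the key is present -- the behavior we rely on
healthcheck-path = "{{ key "apps/monitor/build/healthcheck-path" }}"
# For contrast only: keyOrDefault never blocks; it renders the fallback value instead
metrics-path = "{{ keyOrDefault "apps/monitor/build/metrics-path" "/metrics" }}"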
Open Questions
Is there a -force option for consul-replicate to trigger a write no matter what?
Operating system and Environment details
Amazon Linux 1, EC2

I just tried to repro this issue on my end, and after the upgrade of the first DC the Consul Replicate process running in DC2 was still operational. Here is how I set up my repro (a command-level sketch follows the list):
6 VMs using Vagrant (3 VMs for DC1 and 3 VMs for DC2)
Both DCs running Consul v1.9.5+ent
Created a few KVs in DC1, wan-federated the DCs, and started the Consul Replicate process in DC2 to pull in the DC1 KVs
Following the upgrade procedure, I upgraded DC1 to v1.10.0+entbeta1 (leaving the leader server for last)
Once I confirmed that DC1 was operational post-upgrade, I checked on the Consul Replicate process in DC2 and I didn't see the 2020/07/21 15:24:37.393702 [DEBUG] (runner) skipping because "apps/monitor/build/healthcheck-path" is already replicated messages that you mentioned earlier. I also checked the Consul UI on one of the DC2 nodes and saw that the folder I created for the KVs was still there and intact. I even created more KVs in DC1 and saw them being replicated in DC2
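For completeness, the repro boils down to commands along these lines (hostnames, key names, and the dc1/dc2 datacenter names are placeholders for my Vagrant setup; enterprise licensing and the rest of the agent configuration are omitted):
# In DC1: seed a few test keys
consul kv put apps/example/foo bar
# WAN-federate the datacenters: from a DC2 server, join a DC1 server over the WAN
consul join -wan <dc1-server-address>
# In DC2: start Consul Replicate, pulling the DC1 prefix into the local datacenter
consul-replicate -prefix "apps@dc1"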
I'll close this for now since it's been a while since the last response and I've not been able to repro this behavior. But do feel free to drop a comment if you are still seeing this behavior and I can reopen & look into it further