Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vault suffering complete freeze after connection issues with the backend #3896

Closed
MrArtichaut opened this issue Feb 2, 2018 · 15 comments
Closed
Milestone

Comments

@MrArtichaut
Copy link

We have multiple Vault instances (in different cluster) using Zookeeper as a backend that suffer complete freeze after some network issues. Symptoms are still the same: Vault doesn't respond anymore and requests (using the CLI or Curl) are stuck for dozens of seconds. A netstat indicates that Vault has no connection to the Zookeeper anymore.

The issue always begin during episode of network issues. Due to network instability, Vault suffers multiple connection failure with Zookeeper and at one point, for no apparent reason, it seems to giveup entirely, leaving the instance without any connection to its backend.

Below are information about one (dev) instance. I've also attached logs (starting at the beginning of the networks issues until we restart it).

Finally, on the same server we have small agent (a sidecar) for registering Vault in our service discovery system. This agent use Zookeeper for registration (the same instance as Vault) and also use the same Go library for the Zookeeper connection (https://github.com/samuel/go-zookeeper). I've also attach the logs of this agent.

What's interesting is that we can see that the agent also suffer connection failure with Zookeeper but eventually recovers and continues as usual the network is one again stable.

vault-logs.txt

agent-logs.txt

Environment:

Key             Value
---             -----
Seal Type       shamir
Sealed          false
Total Shares    5
Threshold       2
Version         0.9.3
Cluster Name    dev-cluster
Cluster ID      439ec622-58c4-0199-c815-4a7798985862
HA Enabled      true
HA Mode         active
HA Cluster      https://vault.dev.net:8201
  • Vault Version: 0.9.3
  • Operating System/Architecture: CentOS 7

Vault Config File:

backend "zookeeper" {
	path = "vault"
	address = "zookeeper.dev.net"
	znode_owner = "digest:vault:************"
	auth_info = "digest:vault:*******"
	redirect_addr ="http://vault.dev.net:8200"
}

listener "tcp" {
	address = "0.0.0.0:8200"
	tls_disable = 1
}
@jefferai
Copy link
Member

jefferai commented Feb 2, 2018

Ping to see if @michaelansel or @adamdecaf or @vespian can help or provide guidance here.

@vespian
Copy link
Contributor

vespian commented Feb 2, 2018

Please have a look/test samuel/go-zookeeper#181 with samuel/go-zookeeper#186 applied.

CC: @mhrabovcin

@jefferai jefferai added this to the 0.9.4 milestone Feb 2, 2018
@jefferai
Copy link
Member

jefferai commented Feb 2, 2018

Adding to milestone to pull deps if it works as a fix.

@MrArtichaut
Copy link
Author

MrArtichaut commented Feb 2, 2018 via email

@MrArtichaut
Copy link
Author

I've reproduced my issue with the 0.9.3 by dropping 80% of the packets from Zookeeper. In only a couple of minute Vault freeze.

I've build a new version of Vault including go-zookeeper with the fix from samuel/go-zookeeper#186 and after 1 hour of (simulated) network issues, there is no deadlock. If a restore the network, Vault is up and running.

(The important part of the fix seems to be in the samuel/go-zookeeper#181 which is already lives in the master)

@MrArtichaut
Copy link
Author

After applying the fix Vault doesn't freeze anymore during network issues. However, after a certain amount of time it seals itself, making him virtually out of service.

I've attached some new logs. Network issues starts at 22:00:39 and Vault seals itself at 22:13:51. Here's the relevant part:

févr. 05 22:13:19 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:19 Connected to 192.168.11.15:2181
févr. 05 22:13:19 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:19 Authenticated: id=99438839260454533, timeout=4000
févr. 05 22:13:19 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:19 Re-submitting `1` credentials after reconnect
févr. 05 22:13:23 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:23 Recv loop terminated: err=read tcp 10.100.51.9:60628->192.168.11.15:2181: i/o timeout
févr. 05 22:13:23 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:23 Send loop terminated: err=<nil>
févr. 05 22:13:25 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:25 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout
févr. 05 22:13:26 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:26 Connected to 192.168.11.15:2181
févr. 05 22:13:27 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:27 Authenticated: id=99438839260454533, timeout=4000
févr. 05 22:13:27 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:27 Re-submitting `1` credentials after reconnect
févr. 05 22:13:30 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:30 Recv loop terminated: err=read tcp 10.100.51.9:60634->192.168.11.15:2181: i/o timeout
févr. 05 22:13:30 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:30 Send loop terminated: err=<nil>
févr. 05 22:13:30 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:30.007636 [ERROR] core: barrier init check failed: error=failed to check for initialization: zk: connection closed
févr. 05 22:13:30 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:30.007655 [ERROR] error checking health: error=failed to check for initialization: zk: connection closed
févr. 05 22:13:31 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:31 Connected to 192.168.11.15:2181
févr. 05 22:13:38 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:38 Authentication failed: zk: session has been expired by the server
févr. 05 22:13:38 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:38.344175 [ERROR] core: barrier init check failed: error=failed to check for initialization: zk: could not connect to a server
févr. 05 22:13:38 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:38.344196 [ERROR] error checking health: error=failed to check for initialization: zk: could not connect to a server
févr. 05 22:13:38 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:38.344209 [WARN ] core: leadership lost, stopping active operation
févr. 05 22:13:38 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:38.344200 [ERROR] rollback: error rolling back: path=cubbyhole/ error=failed to read salt: zk: could not connect to a server
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.344751 [ERROR] core: clearing leader advertisement failed: error=Failed to remove "/vault/core/leader/_ae5190f3-15ed-1bd3-29f5-4776728bce26": zk: could not connect to a server
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.344770 [INFO ] core: pre-seal teardown starting
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.344776 [INFO ] core: stopping cluster listeners
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.344784 [INFO ] core: shutting down forwarding rpc listeners
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.344813 [INFO ] core: forwarding rpc listeners stopped
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.431308 [INFO ] core: rpc listeners successfully shut down
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.431354 [INFO ] core: cluster listeners successfully shut down
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.431380 [DEBUG] expiration: stop triggered
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.431392 [DEBUG] expiration: finished stopping
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.431433 [INFO ] rollback: stopping rollback manager
févr. 05 22:13:40 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:40.431520 [INFO ] core: pre-seal teardown complete
févr. 05 22:13:42 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:42 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout
févr. 05 22:13:43 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:43 Connected to 192.168.11.15:2181
févr. 05 22:13:44 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:44 Authenticated: id=99438839260454541, timeout=4000
févr. 05 22:13:44 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:44 Re-submitting `1` credentials after reconnect
févr. 05 22:13:46 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:46.146197 [DEBUG] forwarding: error sending echo request to active node: error=rpc error: code = Unavailable desc = all SubConns are in TransientFailure
févr. 05 22:13:46 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:46.205854 [INFO ] core: acquired lock, enabling active operation
févr. 05 22:13:49 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:49 Recv loop terminated: err=read tcp 10.100.51.9:60650->192.168.11.15:2181: i/o timeout
févr. 05 22:13:49 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:49 Send loop terminated: err=<nil>
févr. 05 22:13:49 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:49.571597 [ERROR] core: error performing key upgrades: error=error reloading keyring: failed to check for keyring: zk: connection closed
févr. 05 22:13:51 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:51 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout
févr. 05 22:13:51 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:51.572051 [DEBUG] core: marked as sealed
févr. 05 22:13:51 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:51.572116 [DEBUG] forwarding: error sending echo request to active node: error=rpc error: code = Canceled desc = context canceled
févr. 05 22:13:51 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:51.572174 [DEBUG] core: sealing barrier
févr. 05 22:13:51 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:51.572185 [INFO ] core: vault is sealed
févr. 05 22:13:53 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:53 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout
févr. 05 22:13:53 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:53.572618 [ERROR] core: barrier init check failed: error=failed to check for initialization: zk: could not connect to a server
févr. 05 22:13:53 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:53.572636 [ERROR] error checking health: error=failed to check for initialization: zk: could not connect to a server
févr. 05 22:13:55 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:55 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout

Is this "normal" ?

@adamdecaf
Copy link
Contributor

adamdecaf commented Feb 6, 2018

It looks like a connection to zk was lost and another was reestablished, but not reused. Do you have the zk logs? (With connection ids)

I'm not that familiar with the zk -> vault backend. $work hasn't used it for a while now.

févr. 05 22:13:42 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:42 Failed to connect to 192.168.11.15:2181: dial tcp 192.168.11.15:2181: i/o timeout
févr. 05 22:13:43 DEV-VAULT-1 vault[52793]: 2018/02/05 22:13:43 Connected to 192.168.11.15:2181

@jefferai
Copy link
Member

jefferai commented Feb 6, 2018

It's normal in the sense that if Vault is totally unable to even discover whether it is initialized and cannot read its keyring, it can't actually do anything.

@MrArtichaut
Copy link
Author

MrArtichaut commented Feb 13, 2018

After more tests I'm able to reproduce this behavior and I've also discovered a new one. After simulating network issues between the Vault instances and the Zookeeper cluster for 45min, I've restablished a normal network and now my 3 nodes are in standby with no master in the cluster. They are not sealed, but as there is no master the cluster is not usable.

Key             Value
---             -----
Seal Type       shamir
Sealed          false
Total Shares    5
Threshold       2
Version         0.9.3
Cluster Name    vault-cluster-1c75d313
Cluster ID      179b5883-6fbe-587a-cfbd-3411dc389241
HA Enabled      true
HA Mode         standby
HA Cluster      n/a
Key             Value
---             -----
Seal Type       shamir
Sealed          false
Total Shares    5
Threshold       2
Version         0.9.3
Cluster Name    vault-cluster-1c75d313
Cluster ID      179b5883-6fbe-587a-cfbd-3411dc389241
HA Enabled      true
HA Mode         standby
HA Cluster      n/a
Key             Value
---             -----
Seal Type       shamir
Sealed          false
Total Shares    5
Threshold       2
Version         0.9.3
Cluster Name    vault-cluster-1c75d313
Cluster ID      179b5883-6fbe-587a-cfbd-3411dc389241
HA Enabled      true
HA Mode         standby
HA Cluster      n/a

All 3 instances now repeat the following error in the logs:

[ERROR] core: failed to acquire lock: error=lock already held

I've attached logs of the vault instances and the zookeeper instances. The Zookeeper version is 3.4.11. I've build my own version of Vault that includes the fix discussed above.

Finally, here's the iptables rules used on each Vault server to simulate network issues :

iptables -A INPUT -s 10.105.200.20 -m statistic --mode random --probability 0.8 -j DROP
iptables -A INPUT -s 10.105.200.21 -m statistic --mode random --probability 0.8 -j DROP
iptables -A INPUT -s 10.105.200.22 -m statistic --mode random --probability 0.8 -j DROP

vault01.txt
vault02.txt
vault03.txt
zookeeper1.txt
zookeeper2.txt
zookeeper3.txt

@MrArtichaut
Copy link
Author

I've done the same experience with 3 Vault using a Consul cluster as a backend. Same result. After les than 10min, all 3 instances are sealed. The last messages are always the same :

févr. 15 12:57:32 vaultdev02 vault[19634]: 2018/02/15 12:57:32.720804 [ERROR] core: failed to acquire lock: error=failed to acquire lock: Unexpected response code: 500 (No known Consul servers)
févr. 15 12:57:48 vaultdev02 vault[19634]: 2018/02/15 12:57:48.376933 [ERROR] core: failed to acquire lock: error=failed to create session: Unexpected response code: 500 (rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection)
févr. 15 12:57:58 vaultdev02 vault[19634]: 2018/02/15 12:57:58.844756 [ERROR] core: failed to acquire lock: error=failed to create session: Unexpected response code: 500 (rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection)
févr. 15 12:58:58 vaultdev02 vault[19634]: 2018/02/15 12:58:58.641623 [INFO ] core: acquired lock, enabling active operation
févr. 15 12:59:44 vaultdev02 vault[19634]: 2018/02/15 12:59:44.866167 [ERROR] core: leader advertisement setup failed: error=Unexpected response code: 500 (rpc error making call: EOF)
févr. 15 12:59:44 vaultdev02 vault[19634]: 2018/02/15 12:59:44.866869 [ERROR] core: failed to list entries in core/leader: error=Unexpected response code: 500
févr. 15 13:00:09 vaultdev02 vault[19634]: 2018/02/15 13:00:09.924272 [ERROR] core: key rotation periodic upgrade check failed: error=Unexpected response code: 500
févr. 15 13:00:57 vaultdev02 vault[19634]: 2018/02/15 13:00:57.202348 [ERROR] core: failed to acquire lock: error=failed to read lock: Unexpected response code: 500
févr. 15 13:02:04 vaultdev02 vault[19634]: 2018/02/15 13:02:04.385448 [ERROR] core: failed to acquire lock: error=failed to read lock: Unexpected response code: 500
févr. 15 13:02:25 vaultdev02 vault[19634]: 2018/02/15 13:02:25.594556 [INFO ] core: acquired lock, enabling active operation
févr. 15 13:03:26 vaultdev02 vault[19634]: 2018/02/15 13:03:26.602266 [ERROR] core: error performing key upgrades: error=error checking for key upgrades: Unexpected response code: 500
févr. 15 13:03:27 vaultdev02 vault[19634]: 2018/02/15 13:03:27.933879 [INFO ] core: vault is sealed

Is this a "normal" behavior? If yes, it bothers me because 10min is not a very time and having a service as central as Vault not being able to "survive" network issues for 10 minutes seems problematic.

Vault config:

backend "consul" {
        path = "vault"
        address = "127.0.0.1:8500"
        disable_registration = "true"
        redirect_addr ="http://vaultdev01.oodrivegroup.net:8400"
}

listener "tcp" {
        address = "0.0.0.0:8400"
        tls_disable = 1
}

iptables rules to simulates network issues:

iptables -A INPUT -s 10.105.200.31 -m statistic --mode random --probability 0.8 -j DROP
iptables -A INPUT -s 10.105.200.32 -m statistic --mode random --probability 0.8 -j DROP
iptables -A INPUT -s 10.105.200.33 -m statistic --mode random --probability 0.8 -j DROP

10.105.200.XX are consul servers node. I jam the connection between the consul agent and the consul servers, not between Vault and his agent.

@MrArtichaut
Copy link
Author

Should I open a new ticket?

@jefferai
Copy link
Member

HA nodes are reliant on locks provided by the storage layer. If those locks time out, they have to give up leadership. When a new node attempts to change to active state, one of the first things it does is it verifies that it has the latest keyring by reading it from storage. This is a safety feature; if the keyring was rotated and it did not pick up on this (let's say, due to network issues), and new data was written with the new key versions on the formerly-active node, you risk a split-brain/data-loss scenario by continuing on with an in-memory cache of the keyring. When it is unable to successfully read its keyring, there isn't much Vault can safely do. So it seals itself. We will always prioritize data safety over administrative ease.

Like all benchmarks, yours are artificial. There are many failure situations where a Vault node is totally fine. For instance, in a full network outage going to a standby node, it would never successfully complete grabbing the lock, so it would never try to become the active node, never attempt to read the keyring from storage, never encounter problems reading it from storage, and never seal itself. By doing a probabilstic dropping of packets -- something that in my experience occurs with far less frequency than a full outage -- you're enabling an uncommon failure case, and there isn't currently a known safe way to deal with it.

However, any critical infrastructure should have sufficient alerts on logs/metrics. If Vault seals itself, assuming your logs/metrics can make it off the box, operators should be able to react relatively quickly to a) fix the underlying problem and b) unseal Vault.

@adamdecaf
Copy link
Contributor

adamdecaf commented Feb 15, 2018

When it is unable to successfully read its keyring, there isn't much Vault can safely do. So it seals itself. We will always prioritize data safety over administrative ease.

This is exactly what I'd expect from a service like vault. Protection over reliability.

I don't see a problem with flake or down networks causing vault to eventually seal and instead I call that a feature!

@MrArtichaut
Copy link
Author

I totally agree concerning reliability over management ease. My concern was more about the duration. In my simulation, my entire Vault cluster sealed itself after less than 10min of disruption. That's seems short to me.

Is there a way to define a timeout or a keyring period in order to Vault to be able to survive a bit longer?

@jefferai
Copy link
Member

When you say "a bit longer", what exactly is the length of time that you think is appropriate? 11 minutes? 20? An hour?

When Vault seals itself you get an immediate notification that something is wrong, subject to appropriate monitoring. Vault could sit in a spin loop trying over and over to read its keyring. Depending on what the actual problem is, which at Vault's level it may not be privy to (for instance, massive amounts of packet loss and TCP retransmissions, which are simply surfaced as intermittent errors on network calls), this may never succeed and you may end up waiting on Vault for an hour for it to try to recover from an error that it will never recover from, giving you more downtime than if you got an alert immediately due to Vault sealing and were able to fix the problem and unseal. Vault will be unusable in these situations anyways. You're better off knowing about them as soon as possible so you can fix the underlying problem.

You are running a synthetic test to simulate failure conditions that are, in my experience, much less common in the real world than failure conditions that Vault generally can successfully recover from, with an arbitrary idea of how Vault should behave in these conditions. I don't really think I can give you any answer that will make you happy here. All I can say is what I said before: we will always prioritize safety over uptime, and just like with any other piece of critical infrastructure, you should have monitoring set up that can alert you to a catastrophic failure state immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants