This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

physical/zookeeper: Re-try to release lock in case of failure #33

Merged
merged 1 commit into from
Jun 5, 2018

Conversation

mhrabovcin

Fixes https://jira.mesosphere.com/browse/DCOS-37588

This addresses a situation where a ZK lock release fails due to temporary connection loss. If the lock release is attempted during the connection drop, the lock may not be released, and if the ZK library then reconnects within the SessionTimeout, the lock stays held and there is no code to detect, log, and retry the release.

This code was merged into the official Vault master branch - hashicorp#4569

@mhrabovcin mhrabovcin requested review from gpaul and vespian June 1, 2018 13:10
@gpaul

gpaul commented Jun 1, 2018

This is not intended to fully fix the issue, just to mitigate it, right?

The go-zookeeper library's code is very hard to follow; I don't trust it at all. For example, Lock() takes a stopCh, but Unlock() doesn't. Both of these should probably return channels which the user can then decide to wait on or break from, with a cancel() func which unblocks any retries.
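
A minimal sketch of the kind of interface I mean (hypothetical names, not the actual go-zookeeper API):

```go
package locksketch

// CancelFunc aborts any in-flight retries for the call that returned it.
type CancelFunc func()

// Lock is a hypothetical shape for the API suggested above: both
// acquire and release return a channel the caller can wait on or
// walk away from, plus a cancel func that unblocks any retries.
type Lock interface {
	// Lock begins acquisition; the channel receives nil on success or
	// the final error once retries are exhausted or cancelled.
	Lock() (<-chan error, CancelFunc)
	// Unlock begins release under the same contract.
	Unlock() (<-chan error, CancelFunc)
}
```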

Also, seeing that the author was ignoring an error returned by a distributed Unlock method is extremely disconcerting: https://github.com/mesosphere/vault/pull/33/files#diff-5d5a99d455c6dfa0caa2534d823970a8L444

Wouldn't simply checking and returning the error from Unlock() on the old line 444 be sufficient?

@mhrabovcin
Author

@gpaul you are right that this is not an explicit fix but rather an improvement of the odds.

Vault initializes the underlying ZooKeeper library with a hardcoded SessionTimeout of 1 second. The patch introduces 10 attempts to release a lock (delete the znode), each attempt 1 second apart. What we're really trying to address here are temporary problems with the TCP connection. When the connection drops and the client reconnects within a second, the session id is retained and ephemeral (lock) nodes are not automatically deleted. So if Vault tries to release a lock during this temporary outage and fails, and the library reconnects within 1 second, the lock is kept. The 10 follow-up attempts improve the situation when the session id is kept; if the session is released, ZK will delete the znodes automatically.

We'll also add explicit logging about the problem, so if it reoccurs we will have an exact pointer to the problem.
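
A minimal sketch of the retry-plus-logging approach described above, assuming hypothetical names like unlockInternal (the actual patch may differ in details):

```go
package zklock

import (
	"log"
	"time"
)

// lock is a stand-in for the backend's lock type; unlockInternal is
// assumed to perform the actual znode delete (hypothetical names).
type lock struct {
	unlockInternal func() error
	logger         *log.Logger
}

// Unlock retries the znode delete up to 10 times, one second apart,
// and logs each failure, so a brief TCP drop that keeps the ZK
// session alive does not silently leave the ephemeral lock node behind.
func (l *lock) Unlock() error {
	var err error
	for attempt := 1; attempt <= 10; attempt++ {
		if err = l.unlockInternal(); err == nil {
			return nil
		}
		l.logger.Printf("[WARN] failed to release ZK lock (attempt %d/10): %v", attempt, err)
		time.Sleep(time.Second)
	}
	return err
}
```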

Also, seeing that the author was ignoring an error returned by a distributed Unlock method is extremely disconcerting: https://github.com/mesosphere/vault/pull/33/files#diff-5d5a99d455c6dfa0caa2534d823970a8L444

Here the situation is worse: Vault, which uses the Lock interface, is ignoring errors returned from the underlying Lock implementation. I raised that concern in hashicorp#4491 (comment) and proposed that the clean solution would be to let the Vault instance exit the operation, as it has entered a state it cannot handle - hashicorp#4491 (comment). If you read the comments in the issue you'll see that the maintainer was against that solution.

Because of the discussion with the maintainer, we've chosen to go with this patch and improve the situation without an explicit fix.

@gpaul

gpaul commented Jun 4, 2018

Oh man - ok. What do you think about his last comment?

It feels like the better option is to explicitly close the connection after a step down (perhaps only in the case that the lock fails). Forcing a restart of Vault also means forcing a re-unseal of Vault...it's a very big hammer.
~ hashicorp#4491 (comment)

@mhrabovcin
Author

@gpaul I honestly think that the Lock object should not be closing the global ZK connection. That would, to me, be the most unexpected behavior for a Lock component. There is only a single ZK connection for the whole Vault program.

@gpaul

gpaul commented Jun 4, 2018

True - I guess the point is: when an error occurs it must be handled or it must crash the application. What worries me about the current solution is that it is async. I'm specifically afraid that the first thing the user will likely do upon getting an error from Unlock is to call Unlock again. You then have two retry loops trying to unlock the same Lock. Should one of them succeed, the user will at some point call Lock() again while there are still outstanding Unlock calls (maybe that's OK, since the new Lock uses a different ephemeral node?)

I guess what bugs me is that this patch adds concurrency (and complexity) without solving the actual issue.

I can move the code to synchronous execution path. The reason why I've moved it into a goroutine is that it is going to block any further execution for 10 seconds. If that is acceptable I'll change it to synchronous.
~ hashicorp#4569 (comment)

I think having this synchronous would make me sleep better at night. It seems similar to a network write or read that takes a long time before returning / timing out. To take the analogy further I propose:

  1. Add a UnlockTimeout which retries unlockInternal every second for UnlockTimeout.
  2. If unlock fails even after N seconds, return an error that can be checked for by implementing Timeout() on that error (the same way the net package does it.)
    See https://golang.org/pkg/net/#AddrError.Timeout and https://golang.org/pkg/net/#AddrError.Temporary
  3. The caller then has to check the error and decide whether to call Unlock again or tear down the application (rough sketch below).
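
A rough sketch of (1)-(3), modeled on the net package's Timeout()/Temporary() convention; the unlockWithTimeout and unlockInternal names are illustrative assumptions, not existing code:

```go
package zklock

import "time"

// unlockTimeoutError follows the net package convention: callers can
// check for Timeout()/Temporary() to decide whether to call Unlock
// again or tear the application down.
type unlockTimeoutError struct {
	after time.Duration
}

func (e *unlockTimeoutError) Error() string {
	return "zk: lock not released after " + e.after.String()
}
func (e *unlockTimeoutError) Timeout() bool   { return true }
func (e *unlockTimeoutError) Temporary() bool { return true }

// unlockWithTimeout retries unlockInternal every second until it
// succeeds or unlockTimeout elapses, then returns a Timeout() error.
func unlockWithTimeout(unlockInternal func() error, unlockTimeout time.Duration) error {
	deadline := time.Now().Add(unlockTimeout)
	for {
		if err := unlockInternal(); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return &unlockTimeoutError{after: unlockTimeout}
		}
		time.Sleep(time.Second)
	}
}

// The caller checks the error type and decides: retry, or shut down.
func handleUnlockError(err error) (retry bool) {
	if te, ok := err.(interface{ Timeout() bool }); ok && te.Timeout() {
		return true // still locked after the timeout; caller may retry or exit
	}
	return false
}
```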


@gpaul gpaul left a comment


Let's go forward with this and hope it doesn't break something else - it probably won't.

@mhrabovcin mhrabovcin merged commit 60f530e into v0.8.3-zkfix Jun 5, 2018