physical/zookeeper: Re-try to release lock in case of failure #33
Conversation
This is not intended to fully fix the issue, just to mitigate it, right? The go-zookeeper library's code is very hard to follow; I don't trust it at all. For example, Lock() takes a stopCh, but Unlock() doesn't. Both of these should probably return channels which the user can then decide to wait on or break from, with a […]. Also, seeing that the author was ignoring an error returned by a distributed […]. Wouldn't simply checking and returning the error from […]?
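As a thought experiment only, an API along the lines suggested here might look like the sketch below. Neither go-zookeeper nor Vault exposes such an interface; ChannelLock, releaseWithTimeout, and all parameters are hypothetical.

```go
package lockapi

import (
	"errors"
	"time"
)

// ChannelLock is a hypothetical interface sketching the channel-based
// Lock/Unlock suggested above; it is not part of go-zookeeper or Vault.
type ChannelLock interface {
	// Lock begins acquiring the lock; the caller selects on the returned
	// channel together with stopCh or a timeout.
	Lock(stopCh <-chan struct{}) <-chan error
	// Unlock begins releasing the lock and reports the outcome on a channel
	// instead of blocking indefinitely or failing silently.
	Unlock() <-chan error
}

// releaseWithTimeout waits for the release result, but only for a bounded time.
func releaseWithTimeout(l ChannelLock, timeout time.Duration) error {
	select {
	case err := <-l.Unlock():
		return err
	case <-time.After(timeout):
		return errors.New("timed out waiting for lock release")
	}
}
```

The point of such a shape is that waiting versus giving up becomes the caller's decision rather than the library's.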
@gpaul you are right about this not being an explicit fix but rather an improvement of the odds. Vault initializes the underlying zookeeper library with a hardcoded […]. We'll also add explicit logging about the problem, so if it reoccurs we will have an exact pointer to the problem.
Here the situation is worse. Vault, which uses the Lock interface, ignores errors returned from the underlying Lock implementation. I raised that concern in hashicorp#4491 (comment) and proposed that a clean solution would be to let the Vault instance exit the operation, since it has entered a state it cannot handle - hashicorp#4491 (comment). If you read the comments in that issue you'll see that the maintainer was against that solution. Because of the discussion with the maintainer we chose to go with this patch and improve the situation without an explicit fix.
Oh man - ok. What do you think about his last comment?
@gpaul I honestly think that […]
True - I guess the point is: when an error occurs it must be handled or it must crash the application. What worries me about the current solution is that it is async. I'm specifically afraid that the first thing the user will likely do upon getting an error from Unlock is to call Unlock again. You then have two retry loops trying to unlock the same Lock. Should one of them succeed, the user will at some point call Lock() again while there are still outstanding Unlock calls (maybe that's OK, since the new Lock uses a different ephemeral node?). I guess what bugs me is that this patch adds concurrency (and complexity) without solving the actual issue.
I think having this synchronous would make me sleep better at night. It seems similar to a network write or read that takes a long time before returning / timing out. To take the analogy further, I propose: […]
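The exact proposal isn't preserved above, but purely as an illustration, a synchronous, bounded-retry release might look roughly like this. This is a hypothetical sketch, not the code in this PR: unlockWithRetry, the one-second backoff, and the timeout parameter are all made up, and it assumes go-zookeeper's zk.Lock with its Unlock() error method.

```go
package lockretry

import (
	"fmt"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// unlockWithRetry keeps retrying the release until it succeeds or the
// deadline passes, so the caller always sees either success or a
// definitive failure instead of a fire-and-forget background loop.
func unlockWithRetry(lock *zk.Lock, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := lock.Unlock()
		if err == nil {
			return nil
		}
		if err == zk.ErrNotLocked {
			// Not a transient condition; retrying won't help.
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up on lock release: %v", err)
		}
		time.Sleep(time.Second) // back off before the next attempt
	}
}
```

The appeal of this shape is that, like a network write with a timeout, the caller gets an answer within a bounded time rather than leaving a retry loop running in the background.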
Let's go forward with this and hope it doesn't break something else - it probably won't.
Fixes https://jira.mesosphere.com/browse/DCOS-37588
This addresses the situation where a ZK lock release fails due to a temporary connection loss. If the release is attempted during the connection drop, the lock may not be released. If the ZK library reconnects within the SessionTimeout, the lock is still held and there is no code to detect, log, and attempt to release it.
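For illustration only (this is not the exact patch merged in hashicorp#4569), the mitigation described above amounts to something like the following: if the first release attempt fails, keep retrying in the background and log every failure so a stuck lock is at least detectable. The releaseWithRetries name, the one-second ticker, and the log messages are assumptions for the sketch, built on go-zookeeper's zk.Lock.

```go
package zkha

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// releaseWithRetries tries to release the lock once and, if that fails
// (for example because of a temporary connection loss), keeps retrying in
// a background goroutine while logging each failure.
func releaseWithRetries(lock *zk.Lock) error {
	err := lock.Unlock()
	if err == nil {
		return nil
	}
	log.Printf("[WARN] failed to release zookeeper lock, retrying in background: %v", err)
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for range ticker.C {
			retryErr := lock.Unlock()
			if retryErr == nil {
				log.Printf("[INFO] zookeeper lock released after retry")
				return
			}
			log.Printf("[WARN] still unable to release zookeeper lock: %v", retryErr)
		}
	}()
	return err
}
```

As discussed in the conversation above, this asynchronous form keeps the existing Unlock signature, at the cost of an unbounded background retry loop that the caller cannot observe.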
This code was merged to the official Vault master branch - hashicorp#4569.