Use single go-routine instead of mutex for shard-controller #447
When leader election on one shard fails continuously, the node controller gets stuck as well.
The reason is that the shard-controller mutex is held while the leader election is being retried. If a node fails in the meantime, the node controller also tries to acquire the mutex on that shard and blocks, which brings everything to a standstill. Technically this is not a deadlock: if the leader election recovers, then the node controller recovers too.
Modifications
Instead of using a mutex to ensure that only one operation is applied to a shard at a time, the shard controller now runs a dedicated goroutine event loop, and callers communicate with it through channels.