Highly-available & fault-tolerant validators #17189

outofforest · 2023-07-29T11:56:50Z

Highly-available & fault-tolerant validators

This document describes my findings on a possible implementation of highly available, fault-tolerant validators.

The purpose of this issue is to discuss the topic with the Cosmos SDK team and other interested people, check whether this functionality is desired (I believe it is!), provide more details on unclear parts, and discover dependencies in the code that I may have missed. Feel free to comment and give suggestions.

Once the discussion is finished, if implementing this turns out to be possible, I will invest some time to do it.

Motivation

Let's say you run a validator, so you have a server or virtual machine where the validator runs.
You now need a disaster recovery plan in case the server goes down or needs maintenance, because you don't want to be slashed.

For this, you need at least two servers and a procedure for switching the validator from one server to the other. It might be done
manually or automatically, using some heartbeat software.

The problem is that, whichever option you choose, the same validator private key currently must be moved between
the machines or coexist on both. This leads to some problems:

for security reasons, a private key should never leave the machine where it was generated
if a mistake is made, by a human or by software, and both copies of the validator are active at the same time, the validator is slashed and tombstoned.

I think we have all seen cases where even professional companies had their validator tombstoned because of such mistakes.

The source of the issue is that Cosmos SDK was never designed with these scenarios in mind.
So I started thinking about how this could be fixed, and I've found a possible fix.

If we assume that the private key of the validator can never leave the machine where it was generated,
the obvious conclusion is that each server (main and backup) must hold its own private key.
But then a problem arises, because a validator in Cosmos SDK may sign blocks and proposals with only a single private key.
This led me to conclude that the single-key assumption is the one I need to break.

After thinking more about it, I developed the idea of this framework:

each machine has its own private key, which is never shared with any other machine
the validator is defined on chain by providing all the corresponding public keys (each key representing a single HA node) - so, in this design, each validator has many keys assigned to it, not just one
only one public key (HA node) may be active at a time - only one server running the validator may sign and propose blocks, namely the one holding the private key corresponding to the active public key. Other instances are not considered to be validators.
if the active server goes down, the active public key can be switched to another one configured for that validator, so that another machine starts signing on behalf of that validator.

I see three possible conditions for switching the public key:

manual, when the staff plans to turn off the active server - it might be done personally by the operator, or someone else may be given permission via authz to do it on the operator's behalf
automatic-off-chain - by any software developed by the operator to monitor the servers - as in the case above, by issuing a transaction signed by an authorized private key
automatic-on-chain - by the chain itself - whenever the validator fails to propose a block on its turn, the public key could be rotated automatically by the consensus protocol

This functionality could be implemented as an extension to the current staking module, eliminating the huge problem validator operators experience when maintaining the validators.

Mechanics of the HA node switching

On the CometBFT side things are simple. There is a set of validators, each represented by a public key and voting power.
At the moment there is a 1 to 1 relationship between a validator in CometBFT and a validator in Cosmos SDK.

With this proposal, I want one CometBFT validator to be represented by n possible nodes (public keys) in Cosmos SDK,
grouped under a common operator address. At any time exactly one public key in Cosmos SDK is active for each operator,
so the 1 to 1 relationship between CometBFT and Cosmos SDK is still maintained. The only difference is that the set
of CometBFT validators is "more dynamic".

In practice, this means that whenever the active public key of a Cosmos SDK validator changes, it must replace the old one in CometBFT.
This is done by issuing a validator update in the end blocker of the staking module, providing the new active public key with the same voting power.
As a result, CometBFT "knows" only the active public keys, which constitute the set of active validators.

Terminology

A terminology problem arises because the word validator may now have several meanings:

a member of the validator set in CometBFT
the validator defined by the operator in Cosmos SDK, grouping many fault-tolerant, highly available servers
the server running the blockchain node

In the spec below I use HA node for the third meaning. Suggestions for good names for the first and second meanings are welcome.

End blocker

As mentioned earlier, whenever the set of active HA nodes changes, the staking module must prepare the set
of validator updates to be passed to CometBFT.
The old HA node must be removed by setting the voting power of its public key to 0, and the new active one must be added
in its place by setting the voting power for its public key.
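
The pair of updates described above can be sketched as follows. This is only an illustration under simplifying assumptions: ValidatorUpdate is a hypothetical stand-in for the real ABCI type, public keys are plain strings, and SwitchUpdates is a name invented for this example, not an existing staking keeper function.

```go
package main

// ValidatorUpdate is a simplified, hypothetical stand-in for the update
// passed to CometBFT: a public key together with its voting power.
type ValidatorUpdate struct {
	PubKey string // assumption: some string encoding of the consensus key
	Power  int64
}

// SwitchUpdates returns the updates needed to replace oldKey with newKey
// while keeping the validator's voting power unchanged.
func SwitchUpdates(oldKey, newKey string, power int64) []ValidatorUpdate {
	if oldKey == newKey {
		return nil // the active HA node did not change, nothing to send
	}
	return []ValidatorUpdate{
		{PubKey: oldKey, Power: 0},     // power 0 removes the old HA node
		{PubKey: newKey, Power: power}, // the new HA node takes over the same power
	}
}
```

Calling SwitchUpdates("keyA", "keyB", 100) yields two entries: keyA with power 0 and keyB with power 100. The total voting power of the set is the same before and after the switch.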

Create validator tx

When a validator is created, it is identified by the operator address, which can be treated as the ID of the validator.
Cosmos SDK already enforces that each operator may run only one validator, so this ID is already unique.

When a validator is created, its public key is passed as an independent field, meaning we can create many HA nodes,
each using a different public key.

func (k msgServer) CreateValidator(ctx context.Context, msg *types.MsgCreateValidator) (*types.MsgCreateValidatorResponse, error)

There is a check verifying that the public key is not used by any other validator. I must do the same to check that a public key
is unique across all the HA nodes.
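
A minimal sketch of such a uniqueness check, assuming an in-memory index from public key to operator address. NodeRegistry and its methods are hypothetical names used only for illustration; a real implementation would use the staking module's store instead of a map.

```go
package main

import "fmt"

// NodeRegistry is a hypothetical index of every HA node public key
// registered on chain, mapped to the operator that owns it.
type NodeRegistry struct {
	ownerByKey map[string]string
}

// NewNodeRegistry creates an empty registry.
func NewNodeRegistry() *NodeRegistry {
	return &NodeRegistry{ownerByKey: make(map[string]string)}
}

// Register records pubKey as an HA node of operator. It fails if the key is
// already used by any HA node, whether of the same or of another validator.
func (r *NodeRegistry) Register(operator, pubKey string) error {
	if owner, ok := r.ownerByKey[pubKey]; ok {
		return fmt.Errorf("public key %q is already used by operator %q", pubKey, owner)
	}
	r.ownerByKey[pubKey] = operator
	return nil
}
```

The same check would apply both when the validator is created and when further HA nodes are added later.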

Relations to other modules

At the end, the AfterValidatorCreated hook is called. The slashing and distribution modules subscribe to this hook:

distribution: fields related to rewards and commissions are initialized. All the operations there use only the
operator's address, so my changes don't affect the logic. Nothing needs to be modified.
slashing: the consensus address -> public key relation is stored by the hook. That mapping is used only by
the evidence module to check that the consensus address reported in the evidence exists in the system. I believe
this is not needed. In any case, the consensus address is derived from the public key, so it is a 1 to 1
relationship for each HA node, meaning I can simply add the mapping for each node.

To do this, I need two new hooks to maintain the mapping inside the slashing module:

HA node created
HA node deleted

Managing relations between validator and its HA nodes

The proto of the staking module defines a Validator message containing a consensus_pubkey field, which stores the public key of the validator.
It must be converted into a slice to store the public keys of all the HA nodes.
There is a ConsPubKey() (cryptotypes.PubKey, error) method used in a couple of places to get that key. As there is no longer a single
key for the validator, each call must be handled in one of these ways:

remove the call if it is not really needed
return the public keys of all HA nodes
return the single public key for a provided consensus address

It hasn't yet been identified which option fits the purpose of each call.

The Validator message should be extended with an active_consensus_pubkey field indicating which HA node is active at the moment.
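
To make the data model concrete, here is a rough sketch of the extended validator and of the third option (lookup by consensus address). The struct is a plain Go stand-in for the proto message, the method names are hypothetical, and ConsAddress assumes ed25519-style derivation (the first 20 bytes of the SHA-256 hash of the key bytes, hex-encoded here for simplicity).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
)

// Validator is a simplified stand-in for the staking Validator message,
// extended with one public key per HA node and a marker for the active one.
type Validator struct {
	OperatorAddress       string
	ConsensusPubKeys      [][]byte // one entry per HA node
	ActiveConsensusPubKey []byte   // key of the HA node currently signing
}

// ConsAddress derives a consensus address from raw key bytes.
// Assumption: first 20 bytes of SHA-256, as for ed25519 consensus keys.
func ConsAddress(pubKey []byte) string {
	sum := sha256.Sum256(pubKey)
	return hex.EncodeToString(sum[:20])
}

// ConsPubKeyFor returns the public key of the HA node whose consensus
// address equals consAddr, or an error if the validator has no such node.
func (v Validator) ConsPubKeyFor(consAddr string) ([]byte, error) {
	for _, pk := range v.ConsensusPubKeys {
		if ConsAddress(pk) == consAddr {
			return pk, nil
		}
	}
	return nil, errors.New("no HA node with this consensus address")
}
```

Because each consensus address maps to exactly one public key, a caller that starts from evidence (which carries a consensus address) can still resolve a single key even though the validator now holds several.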

There is a ValidatorSigningInfo map, keyed by consensus address, holding some metrics and information.
Apart from the consensus address itself, that structure contains fields related to the validator as a whole (not a particular HA node).
The Address field is not used anywhere, so it could perhaps simply be removed. The operator's address should then be used
as the key in that map, because it is the value that uniquely identifies the validator, not the consensus address.

HA node states and active node switching

When a validator is created, the provided public key constitutes the first HA node. This node is automatically set to the active state.
There are 3 possible states for an HA node:

active - this HA node signs and proposes blocks - only one HA node per validator may be in this state
enabled - an HA node in this state does not sign anything until it is set to active
disabled - an HA node in this state does not sign anything and cannot be set to active - it must be set to enabled first

The difference between enabled and disabled is that the operator may grant someone else (using authz) permission
to change the active HA node (move it from enabled to active), while still being able to decide that
some HA nodes (the disabled ones) cannot be activated, e.g. because the servers are under maintenance or intentionally turned off.

This means that a hypothetical heartbeat application might exist, monitoring the status of the servers and switching the active
HA node automatically if the current one is dead. The application should use its own private key (not the one belonging to the operator),
and that private key should be permitted (with authz) to broadcast a transaction selecting the active HA node from the set of enabled ones.
At the same time, this private key should not be allowed to enable a disabled HA node.
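
The state rules above can be sketched as a small state machine. All names here are hypothetical, the authz check is reduced to a byOperator flag, and the sketch assumes that the previously active node falls back to enabled on a switch, which the proposal does not specify.

```go
package main

import "errors"

// NodeState is the state of a single HA node.
type NodeState int

const (
	Disabled NodeState = iota // cannot sign and cannot be activated directly
	Enabled                   // not signing, but eligible to become active
	Active                    // the only node signing and proposing blocks
)

// HAValidator tracks the state of each HA node, keyed by public key.
type HAValidator struct {
	Nodes  map[string]NodeState
	Active string // public key of the active HA node, empty if none
}

// SetActive switches the active HA node. Both the operator and an authz
// grantee (e.g. a heartbeat application) may call it, but the target
// must already be enabled.
func (v *HAValidator) SetActive(pubKey string) error {
	state, ok := v.Nodes[pubKey]
	if !ok {
		return errors.New("unknown HA node")
	}
	if state == Active {
		return nil // already active, nothing to do
	}
	if state != Enabled {
		return errors.New("HA node is disabled and must be enabled first")
	}
	if v.Active != "" {
		v.Nodes[v.Active] = Enabled // assumption: old node falls back to enabled
	}
	v.Nodes[pubKey] = Active
	v.Active = pubKey
	return nil
}

// Enable moves a disabled HA node to enabled. Only the operator may do this.
func (v *HAValidator) Enable(pubKey string, byOperator bool) error {
	if !byOperator {
		return errors.New("only the operator may enable an HA node")
	}
	state, ok := v.Nodes[pubKey]
	if !ok {
		return errors.New("unknown HA node")
	}
	if state == Disabled {
		v.Nodes[pubKey] = Enabled
	}
	return nil
}
```

With this split, a compromised or buggy heartbeat key can at worst move signing between nodes the operator has already approved. It can never bring a node that is under maintenance back into rotation.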

New transactions

New transactions need to be added to the staking module for:

adding HA nodes
deleting HA nodes
enabling HA node
disabling HA node
selecting active HA node

The CreateValidator tx must be modified accordingly to immediately create and activate the first HA node for the validator.
It looks like the structure of the message does not need to be changed.

New queries

querying statuses of HA nodes of the validator
querying the active HA node

To do in next steps

The next step after implementing this proposal would be to add an option for the chain to switch the active HA node automatically
if the current active one misses its opportunity to propose a block. This failover mechanism would eliminate the need
for the heartbeat application described above, because its role would be taken over by the chain itself.
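
One possible selection rule for such on-chain failover is sketched below. The proposal does not define how the next node would be chosen, so the round-robin order, the HANode type and the NextActive function are all assumptions made for illustration only.

```go
package main

// HANode is a hypothetical view of one HA node during failover.
type HANode struct {
	PubKey  string
	Enabled bool // true if the node may be activated
}

// NextActive picks the node to activate after the node at index active
// missed its proposal. It scans the other nodes in round-robin order and
// returns the first enabled one. The boolean is false when no other node
// is eligible, in which case the active node stays unchanged.
func NextActive(nodes []HANode, active int) (string, bool) {
	for i := 1; i < len(nodes); i++ {
		candidate := nodes[(active+i)%len(nodes)]
		if candidate.Enabled {
			return candidate.PubKey, true
		}
	}
	return "", false
}
```

The rule must be deterministic, since every node in the network has to compute the same new active key from the same chain state.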

alexanderbez · 2023-07-29T17:20:18Z

https://github.com/strangelove-ventures/horcrux

5 replies

outofforest Jul 31, 2023
Author

Is it the preferred solution by the Cosmos team over what I proposed?

alexanderbez Jul 31, 2023

Ahhh, so I posted this mainly to bring awareness to a solution in case you weren't familiar with it. I wouldn't say we, the Cosmos SDK team, have any sort of preference. Typically, this decision is left to the node operators, based on what they feel safe with and what has the best tradeoffs.

Now, if a proposal requires changes to the SDK, then that's something we can consider for sure. This is something we should bring up in the SDK community call to discuss.

I haven't heard of issues from any validators in terms of fault tolerance, so I presume whatever they're using today works well for them.

outofforest Jul 31, 2023
Author

On our network (started in March 2023) we have already experienced two validators being tombstoned due to issues with private key management. I haven't found a way to get a report of tombstoned validators on mintscan, but I believe that such situations have happened on many networks.

alexanderbez Jul 31, 2023

Yes, jailing due to liveness happens. Equivocation happens too, albeit much less often. In any case, most large-scale and/or sophisticated validators are running reliable solutions today, i.e. either horcrux, TMKS, or something proprietary.

outofforest Aug 1, 2023
Author

Yep, and it would be cool to solve the problem at the source, in Cosmos SDK. I looked at Horcrux, and they report that with it signing is ~50 times slower on average.
