-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable local snapshot restore (backport: #733) #728
Conversation
Sync state in Handshake via RPC CV Cleanup dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the initial pass at this solution, very grateful for the work here!
I left some questions / suggestions. I would be glad to fix myself some of the concerns I raised directly in the dev branch, if that's ok with you @yihuang and @chillyvee; but will of course give you priority to make commits on the branch.
if err := doHandshake(stateStore, state, blockStore, genDoc, eventBus, proxyApp, consensusLogger); err != nil { | ||
|
||
// RPC recovery is required if blockStore Height is 0 | ||
if blockStore.Height() == 0 || !stateSync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we also check that state.LastBlockHeight == 0
in addition to blockStore.Height() == 0
?
Something like:
if blockStore.Height() == 0 || !stateSync { | |
if (blockStore.Height() == 0 && state.LastBlockHeight == 0) || !stateSync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is worth discussing since there are gaps in knowledge that need to be filled. There are three cases where the comet height is zero.
- Actually genesis
- Statesync completely over the network as we use today
- LocalStatesync where the application.db is already restored, but comet is at 0
One of the cosmos-sdk to comet integration challenges for local state sync is communicating the height of appstate/application.db
In the initial implementation, an extra applicationd.db height parameter was passed from cosmos-sdk to comet. However a previous code review tried to eliminate that extra parameter and simply check if comet height was 0, and start RPC statesync.
However that actually causes some issues because a normal network statesync also has comet height == 0, but application.db is also 0.
We need an agreeable way to determine the height of application.db from cosmos-sdk. Since a previous code review determined a parameter was not desirable, is there a better way to query the application height? Is there an object we should look at, or make an ABCI call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the detailed notes and history here, I wasn't aware of the subtlety here!
Is there an object we should look at, or make an ABCI call?
An ABCI call (info
) would indeed provide the app height I believe. But not sure that's the best way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally it was added as a variable to configure when cosmos-sdk created the Node{}, but it was asked to be removed during code-review, probably to avoid creating a new dependency between cosmos-sdk and comet
Now that we know the removal causes an issue, it's up to us to pick a new way to compare the heights of comet vs cosmos-sdk app.
Is there general guidance on how to communicate variables between comet and cosmos-sdk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for context, a valid CometBFT state requires:
blockStore.Height() - 1 <= state.LastBlockHeight <= blockStore.Height()
So, if the commit of block at height blockStore.Height()
has been completed, the state height should be the same. If it hasn't, as the node crashed in the while, the state height should be 1 unit below.
Having said so, if the state of CometBFT is valid and blockStore.Height() == 0
then state.LastBlockHeight == 0
is mandatory.
// Optimistically build new state, so we don't discover any light client failures at the end. | ||
state, err = h.stateProvider.State(pctx, appBlockHeight) | ||
if err != nil { | ||
h.logger.Error("failed to fetch and verify tendermint state", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should we use "comet" instead of "tendermint" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
if err == light.ErrNoWitnesses { | ||
return sm.State{}, nil, err | ||
} | ||
return sm.State{}, nil, errors.New("snapshot was rejected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be more specific in this error message? It seems to me that what happened here was that Comet's light client verified the app hash present in the local snapshot of app.db
, and this app hash was incorrect. Is this accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referencing func (s *lightClientStateProvider) State(ctx context.Context, height uint64) (sm.State, error) {
The error conditions are failure to VerifyLightBlockAtHeight , RPC Client Creation Failure (Bad URL), and failure to fetch consensus parameters for the height.
What do you think of wrapping the error to add detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good idea. Let's wrap the err
to save the detail and return it to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not snapshot involved here.
The error should say that the application state was rejected, instead.
if err == light.ErrNoWitnesses { | ||
return sm.State{}, nil, err | ||
} | ||
return sm.State{}, nil, errors.New("snapshot was rejected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: Would it be good to have more explicit error message here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of wrapping the error to add detail?
Haven't looked into detail, but find an unexpected behavior in test: when the app height is 0, statesync enabled, it still try to init with genesis.json, which don't happen with normal version. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
if err == light.ErrNoWitnesses { | ||
return sm.State{}, nil, err | ||
} | ||
return sm.State{}, nil, errors.New("snapshot was rejected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not snapshot involved here.
The error should say that the application state was rejected, instead.
if err := h.stateStore.Save(state); err != nil { | ||
return sm.State{}, nil, err | ||
} | ||
if err := h.store.(*store.BlockStore).SaveSeenCommit(int64(appBlockHeight), commit); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the original implementation, which is Node.startStateSync
(I don't know the branch to refer, so no link, sorry), you should use state.LastBlockHeight
here for compatibility. It should be equal to appBlockHeight
, except for uint/int stuff, but for coherence it is better to follow the same approach.
h.logger.Info("localSync resulting state.Version.Consensus.Block", "state", state.Version.Consensus.Block) | ||
|
||
// Assume Heights Restored, Update Heights for edge cases and constraints on the storeBlockHeight and storeBlockBase. | ||
storeBlockHeight = appBlockHeight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that these values should be read from the stores, not blindly updated here.
var err error | ||
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) | ||
defer cancel() | ||
stateProvider, err := statesync.NewLightClientStateProvider( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are not using this "local sync", this would error, right?
We have to still support the case in which the node is simply recovering, and will restore all state from disk.
@@ -989,8 +1007,9 @@ func (n *Node) OnStart() error { | |||
return fmt.Errorf("could not dial peers from persistent_peers field: %w", err) | |||
} | |||
|
|||
// Run state sync | |||
// Statesync reqeusting all data from P2P |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Statesync reqeusting all data from P2P | |
// State sync fetching snapshots from peers |
if n.stateSync { | ||
// If state and blockStore.Height are both at the same height, skip the P2P Statesync and immediately enter consensus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't understand this comment.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Are we proceeding with this approach/solution attempt? If so, we should add the |
I think no need to pursue |
Closing, a different solution was chosen. |
Which other solution was chosen? Just had a break in schedule to work on the Juno v14 local state sync demo today. |
|
Thank you, will review |
nice code @yihuang :) |
Closes: #29
Original PR: tendermint/tendermint#9541
Sync state in Handshake via RPC
CV Cleanup dead code
PR checklist
.changelog
(we use unclog to manage our changelog)docs/
orspec/
) and code comments