
[bug] Watchtower fails to fall back to archive EC in generateRewardsTree when primary EC isn't useful #397

Closed
jshufro opened this issue Oct 23, 2023 · 0 comments · Fixed by #453


jshufro commented Oct 23, 2023

Reported by @Kweiss on Discord:

rocketpool_watchtower  | 2023/10/23 15:41:19 [Interval 14 Tree] Starting generation of Merkle rewards tree for interval 14.
rocketpool_watchtower  | 2023/10/23 15:41:19 [Interval 14 Tree] Found snapshot event: Beacon block 7421279, execution block 18232197
rocketpool_watchtower  | 2023/10/23 15:41:19 [Interval 14 Tree] Error getting state for block 18232197: missing trie node e134d1cdbed48ce32c7bd3466af8b787ecb4a90da03c1a908c5418f5d4d16fd6 (path ) state 0xe134d1cdbed48ce32c7bd3466af8b787ecb4a90da03c1a908c5418f5d4d16fd6 is not available
rocketpool_watchtower  | 2023/10/23 15:41:19 [Interval 14 Tree] Primary EC cannot retrieve state for historical block 18232197, using archive EC [https://mainnet.infura.io/v3/ABC]
rocketpool_watchtower  | 2023/10/23 15:41:20 Getting network state for EL block 18232197, Beacon slot 7421279
rocketpool_watchtower  | 2023/10/23 15:41:20 [Interval 14 Tree] error getting state for beacon slot 7421279: error getting network contracts: error executing multicall for contract retrieval: missing trie node e134d1cdbed48ce32c7bd3466af8b787ecb4a90da03c1a908c5418f5d4d16fd6 (path ) state 0xe134d1cdbed48ce32c7bd3466af8b787ecb4a90da03c1a908c5418f5d4d16fd6 is not available
rocketpool_watchtower  | 2023/10/23 15:41:20 *** Rewards tree generation failed. ***

The first error is uninteresting, in that we know the local EC is not an archive EC, and the state being queried is 25 days old.

However, kbw is using Infura as an archive EC and has configured it correctly in the TUI.

Inspecting generateRewardsTree shows that, despite the archive EC passing its validity checks, subsequent validation fails to use the new client.

Specifically:

address, err := client.RocketStorage.GetAddress(opts, crypto.Keccak256Hash([]byte("contract.addressrocketTokenRETH")))
if err != nil {
    errMessage := err.Error()
    t.log.Printlnf("%s Error getting state for block %d: %s", generationPrefix, elBlockHeader.Number.Uint64(), errMessage)
    if strings.Contains(errMessage, "missing trie node") || // Geth
        strings.Contains(errMessage, "No state available for block") || // Nethermind
        strings.Contains(errMessage, "Internal error") { // Besu
        // The state was missing so fall back to the archive node
        archiveEcUrl := t.cfg.Smartnode.ArchiveECUrl.Value.(string)
        if archiveEcUrl != "" {
            t.log.Printlnf("%s Primary EC cannot retrieve state for historical block %d, using archive EC [%s]", generationPrefix, elBlockHeader.Number.Uint64(), archiveEcUrl)
            ec, err := ethclient.Dial(archiveEcUrl)
            if err != nil {
                t.handleError(fmt.Errorf("Error connecting to archive EC: %w", err))
                return
            }
            client, err = rocketpool.NewRocketPool(ec, common.HexToAddress(t.cfg.Smartnode.GetStorageAddress()))
            if err != nil {
                t.handleError(fmt.Errorf("%s Error creating Rocket Pool client connected to archive EC: %w", err))
                return
            }
            // Get the rETH address from the archive EC
            address, err = client.RocketStorage.GetAddress(opts, crypto.Keccak256Hash([]byte("contract.addressrocketTokenRETH")))
            if err != nil {
                t.handleError(fmt.Errorf("%s Error verifying rETH address with Archive EC: %w", err))
                return
            }
        } else {
            // No archive node specified
            t.handleError(fmt.Errorf("***ERROR*** Primary EC cannot retrieve state for historical block %d and the Archive EC is not specified.", elBlockHeader.Number.Uint64()))
            return
        }
    }
}

// Sanity check the rETH address to make sure the client is working right
if address != t.cfg.Smartnode.GetRethAddress() {
    t.handleError(fmt.Errorf("***ERROR*** Your Primary EC provided %s as the rETH address, but it should have been %s!", address.Hex(), t.cfg.Smartnode.GetRethAddress().Hex()))
    return
}

// Get the state for the target slot
state, err := t.m.GetStateForSlot(rewardsEvent.ConsensusBlock.Uint64())
if err != nil {
    t.handleError(fmt.Errorf("%s error getting state for beacon slot %d: %w", generationPrefix, rewardsEvent.ConsensusBlock.Uint64(), err))
    return
}

The first error in the logs, which again is uninteresting, comes from

t.log.Printlnf("%s Error getting state for block %d: %s", generationPrefix, elBlockHeader.Number.Uint64(), errMessage)

We correctly identify that it is a missing trie node error and detect the presence of an archive EC, since we do not see this error:

t.handleError(fmt.Errorf("***ERROR*** Primary EC cannot retrieve state for historical block %d and the Archive EC is not specified.", elBlockHeader.Number.Uint64()))

However, the inner scope that 'replaces' the client instance with a new one (the archive client) only overwrites outer-scoped variables address and client.

Therefore, when we check the rETH address on L203, we succeed, but then this fails:

state, err := t.m.GetStateForSlot(rewardsEvent.ConsensusBlock.Uint64())
if err != nil {
    t.handleError(fmt.Errorf("%s error getting state for beacon slot %d: %w", generationPrefix, rewardsEvent.ConsensusBlock.Uint64(), err))
    return
}

t.m.GetStateForSlot(...) has no references to either of the updated variables, so it is clear that this validation will always fail when the original EC doesn't have the appropriate state.
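To make the scoping issue concrete, here is a minimal standalone sketch (hypothetical names, not the actual Smartnode types) of the same pattern: the state manager held by the struct keeps its own reference to the client it was built with, so reassigning the function-local variable to an archive client changes nothing that the manager sees.

package main

import "fmt"

// Hypothetical stand-ins for the real client and state manager types.
type ec struct{ name string }

type stateManager struct{ client *ec } // holds its own reference to a client

type task struct {
    client *ec
    m      *stateManager
}

func (t *task) run() {
    client := t.client // local copy of the pointer

    // The primary client cannot serve the historical state, so we "replace" it...
    client = &ec{name: "archive"}
    _ = client // ...but only the local variable changed

    // t.m was built with the primary client and still uses it.
    fmt.Println("state manager uses:", t.m.client.name) // prints "primary"
}

func main() {
    primary := &ec{name: "primary"}
    t := &task{client: primary, m: &stateManager{client: primary}}
    t.run()
}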

t.m is a state.NetworkStateManager passed in when newGenerateRewardsTree(...) is called. It is created here:

m, err := state.NewNetworkStateManager(rp, cfg, rp.Client, bc, &updateLog)
if err != nil {
    return err
}

and the client that it uses is the result of the call here:

rp, err := services.GetRocketPool(c)

ultimately calling
func getEthClient(c *cli.Context, cfg *config.RocketPoolConfig) (*ExecutionClientManager, error) {
    var err error
    initECManager.Do(func() {
        // Create a new client manager
        ecManager, err = NewExecutionClientManager(cfg)
        if err == nil {
            // Check if the manager should ignore sync checks and/or default to using the fallback (used by the API container when driven by the CLI)
            if c.GlobalBool("ignore-sync-check") {
                ecManager.ignoreSyncCheck = true
            }
            if c.GlobalBool("force-fallbacks") {
                ecManager.primaryReady = false
            }
        }
    })
    return ecManager, err
}

which is the 'generic' "I need an EC, please check the primary and the fallback" function, but it does not account for the watchtower's archive fallback.

A very naive solution would be to simply replace t.m in generateRewardsTree, e.g.:

diff --git a/rocketpool/watchtower/generate-rewards-tree.go b/rocketpool/watchtower/generate-rewards-tree.go
index e0207089..0f87db33 100644
--- a/rocketpool/watchtower/generate-rewards-tree.go
+++ b/rocketpool/watchtower/generate-rewards-tree.go
@@ -200,6 +200,13 @@ func (t *generateRewardsTree) generateRewardsTree(index uint64) {
                                        t.handleError(fmt.Errorf("%s Error verifying rETH address with Archive EC: %w", err))
                                        return
                                }
+
+                               // Make sure we use the new client
+                               t.m, err = state.NewNetworkStateManager(client, t.cfg, ec, t.bc, &t.log)
+                               if err != nil {
+                                       t.handleError(fmt.Errorf("%s Error creating new NetworkStateManager with Archive EC: %w", err))
+                                       return
+                               }
                        } else {
                                // No archive node specified
                                t.handleError(fmt.Errorf("***ERROR*** Primary EC cannot retrieve state for historical block %d and the Archive EC is not specified.", elBlockHeader.Number.Uint64()))

BUT I really don't like the side-effects here. The function purity is damaged.

Instead, I'd suggest that t.m need not exist; it appears to be used only in this function. We should remove it from the struct and have generateRewardsTree(...) unconditionally create a NetworkStateManager according to which client is available and error-free:

diff --git a/rocketpool/watchtower/generate-rewards-tree.go b/rocketpool/watchtower/generate-rewards-tree.go
index e0207089..6485e0e9 100644
--- a/rocketpool/watchtower/generate-rewards-tree.go
+++ b/rocketpool/watchtower/generate-rewards-tree.go
@@ -38,11 +38,10 @@ type generateRewardsTree struct {
        bc        beacon.Client
        lock      *sync.Mutex
        isRunning bool
-       m         *state.NetworkStateManager
 }
 
 // Create generate rewards Merkle Tree task
-func newGenerateRewardsTree(c *cli.Context, logger log.ColorLogger, errorLogger log.ColorLogger, m *state.NetworkStateManager) (*generateRewardsTree, error) {
+func newGenerateRewardsTree(c *cli.Context, logger log.ColorLogger, errorLogger log.ColorLogger) (*generateRewardsTree, error) {
 
        // Get services
        cfg, err := services.GetConfig(c)
@@ -82,7 +81,6 @@ func newGenerateRewardsTree(c *cli.Context, logger log.ColorLogger, errorLogger
                rp:        rp,
                lock:      lock,
                isRunning: false,
-               m:         m,
        }
 
        return generator, nil
@@ -166,13 +164,24 @@ func (t *generateRewardsTree) generateRewardsTree(index uint64) {
                return
        }
 
+       var stateManager *state.NetworkStateManager
+
        // Try getting the rETH address as a canary to see if the block is available
        client := t.rp
        opts := &bind.CallOpts{
                BlockNumber: elBlockHeader.Number,
        }
        address, err := client.RocketStorage.GetAddress(opts, crypto.Keccak256Hash([]byte("contract.addressrocketTokenRETH")))

-       if err != nil {
+       if err == nil {
+
+               // Create the state manager using the primary or fallback (not necessarily archive) EC
+               stateManager, err = state.NewNetworkStateManager(client, t.cfg, t.rp.Client, t.bc, &t.log)
+               if err != nil {
+                       t.handleError(fmt.Errorf("%s Error creating new NetworkStateManager with primary or fallback EC: %w", err))
+                       return
+               }
+       } else {
+               // Check if an Archive EC is provided, and if using it would potentially resolve the error
                errMessage := err.Error()
                t.log.Printlnf("%s Error getting state for block %d: %s", generationPrefix, elBlockHeader.Number.Uint64(), errMessage)
                if strings.Contains(errMessage, "missing trie node") || // Geth
@@ -200,6 +209,13 @@ func (t *generateRewardsTree) generateRewardsTree(index uint64) {
                                        t.handleError(fmt.Errorf("%s Error verifying rETH address with Archive EC: %w", err))
                                        return
                                }
+
+                               // Create the state manager with the archive EC
+                               stateManager, err = state.NewNetworkStateManager(client, t.cfg, ec, t.bc, &t.log)
+                               if err != nil {
+                                       t.handleError(fmt.Errorf("%s Error creating new NetworkStateManager with Archive EC: %w", err))
+                                       return
+                               }
                        } else {
                                // No archive node specified
                                t.handleError(fmt.Errorf("***ERROR*** Primary EC cannot retrieve state for historical block %d and the Archive EC is not specified.", elBlockHeader.Number.Uint64()))
@@ -216,7 +232,7 @@ func (t *generateRewardsTree) generateRewardsTree(index uint64) {
        }
 
        // Get the state for the target slot
-       state, err := t.m.GetStateForSlot(rewardsEvent.ConsensusBlock.Uint64())
+       state, err := stateManager.GetStateForSlot(rewardsEvent.ConsensusBlock.Uint64())
        if err != nil {
                t.handleError(fmt.Errorf("%s error getting state for beacon slot %d: %w", generationPrefix, rewardsEvent.ConsensusBlock.Uint64(), err))
                return
diff --git a/rocketpool/watchtower/watchtower.go b/rocketpool/watchtower/watchtower.go
index 9d05f1f4..acf94b80 100644
--- a/rocketpool/watchtower/watchtower.go
+++ b/rocketpool/watchtower/watchtower.go
@@ -171,7 +171,7 @@ func run(c *cli.Context) error {
        if err != nil {
                return fmt.Errorf("error during penalties check: %w", err)
        }*/
-       generateRewardsTree, err := newGenerateRewardsTree(c, log.NewColorLogger(SubmitRewardsTreeColor), errorLog, m)
+       generateRewardsTree, err := newGenerateRewardsTree(c, log.NewColorLogger(SubmitRewardsTreeColor), errorLog)
        if err != nil {
                return fmt.Errorf("error during manual tree generation check: %w", err)
        }

which cleanly eliminates the reason this bug appeared in the first place: we persisted too much configuration in pseudo-global variables instead of passing local variables through the call chain.
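As a standalone illustration of that principle (hypothetical names only, nothing from the Smartnode codebase), the change amounts to moving from a manager stored as a long-lived struct field to one constructed inside the function that just validated the client it should use:

package example

// execClient stands in for an execution client connection.
type execClient interface{ Name() string }

type manager struct{ client execClient }

func newManager(client execClient) *manager { return &manager{client: client} }

// Before: the manager is a field, bound to whichever client existed when the
// task was constructed, regardless of which client is usable later.
type taskBefore struct {
    m *manager
}

// After: the manager is built from the client that was actually selected
// (primary, fallback, or archive) for this particular generation run.
func generate(selected execClient) {
    m := newManager(selected)
    _ = m // use m for state retrieval
}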
