Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backward syncing issues when restarting Nimbus #5411

Closed
siladu opened this issue May 1, 2023 · 21 comments
Closed

Backward syncing issues when restarting Nimbus #5411

siladu opened this issue May 1, 2023 · 21 comments
Labels
bug Something isn't working mainnet P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumtion) TeamChupa GH issues worked on by Chupacabara Team TeamGroot GH issues worked on by Groot Team

Comments

@siladu
Copy link
Contributor

siladu commented May 1, 2023

While testing out an unrelated fix with the unstable branch of nimbus, I observed at least three separate backward sync issues on our sepolia validating node. Note, I've only seen this issue on Sepolia and it was probably triggered by nimbus not getting peers before falling > 64 blocks behind the beacon chain which switches it into "far wall" sync mode, rather than lock-step sync.

It's worth noting that when nimbus is restarted, it requests a lot of eth_getLogs until it refills its deposit cache, which may be having some impact. When I enabled DEBUG during the backward sync issues, the eth_getLogs calls were ongoing so it seems related.

There's a thread discussing it with tersec from nimbus, starting about here https://discord.com/channels/595666850260713488/1093963476948496455/1101258732190572676

Observations

(1) Node offline during upgrade so I'd expect a small backwards sync (or lock-step sync depending on client). During this expected backwards sync though, we got lots of messages like this...

{"@timestamp":"2023-04-27T00:11:12,580","level":"INFO","thread":"vert.x-worker-thread-0","class":"BackwardSyncContext","message":"Starting a new backward sync session","throwable":""}
{"@timestamp":"2023-04-27T00:11:12,903","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncStep","message":"Backward sync phase 1 of 2 completed, downloaded a total of 200 headers. Peers: 14","throwable":""}
{"@timestamp":"2023-04-27T00:11:14,769","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2, 1.45% completed, imported 1 blocks of at least 69 (current head 3368103, target head 3368171). Peers: 12","throwable":""}
{"@timestamp":"2023-04-27T00:11:17,887","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2 completed, imported a total of 66 blocks. Peers: 14","throwable":""}
{"@timestamp":"2023-04-27T00:11:17,916","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2 completed, imported a total of 67 blocks. Peers: 14","throwable":""}
{"@timestamp":"2023-04-27T00:11:17,953","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2 completed, imported a total of 68 blocks. Peers: 12","throwable":""}

Need to confirm, but I believe teku will give us the latest hash they have once they are in sync and we get a consistent set of % complete logs for phase 2 of 2 with one completed message at the end. Something about the pattern of requests nimbus is sending is causing multiple completed logs.

(2) Besu is "in sync" as evidenced by being able to propose but then is frequently being switched to backwards sync mode presumably by being given a hash that we don't have a parent for.

(3) The reorg logs, which suggests it's being thrashed between forks potentially...

(2) and (3) both could be same root cause, or maybe all 3 are same cause.

relevant logs for (2) and (3)...

{"@timestamp":"2023-04-27T00:47:24,089","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineGetPayload","message":"Fetch block proposal by identifier: 0x0068b3ffaea1c01c, hash: 0xd1fb6c37bf03b2ed56bfd07699e9097119f143b600f4a33e4010edee3261a75a, number: 3368337, coinbase: 0x3826539cbd8d68dcf119e80b994557b4278cec9f, transaction count: 1","throwable":""}
{"@timestamp":"2023-04-27T00:47:24,107","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"Imported #3,368,337 / 1 tx / 16 ws / base fee 7 wei / 82,582 (0.3%) gas / (0xd1fb6c37bf03b2ed56bfd07699e9097119f143b600f4a33e4010edee3261a75a) in 0.003s. Peers: 13","throwable":""}
{"@timestamp":"2023-04-27T00:47:24,241","level":"WARN","thread":"vert.x-worker-thread-0","class":"DefaultBlockchain","message":"Chain Reorganization +192 new / -0 old
       Old - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
       New - hash: 0xd1fb6c37bf03b2ed56bfd07699e9097119f143b600f4a33e4010edee3261a75a, height: 3368337
  Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:47:24,670","level":"WARN","thread":"nioEventLoopGroup-3-9","class":"DefaultBlockchain","message":"Chain Reorganization +0 new / -192 old
       Old - hash: 0xd1fb6c37bf03b2ed56bfd07699e9097119f143b600f4a33e4010edee3261a75a, height: 3368337
       New - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
  Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:47:26,555","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2 completed, imported a total of 116 blocks. Peers: 12","throwable":""}
{"@timestamp":"2023-04-27T00:47:27,713","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"BackwardSyncContext","message":"Backward sync phase 2 of 2, -53.45% completed, imported -62 blocks of at least 116 (current head 3368154, target head 3368332). Peers: 13","throwable":""}

Full logs

besu-23.1.2.log
nimbus-ea8ef5-2.log

Nimbus versions differing behaviour

  • The unstable branch I tested was what produced these logs
  • v23.4.0 seems to show similar behaviour
  • v23.3.2 didn't seem to have as many problems

Update 06/05/2023

Further testing with targeted besu debug logging enabled, using the following log RPC:

curl --retry-delay 50 --retry-connrefused --location \
     --request POST 'http://127.0.0.1:8545' \
     --header 'Content-Type: application/json' \
     --data-raw '{
    "jsonrpc": "2.0",
    "method": "admin_changeLogLevel",
    "params": [
        "DEBUG", [
          "org.hyperledger.besu.ethereum.eth.sync.backwardsync",
          "org.hyperledger.besu.ethereum.api.jsonrpc.internal.methods.engine.AbstractEngineNewPayload",
          "org.hyperledger.besu.ethereum.api.jsonrpc.internal.methods.engine.AbstractEngineForkchoiceUpdated"
        ]
    ],
    "id": 1
}'

Besu was version 23.4.0 for all three nimbus versions.
Ran the same test on the three different nimbus versions on non-validating sepolia node, the test was just to restart nimbus:

Logs attached here: https://we.tl/t-RZMmscEG1A

In all cases, there was less noise than on the validating node which suggests validating (with lots of validators) may exacerbate the problems.

@siladu siladu added bug Something isn't working mainnet labels May 1, 2023
@non-fungible-nelson non-fungible-nelson added P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumtion) TeamGroot GH issues worked on by Groot Team labels May 1, 2023
@SeaMonkey82
Copy link

SeaMonkey82 commented May 2, 2023

Some details:

If Nimbus is restarted while Besu is already in the process of syncing, Besu goes into backward sync, which never completes and returns negative values for percentage complete while Nimbus reports receiving blocks from an unviable fork. Stopping both clients, starting Besu, and allowing it to reach the point where it is waiting for a connection from the CL prior to starting Nimbus allows sync to resume normally.

Here's what Besu returns when it is stopped in the middle of the failed backward sync:

2023-05-02 08:45:19.497-04:00 | nioEventLoopGroup-3-5 | INFO  | BackwardSyncContext | Backward sync phase 2 of 2, -60.40% completed, imported -90 blocks of at least 149 (current head 8928631, target head 8928870). Peers: 4
^C2023-05-02 08:45:21.952-04:00 | BesuCommand-Shutdown-Hook | INFO  | NetworkRunner | Stopping Network.
2023-05-02 08:45:21.955-04:00 | BesuCommand-Shutdown-Hook | INFO  | EthProtocolManager | Stopping eth Subprotocol.
2023-05-02 08:45:21.956-04:00 | BesuCommand-Shutdown-Hook | INFO  | EthProtocolManager | eth Subprotocol stopped.
2023-05-02 08:45:21.956-04:00 | BesuCommand-Shutdown-Hook | INFO  | NetworkRunner | Network stopped.
2023-05-02 08:45:21.956-04:00 | BesuCommand-Shutdown-Hook | INFO  | AutoTransactionLogBloomCachingService | Shutting down Auto transaction logs caching service.
2023-05-02 08:45:21.956-04:00 | BesuCommand-Shutdown-Hook | INFO  | NatService | No NAT environment detected so no service could be stopped
2023-05-02 08:45:21.956-04:00 | nioEventLoopGroup-3-5 | ERROR | RocksDBColumnarKeyValueStorage | Attempting to use a closed RocksDbKeyValueStorage
2023-05-02 08:45:21.956-04:00 | ForkJoinPool.commonPool-worker-25 | ERROR | RocksDBColumnarKeyValueStorage | Attempting to use a closed RocksDbKeyValueStorage
2023-05-02 08:45:21.957-04:00 | nioEventLoopGroup-3-5 | ERROR | MainnetTransactionProcessor | Critical Exception Processing Transaction
java.lang.IllegalStateException: Storage has been closed
        at org.hyperledger.besu.plugin.services.storage.rocksdb.segmented.RocksDBColumnarKeyValueStorage.throwIfClosed(RocksDBColumnarKeyValueStorage.java:336)
        at org.hyperledger.besu.plugin.services.storage.rocksdb.segmented.RocksDBColumnarKeyValueStorage.get(RocksDBColumnarKeyValueStorage.java:256)
        at org.hyperledger.besu.plugin.services.storage.rocksdb.segmented.RocksDBColumnarKeyValueStorage.get(RocksDBColumnarKeyValueStorage.java:69)
        at org.hyperledger.besu.services.kvstore.SegmentedKeyValueStorageAdapter.get(SegmentedKeyValueStorageAdapter.java:64)
        at org.hyperledger.besu.ethereum.bonsai.storage.BonsaiWorldStateKeyValueStorage.getStorageValueByStorageSlotKey(BonsaiWorldStateKeyValueStorage.java:257)
        at org.hyperledger.besu.ethereum.bonsai.worldview.BonsaiWorldState.getStorageValueByStorageSlotKey(BonsaiWorldState.java:522)
        at org.hyperledger.besu.ethereum.bonsai.worldview.BonsaiWorldStateUpdateAccumulator.getStorageValueByStorageSlotKey(BonsaiWorldStateUpdateAccumulator.java:422)
        at org.hyperledger.besu.ethereum.bonsai.worldview.BonsaiWorldStateUpdateAccumulator.getStorageValue(BonsaiWorldStateUpdateAccumulator.java:404)
        at org.hyperledger.besu.ethereum.bonsai.BonsaiAccount.getStorageValue(BonsaiAccount.java:211)
        at org.hyperledger.besu.evm.worldstate.UpdateTrackingAccount.getStorageValue(UpdateTrackingAccount.java:231)
        at org.hyperledger.besu.evm.worldstate.UpdateTrackingAccount.getStorageValue(UpdateTrackingAccount.java:231)
        at org.hyperledger.besu.evm.worldstate.UpdateTrackingAccount.getStorageValue(UpdateTrackingAccount.java:231)
        at org.hyperledger.besu.evm.worldstate.UpdateTrackingAccount.getStorageValue(UpdateTrackingAccount.java:231)
        at org.hyperledger.besu.evm.worldstate.UpdateTrackingAccount.getStorageValue(UpdateTrackingAccount.java:231)
        at org.hyperledger.besu.evm.operation.SLoadOperation.execute(SLoadOperation.java:64)
        at org.hyperledger.besu.evm.EVM.runToHalt(EVM.java:265)
        at org.hyperledger.besu.evm.processor.AbstractMessageProcessor.codeExecute(AbstractMessageProcessor.java:174)
        at org.hyperledger.besu.evm.processor.AbstractMessageProcessor.process(AbstractMessageProcessor.java:192)
        at org.hyperledger.besu.ethereum.mainnet.MainnetTransactionProcessor.process(MainnetTransactionProcessor.java:508)
        at org.hyperledger.besu.ethereum.mainnet.MainnetTransactionProcessor.processTransaction(MainnetTransactionProcessor.java:419)
        at org.hyperledger.besu.ethereum.mainnet.AbstractBlockProcessor.processBlock(AbstractBlockProcessor.java:125)
        at org.hyperledger.besu.ethereum.mainnet.BlockProcessor.processBlock(BlockProcessor.java:79)
        at org.hyperledger.besu.ethereum.MainnetBlockValidator.processBlock(MainnetBlockValidator.java:207)
        at org.hyperledger.besu.ethereum.MainnetBlockValidator.validateAndProcessBlock(MainnetBlockValidator.java:135)
        at org.hyperledger.besu.ethereum.MainnetBlockValidator.validateAndProcessBlock(MainnetBlockValidator.java:72)
        at org.hyperledger.besu.ethereum.eth.sync.backwardsync.BackwardSyncContext.saveBlock(BackwardSyncContext.java:303)
        at org.hyperledger.besu.ethereum.eth.sync.backwardsync.ForwardSyncStep.saveBlocks(ForwardSyncStep.java:124)
        at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
        at org.hyperledger.besu.ethereum.eth.manager.task.RetryingGetBlocksFromPeersTask.lambda$executeTaskOnCurrentPeer$0(RetryingGetBlocksFromPeersTask.java:86)
        at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
        at org.hyperledger.besu.ethereum.eth.manager.task.AbstractPeerRequestTask.lambda$executeTask$2(AbstractPeerRequestTask.java:83)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
        at org.hyperledger.besu.ethereum.eth.manager.task.AbstractPeerRequestTask.lambda$handleMessage$3(AbstractPeerRequestTask.java:106)
        at java.base/java.util.Optional.ifPresent(Optional.java:178)
        at org.hyperledger.besu.ethereum.eth.manager.task.AbstractPeerRequestTask.handleMessage(AbstractPeerRequestTask.java:104)
        at org.hyperledger.besu.ethereum.eth.manager.task.AbstractPeerRequestTask.lambda$executeTask$0(AbstractPeerRequestTask.java:67)
        at org.hyperledger.besu.ethereum.eth.manager.RequestManager$ResponseStream.dispatchBufferedResponses(RequestManager.java:211)
        at org.hyperledger.besu.ethereum.eth.manager.RequestManager$ResponseStream.processMessage(RequestManager.java:202)
        at org.hyperledger.besu.ethereum.eth.manager.RequestManager.lambda$dispatchResponse$0(RequestManager.java:82)
        at java.base/java.util.Optional.ifPresentOrElse(Optional.java:196)
        at org.hyperledger.besu.ethereum.eth.manager.RequestManager.dispatchResponse(RequestManager.java:81)
        at org.hyperledger.besu.ethereum.eth.manager.EthPeer.lambda$dispatch$7(EthPeer.java:411)
        at java.base/java.util.Optional.ifPresentOrElse(Optional.java:196)
        at org.hyperledger.besu.ethereum.eth.manager.EthPeer.dispatch(EthPeer.java:410)
        at org.hyperledger.besu.ethereum.eth.manager.EthPeers.dispatchMessage(EthPeers.java:256)
        at org.hyperledger.besu.ethereum.eth.manager.EthProtocolManager.processMessage(EthProtocolManager.java:320)
        at org.hyperledger.besu.ethereum.p2p.network.NetworkRunner.lambda$setupHandlers$1(NetworkRunner.java:142)
        at org.hyperledger.besu.ethereum.p2p.rlpx.connections.PeerConnectionEvents.lambda$dispatchMessage$1(PeerConnectionEvents.java:64)
        at org.hyperledger.besu.util.Subscribers.lambda$forEach$0(Subscribers.java:131)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at org.hyperledger.besu.util.Subscribers.forEach(Subscribers.java:128)
        at org.hyperledger.besu.ethereum.p2p.rlpx.connections.PeerConnectionEvents.dispatchMessage(PeerConnectionEvents.java:64)
        at org.hyperledger.besu.ethereum.p2p.rlpx.connections.netty.ApiHandler.channelRead0(ApiHandler.java:99)
        at org.hyperledger.besu.ethereum.p2p.rlpx.connections.netty.ApiHandler.channelRead0(ApiHandler.java:33)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:833)
2023-05-02 08:45:21.958-04:00 | nioEventLoopGroup-3-5 | INFO  | AbstractBlockProcessor | Block processing error: transaction invalid Internal Error in Besu - java.lang.IllegalStateException: Storage has been closed. Block 0x93bbbafad4ec34467c662e1d3cd0371febcc9fc60de7498d41fa841c2acfbb5b Transaction 0x791c4a4e4b7154da487d4f3822a2cc834c103b9931667d931e7e118629f805e0
2023-05-02 08:45:21.959-04:00 | nioEventLoopGroup-3-5 | INFO  | MainnetBlockValidator | Invalid block 8928642 (0x93bbbafad4ec34467c662e1d3cd0371febcc9fc60de7498d41fa841c2acfbb5b): Optional[Block processing error: transaction invalid Internal Error in Besu - java.lang.IllegalStateException: Storage has been closed. Block 0x93bbbafad4ec34467c662e1d3cd0371febcc9fc60de7498d41fa841c2acfbb5b Transaction 0x791c4a4e4b7154da487d4f3822a2cc834c103b9931667d931e7e118629f805e0]
2023-05-02 08:45:21.959-04:00 | nioEventLoopGroup-3-5 | ERROR | RocksDBColumnarKeyValueStorage | Attempting to use a closed RocksDbKeyValueStorage
2023-05-02 08:45:21.959-04:00 | nioEventLoopGroup-3-5 | ERROR | RocksDBColumnarKeyValueStorage | Attempting to use a closed RocksDbKeyValueStorage
2023-05-02 08:45:21.979-04:00 | nioEventLoopGroup-3-5 | INFO  | BackwardSyncContext | Current backward sync session failed, it will be restarted

and here is an example of the unviable fork messages that repeat in Nimbus:

NOT 2023-05-02 08:45:19.236-04:00 Received blocks from an unviable fork      topics="syncman" request=5543552:32@350 peer=16U*W8RdP5 direction=forward blockRoot=5d0ae092 blockSlot=5543552 blocks_count=22 blocks_map=xxxx..xxxx.x.x..xx.xxxxxxxx...xx sync_ident=main

@non-fungible-nelson non-fungible-nelson added the TeamChupa GH issues worked on by Chupacabara Team label May 2, 2023
@tersec
Copy link

tersec commented May 6, 2023

while Nimbus reports receiving blocks from an unviable fork

Generally, this is because of incorrect-INVALID results from Besu, such as in the logs you show: https://prater.beaconcha.in/block/8928642 is the block in question. It's perfectly fine and Besu should not have reported it invalid. Once this happens, Nimbus is entirely correct in treating everything past that as INVALID because https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#transitioning-from-valid---invalidated-or-invalidated---valid states

Transitioning from VALID -> INVALIDATED or INVALIDATED -> VALID

These operations are purposefully omitted. It is outside of the scope of the specification since it's only possible with a faulty EE.

Such a scenario requires manual intervention.

No descendent of an INVALID block can ever be VALID. Besu appears to be treating a closed storage system as a reason to return

2023-05-02 08:45:21.956-04:00 | nioEventLoopGroup-3-5 | ERROR | RocksDBColumnarKeyValueStorage | Attempting to use a closed RocksDbKeyValueStorage
2023-05-02 08:45:21.956-04:00 | ForkJoinPool.commonPool-worker-25 | ERROR | RocksDBColumnarKeyValueStorage | Attempting to use a closed RocksDbKeyValueStorage
2023-05-02 08:45:21.957-04:00 | nioEventLoopGroup-3-5 | ERROR | MainnetTransactionProcessor | Critical Exception Processing Transaction
java.lang.IllegalStateException: Storage has been closed

INVALID.

Whether that's the actual cause, or it's something else, INVALID is simply never a valid default return value here. It's not "the EL doesn't know, maybe it's INVALID?". The EL has to be sure the block is invalid.

And, the fallout of that is that it just stalls Nimbus syncing, again, correctly, per spec.

@siladu
Copy link
Contributor Author

siladu commented May 6, 2023

Updated the description with some targeted debug logging from some test runs.

@tersec
Copy link

tersec commented May 7, 2023

While testing out an unrelated fix with the unstable branch of nimbus, I observed at least three separate backward sync issues on our sepolia validating node. Note, I've only seen this issue on Sepolia and it was probably triggered by nimbus not getting peers before falling > 64 blocks behind the beacon chain which switches it into "far wall" sync mode, rather than lock-step sync.

...

besu-23.1.2.log nimbus-ea8ef5-2.log

Nimbus versions differing behaviour

* The [unstable](https://github.com/status-im/nimbus-eth2/tree/unstable) branch I tested was what produced these logs

* [v23.4.0](https://github.com/status-im/nimbus-eth2/releases/tag/v23.4.0) seems to show similar behaviour

* [v23.3.2](https://github.com/status-im/nimbus-eth2/releases/tag/v23.3.2) didn't seem to have as many problems

Examples of incorrect-INVALID in those logs:

1:
Nimbus:

WRN 2023-04-27 00:12:17.548+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=8cf50d9a safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235058, proposer_index: 15, parent_root: "69b483c3", state_root: "9d065bb6", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: "teku-geth", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 2, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368175, fee_recipient: "0xf97e180c050e5ab072211ad2c213eb5aee4df134"), signature: "89d17655")"

Besu:

{"@timestamp":"2023-04-27T00:20:36,081","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineForkchoiceUpdated","message":"INVALID for fork-choice-update: head: 0x8cf50d9a8331ce7637977c0284a9fe83f6c526f9e7f15abc119afcebd8ffc342, finalized: 0x19588bc2f191f4debcf82c84fb1a7be67cc6ed128fa25369fa58e97c9c858f97, safeBlockHash: 0x0267e06f6118049d930361a6a3632b96f7abeb49a792f2167cbbabd131133595","throwable":""}

Block is valid: https://sepolia.beaconcha.in/slot/2235058

2:
Nimbus:

WRN 2023-04-27 00:25:51.944+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=bd28bfd1 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235059, proposer_index: 173, parent_root: \"94ada9d0\", state_root: \"e797fb26\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"Nethermind\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368176, fee_recipient: \"0x670b24610df99b1685aeac0dfd5307b92e0cf4d7\"), signature: \"ab13d979\")"
WRN 2023-04-27 00:25:52.008+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=8be964af safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235060, proposer_index: 1229, parent_root: \"058b662e\", state_root: \"a1f74c72\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"0xf09fa7b1f09f94a5204b696c6e2e666920f09f94a5f09fa7b100000000000000\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368177, fee_recipient: \"0x1268ad189526ac0b386faf06effc46779c340ee6\"), signature: \"82c6d608\")"
WRN 2023-04-27 00:25:52.068+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=ec9cad4d safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235061, proposer_index: 1779, parent_root: \"1101b3a5\", state_root: \"56383a9d\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"teku/v23.3.1\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368178, fee_recipient: \"0x8b0c2c4c8eb078bc6c01f48523764c8942c0c6c4\"), signature: \"b053cf59\")"

Besu:

{"@timestamp":"2023-04-27T00:25:51,944","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineForkchoiceUpdated","message":"INVALID for fork-choice-update: head: 0xbd28bfd13d3ddf2792e42c98bc048b8fcc89c918a028af49d519f285a5c321eb, finalized: 0x19588bc2f191f4debcf82c84fb1a7be67cc6ed128fa25369fa58e97c9c858f97, safeBlockHash: 0x0267e06f6118049d930361a6a3632b96f7abeb49a792f2167cbbabd131133595","throwable":""}
{"@timestamp":"2023-04-27T00:25:52,145","level":"WARN","thread":"vert.x-worker-thread-0","class":"DefaultBlockchain","message":"Chain Reorganization +34 new / -0 old
Old - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
New - hash: 0xc08142d51de91ea6b34bc92f6358b87f838ada7c2e3a7b9dd94253c7cfdb0acb, height: 3368179
Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:25:52,229","level":"WARN","thread":"nioEventLoopGroup-3-9","class":"DefaultBlockchain","message":"Chain Reorganization +0 new / -34 old
Old - hash: 0xc08142d51de91ea6b34bc92f6358b87f838ada7c2e3a7b9dd94253c7cfdb0acb, height: 3368179
New - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:25:52,371","level":"WARN","thread":"vert.x-worker-thread-0","class":"DefaultBlockchain","message":"Chain Reorganization +36 new / -0 old
Old - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
New - hash: 0x2a0d21df2f93d6a3675fff83c3c605cf471ea6250c35286a941dac71af85d31b, height: 3368181
Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:25:52,484","level":"WARN","thread":"nioEventLoopGroup-3-9","class":"DefaultBlockchain","message":"Chain Reorganization +0 new / -36 old
Old - hash: 0x2a0d21df2f93d6a3675fff83c3c605cf471ea6250c35286a941dac71af85d31b, height: 3368181
New - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:25:52,611","level":"WARN","thread":"vert.x-worker-thread-0","class":"DefaultBlockchain","message":"Chain Reorganization +38 new / -0 old
Old - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
New - hash: 0xfd4c23bcf746c6e566cf6ba6c2d8554f28808f7266dbed622b5a5d8ac5b3a38e, height: 3368183
Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}
{"@timestamp":"2023-04-27T00:25:52,690","level":"WARN","thread":"nioEventLoopGroup-3-9","class":"DefaultBlockchain","message":"Chain Reorganization +0 new / -38 old
Old - hash: 0xfd4c23bcf746c6e566cf6ba6c2d8554f28808f7266dbed622b5a5d8ac5b3a38e, height: 3368183
New - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145
Ancestor - hash: 0xbca29e861aabf1b975cde3ee6f147a86ad8de3bfab1a39bfb2d98594a72eb63d, height: 3368145","throwable":""}

All of these supposedly INVALID blocks aren't, and same point as before applies to Nimbus correctly stalling syncing as a consequence: https://sepolia.beaconcha.in/slot/2235059 https://sepolia.beaconcha.in/slot/2235060 https://sepolia.beaconcha.in/slot/2235061

3:

WRN 2023-04-27 00:12:13.989+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=c34ff7b5 safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235032, proposer_index: 1423, parent_root: \"28efccce\", state_root: \"9336298e\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Simply Staking\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368151, fee_recipient: \"0x0000000000000000000000000000000000000000\"), signature: \"ab3dc5a7\")"
WRN 2023-04-27 00:12:14.048+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=db594049 safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235033, proposer_index: 114, parent_root: \"660716ac\", state_root: \"8d7b362d\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"Nethermind\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368152, fee_recipient: \"0x670b24610df99b1685aeac0dfd5307b92e0cf4d7\"), signature: \"a2c9bcca\")"
WRN 2023-04-27 00:12:14.112+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=43d30cc0 safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235034, proposer_index: 1249, parent_root: \"811a3060\", state_root: \"0e04acaa\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"0xf09fa7b1f09f94a5204b696c6e2e666920f09f94a5f09fa7b100000000000000\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368153, fee_recipient: \"0x1268ad189526ac0b386faf06effc46779c340ee6\"), signature: \"8881eea9\")"
WRN 2023-04-27 00:12:14.176+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=adb70910 safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235035, proposer_index: 1012, parent_root: \"da45258b\", state_root: \"cfe9fb70\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Goerli Nunki @q9f Lodestar\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 2, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368154, fee_recipient: \"0xff58d746a67c2e42bcc07d6b3f58406e8837e883\"), signature: \"8b435ae4\")"
WRN 2023-04-27 00:12:14.236+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=23d7c1eb safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235037, proposer_index: 1417, parent_root: \"900797c2\", state_root: \"fe655c16\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Simply Staking\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 2, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368155, fee_recipient: \"0x0000000000000000000000000000000000000000\"), signature: \"875b36b8\")"
WRN 2023-04-27 00:12:14.300+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=64fe5dd3 safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235038, proposer_index: 1834, parent_root: \"6fa89010\", state_root: \"9f14fba1\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Obol Distributed Validator\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368156, fee_recipient: \"0x9a6034c84cd431409ac1a35278c7da36ffda53e5\"), signature: \"81407816\")"
WRN 2023-04-27 00:12:14.360+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=c247ecf0 safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235039, proposer_index: 1181, parent_root: \"371db712\", state_root: \"8baee00b\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"grand rising\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368157, fee_recipient: \"0x0000006916a87b82333f4245046623b23794c65c\"), signature: \"981a29c4\")"
WRN 2023-04-27 00:12:14.432+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=a155d43f safeBlockHash=0267e06f finalizedBlockHash=054894b7 receivedBlock="(blck: (slot: 2235040, proposer_index: 381, parent_root: \"e9975ac1\", state_root: \"69dabd40\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Lighthouse/v3.5.0-0fb58a6\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 450, block_number: 3368158, fee_recipient: \"0x008b3b2f992c0e14edaa6e2c662bec549caa8df1\"), signature: \"a7ddf133\")"
WRN 2023-04-27 00:12:14.503+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=7905e3bf safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235041, proposer_index: 1423, parent_root: \"9083b0e1\", state_root: \"a2587192\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Simply Staking\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 450, block_number: 3368159, fee_recipient: \"0x0000000000000000000000000000000000000000\"), signature: \"b2b4742b\")"
WRN 2023-04-27 00:12:14.561+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=f214c0cf safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235042, proposer_index: 806, parent_root: \"55c17d1f\", state_root: \"abacaa06\", eth1data: (deposit_root: d70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, deposit_count: 0, block_hash: 491ebac1b7f9c0eb426047a495dc577140cb3e09036cd3f7266eda86b635d9fa), graffiti: \"\\\"prysm-k8s\\\"\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368160, fee_recipient: \"0xd9a5179f091d85051d3c982785efd1455cec8699\"), signature: \"8abe189e\")"
WRN 2023-04-27 00:12:14.616+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=2a71cdcc safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235043, proposer_index: 841, parent_root: \"eaa479f0\", state_root: \"82b660cb\", eth1data: (deposit_root: d70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, deposit_count: 0, block_hash: 491ebac1b7f9c0eb426047a495dc577140cb3e09036cd3f7266eda86b635d9fa), graffiti: \"\\\"prysm-k8s\\\"\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368161, fee_recipient: \"0xd9a5179f091d85051d3c982785efd1455cec8699\"), signature: \"b25cfffa\")"
WRN 2023-04-27 00:12:14.676+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=12bcbdce safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235044, proposer_index: 948, parent_root: \"a584be5e\", state_root: \"ba3354b8\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Attestant/Vouch\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368162, fee_recipient: \"0x320715b08bcf4cac1df2c55288a6bad79da1566b\"), signature: \"aeb42efd\")"
WRN 2023-04-27 00:12:14.732+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=feb805ad safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235045, proposer_index: 1519, parent_root: \"d70e7956\", state_root: \"ad179ba8\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"teku/v23.2.0+14-g596a9acf9\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368163, fee_recipient: \"0x6a7aa9b882d50bb7bc5da1a244719c99f12f06a3\"), signature: \"a8847012\")"
WRN 2023-04-27 00:12:14.784+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=5e574a2b safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235046, proposer_index: 1427, parent_root: \"89f2001c\", state_root: \"f6fb3a71\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Simply Staking\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368164, fee_recipient: \"0x0000000000000000000000000000000000000000\"), signature: \"b508956a\")"
WRN 2023-04-27 00:12:14.844+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=eb7267c6 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235047, proposer_index: 1398, parent_root: \"1ed94cdf\", state_root: \"174fe7b9\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Lighthouse/v4.1.0\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368165, fee_recipient: \"0xc4bfccb1668d6e464f33a76badd8c8d7d341e04a\"), signature: \"915552c3\")"
WRN 2023-04-27 00:12:14.900+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=46e86e3b safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235048, proposer_index: 322, parent_root: \"9f79cd08\", state_root: \"48177e0f\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Lighthouse/v3.5.0-0fb58a6\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368166, fee_recipient: \"0x008b3b2f992c0e14edaa6e2c662bec549caa8df1\"), signature: \"a731a376\")"
WRN 2023-04-27 00:12:14.960+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=2208824f safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235049, proposer_index: 1077, parent_root: \"11cabd31\", state_root: \"d1dbd15e\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Goerli Nunki @q9f Lodestar\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 2, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368167, fee_recipient: \"0xff58d746a67c2e42bcc07d6b3f58406e8837e883\"), signature: \"917fafb1\")"
WRN 2023-04-27 00:12:15.016+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=b71e5d6e safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235050, proposer_index: 1275, parent_root: \"eff7f854\", state_root: \"af69154a\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"0xf09fa7b1f09f94a5204b696c6e2e666920f09f94a5f09fa7b100000000000000\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368168, fee_recipient: \"0x1268ad189526ac0b386faf06effc46779c340ee6\"), signature: \"82198e3c\")"
WRN 2023-04-27 00:12:15.076+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=9c124f54 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235051, proposer_index: 823, parent_root: \"c479a69e\", state_root: \"626e5b22\", eth1data: (deposit_root: d70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, deposit_count: 0, block_hash: 491ebac1b7f9c0eb426047a495dc577140cb3e09036cd3f7266eda86b635d9fa), graffiti: \"\\\"prysm-k8s\\\"\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368169, fee_recipient: \"0xd9a5179f091d85051d3c982785efd1455cec8699\"), signature: \"8eefda0f\")"
WRN 2023-04-27 00:12:15.132+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=5f9f7758 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235052, proposer_index: 1158, parent_root: \"42b48afe\", state_root: \"352b8316\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"grand rising\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368170, fee_recipient: \"0x0000006916a87b82333f4245046623b23794c65c\"), signature: \"b5518fb2\")"
WRN 2023-04-27 00:12:15.188+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=36d6ac37 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235053, proposer_index: 1945, parent_root: \"53bdba3a\", state_root: \"4c014413\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368171, fee_recipient: \"0x4ff58ed19a4a4eb0440804b89cb6a1ec12e05979\"), signature: \"9035a6ac\")"
WRN 2023-04-27 00:12:15.248+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=22e56550 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235054, proposer_index: 803, parent_root: \"8613c18c\", state_root: \"e52700ec\", eth1data: (deposit_root: d70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, deposit_count: 0, block_hash: 491ebac1b7f9c0eb426047a495dc577140cb3e09036cd3f7266eda86b635d9fa), graffiti: \"\\\"prysm-k8s\\\"\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368172, fee_recipient: \"0xd9a5179f091d85051d3c982785efd1455cec8699\"), signature: \"b9724e84\")"
WRN 2023-04-27 00:12:15.304+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=5aa9ba86 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235055, proposer_index: 1866, parent_root: \"44dc04b6\", state_root: \"e4dc7890\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Obol Distributed Validator\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368173, fee_recipient: \"0x9a6034c84cd431409ac1a35278c7da36ffda53e5\"), signature: \"81542417\")"
WRN 2023-04-27 00:12:15.364+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=cac157b4 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235056, proposer_index: 469, parent_root: \"46c1c5d4\", state_root: \"7a68e763\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"Lighthouse/v4.1.0-be8231b\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368174, fee_recipient: \"0xc6e2459991bfe27cca6d86722f35da23a1e4cb97\"), signature: \"8dbbb1c3\")"

Which are https://sepolia.beaconcha.in/slot/2235032 https://sepolia.beaconcha.in/slot/2235033 https://sepolia.beaconcha.in/slot/2235034 https://sepolia.beaconcha.in/slot/2235035 https://sepolia.beaconcha.in/slot/2235037 https://sepolia.beaconcha.in/slot/2235038 https://sepolia.beaconcha.in/slot/2235039 https://sepolia.beaconcha.in/slot/2235040 https://sepolia.beaconcha.in/slot/2235041 https://sepolia.beaconcha.in/slot/2235042 https://sepolia.beaconcha.in/slot/2235043 https://sepolia.beaconcha.in/slot/2235044 https://sepolia.beaconcha.in/slot/2235045 https://sepolia.beaconcha.in/slot/2235046 https://sepolia.beaconcha.in/slot/2235047 https://sepolia.beaconcha.in/slot/2235048 https://sepolia.beaconcha.in/slot/2235049 https://sepolia.beaconcha.in/slot/2235050 https://sepolia.beaconcha.in/slot/2235051 https://sepolia.beaconcha.in/slot/2235052 https://sepolia.beaconcha.in/slot/2235053 https://sepolia.beaconcha.in/slot/2235054 https://sepolia.beaconcha.in/slot/2235055 https://sepolia.beaconcha.in/slot/2235056

all of which are are valid, despite Besu claiming otherwise in these logs.

4:

WRN 2023-04-27 00:25:51.944+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=bd28bfd1 safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235059, proposer_index: 173, parent_root: \"94ada9d0\", state_root: \"e797fb26\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: ce76841f95090ea8bab2c02c352c96af743547e7a8022fd5dd1cfd665fe2a0ce), graffiti: \"Nethermind\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368176, fee_recipient: \"0x670b24610df99b1685aeac0dfd5307b92e0cf4d7\"), signature: \"ab13d979\")"
WRN 2023-04-27 00:25:52.008+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=8be964af safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235060, proposer_index: 1229, parent_root: \"058b662e\", state_root: \"a1f74c72\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"0xf09fa7b1f09f94a5204b696c6e2e666920f09f94a5f09fa7b100000000000000\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368177, fee_recipient: \"0x1268ad189526ac0b386faf06effc46779c340ee6\"), signature: \"82c6d608\")"
WRN 2023-04-27 00:25:52.068+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=ec9cad4d safeBlockHash=0267e06f finalizedBlockHash=19588bc2 receivedBlock="(blck: (slot: 2235061, proposer_index: 1779, parent_root: \"1101b3a5\", state_root: \"56383a9d\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"teku/v23.3.1\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368178, fee_recipient: \"0x8b0c2c4c8eb078bc6c01f48523764c8942c0c6c4\"), signature: \"b053cf59\")"

which are https://sepolia.beaconcha.in/slot/2235059 https://sepolia.beaconcha.in/slot/2235060 https://sepolia.beaconcha.in/slot/2235061 also all valid.

5:

WRN 2023-04-27 00:26:29.832+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=4da34cad safeBlockHash=a155d43f finalizedBlockHash=0267e06f receivedBlock="(blck: (slot: 2235090, proposer_index: 1552, parent_root: \"8197deb2\", state_root: \"655e2df8\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 52f93be4726098aba5e18ea6282274182e5c2d902d6550852897409d3c201701), graffiti: \"teku-geth\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 462, block_number: 3368205, fee_recipient: \"0xf97e180c050e5ab072211ad2c213eb5aee4df134\"), signature: \"8fe23b29\")"

which is valid: https://sepolia.beaconcha.in/slot/2235090

6:

{"@timestamp":"2023-04-27T00:12:13,931","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"MainnetBlockValidator","message":"Invalid block 3368146 (0x58bddc61c3dc566ef5b915919bed392bb1e012e40e984b8d7f5b8459caa2165e): Optional[World State Root does not match expected value, header 0x15344d06d38b3f5b2b00c9ddd8ac204ade8ca995c712fa08dd1a9db058fa2c17 calculated 0x640d163f6c61bef433dfb58a1ea6e7f9bdf0cd3354e292c8aa0a18a25e9356ad], caused by java.lang.RuntimeException: World State Root does not match expected value, header 0x15344d06d38b3f5b2b00c9ddd8ac204ade8ca995c712fa08dd1a9db058fa2c17 calculated 0x640d163f6c61bef433dfb58a1ea6e7f9bdf0cd3354e292c8aa0a18a25e9356ad","throwable":""}
...
{"@timestamp":"2023-04-27T00:26:29,771","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"MainnetBlockValidator","message":"Invalid block 3368146 (0x58bddc61c3dc566ef5b915919bed392bb1e012e40e984b8d7f5b8459caa2165e): Optional[World State Root does not match expected value, header 0x15344d06d38b3f5b2b00c9ddd8ac204ade8ca995c712fa08dd1a9db058fa2c17 calculated 0x5f595bb7897ce568bef824bc849e91628faf22deca81844d8291d162ac7f333d], caused by java.lang.RuntimeException: World State Root does not match expected value, header 0x15344d06d38b3f5b2b00c9ddd8ac204ade8ca995c712fa08dd1a9db058fa2c17 calculated 0x5f595bb7897ce568bef824bc849e91628faf22deca81844d8291d162ac7f333d","throwable":""}
...
{"@timestamp":"2023-04-27T00:26:30,573","level":"INFO","thread":"nioEventLoopGroup-3-9","class":"MainnetBlockValidator","message":"Invalid block 3368146 (0x58bddc61c3dc566ef5b915919bed392bb1e012e40e984b8d7f5b8459caa2165e): Optional[World State Root does not match expected value, header 0x15344d06d38b3f5b2b00c9ddd8ac204ade8ca995c712fa08dd1a9db058fa2c17 calculated 0xf6eeac46dd9f1a5c7ccf923d78f27644eacb4a126f16a686d7ec46b427a03f4b], caused by java.lang.RuntimeException: World State Root does not match expected value, header 0x15344d06d38b3f5b2b00c9ddd8ac204ade8ca995c712fa08dd1a9db058fa2c17 calculated 0xf6eeac46dd9f1a5c7ccf923d78f27644eacb4a126f16a686d7ec46b427a03f4b","throwable":""}

which corresponds to https://sepolia.beaconcha.in/block/3368146

In all of these cases, Besu is still effectively closing off the rest of the actually-valid/finalized chain from Nimbus via https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#transitioning-from-valid---invalidated-or-invalidated---valid and claiming that VALID blocks are INVALID.

Nimbus forgets these (not actually...) "unviable" forks every restart, as it does not persist this particular cache, so restarting Nimbus appears to help briefly, but soon enough, these logs show Besu finding some other block to falsely label as INVALID.

@tersec
Copy link

tersec commented May 7, 2023

(1) Node offline during upgrade so I'd expect a small backwards sync (or lock-step sync depending on client). During this expected backwards sync though, we got lots of messages like this...
(example logs)
Need to confirm, but I believe teku will give us the latest hash they have once they are in sync and we get a consistent set of % complete logs for phase 2 of 2 with one completed message at the end. Something about the pattern of requests nimbus is sending is causing multiple completed logs.

Neither is a spec-required behavior according to https://github.com/ethereum/execution-apis/blob/main/src/engine/shanghai.md#engine_newpayloadv2 and https://github.com/ethereum/execution-apis/blob/main/src/engine/shanghai.md#engine_forkchoiceupdatedv2

(2) Besu is "in sync" as evidenced by being able to propose but then is frequently being switched to backwards sync mode presumably by being given a hash that we don't have a parent for.

(3) The reorg logs, which suggests it's being thrashed between forks potentially...

(2) and (3) both could be same root cause, or maybe all 3 are same cause.

Separating replies somewhat for easier readability between the enumerating-of-examples one and this: when Nimbus is told, repeatedly, that the (CL) blocks the rest of the Sepolia chain is claiming are correct aren't by Besu by way of INVALID execution payloads, it's prone to do something which looks a bit from the EL side like a sequence of reorgs as it tries to find a non-INVALID chain to follow.

@tersec
Copy link

tersec commented May 8, 2023

Update 06/05/2023

...

Ran the same test on the three different nimbus versions on non-validating sepolia node, the test was just to restart nimbus:

* [v23.4.0-ea8ef579](https://github.com/status-im/nimbus-eth2/commit/ea8ef579) (from unstable branch)

* [v23.4.0](https://github.com/status-im/nimbus-eth2/releases/tag/v23.4.0)

* [v23.3.2](https://github.com/status-im/nimbus-eth2/releases/tag/v23.3.2)

Logs attached here: https://we.tl/t-RZMmscEG1A

In all cases, there was less noise than on the validating node which suggests validating (with lots of validators) may exacerbate the problems.

Looking through the v23.3.2-based logs:

{"@timestamp":"2023-05-06T09:35:53,687","level":"INFO","thread":"ForkJoinPool.commonPool-worker-31","class":"AbstractBlockProcessor","message":"Block processing error: transaction invalid transaction nonce 76513 below sender account nonce 76514. Block 0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 Transaction 0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5","throwable":""}
{"@timestamp":"2023-05-06T09:35:53,688","level":"INFO","thread":"ForkJoinPool.commonPool-worker-31","class":"MainnetBlockValidator","message":"Invalid block 3429506 (0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1): Optional[Block processing error: transaction invalid transaction nonce 76513 below sender account nonce 76514. Block 0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 Transaction 0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5]","throwable":""}
{"@timestamp":"2023-05-06T09:35:53,691","level":"DEBUG","thread":"ForkJoinPool.commonPool-worker-31","class":"BackwardSyncContext","message":"Not recoverable backward sync exception org.hyperledger.besu.ethereum.eth.sync.backwardsync.BackwardSyncException: Cannot save block 3429506 (0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1) because of Block processing error: transaction invalid transaction nonce 76513 below sender account nonce 76514. Block 0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 Transaction 0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5","throwable":""}

https://sepolia.beaconcha.in/block/0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 is block and https://sepolia.beaconcha.in/tx/0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5 is the transaction identified. Apparently the nonce was not in fact below the sender account nonce after all.

See usual remarks about what happens with INVALIDs.

{"@timestamp":"2023-05-06T09:48:54,622","level":"INFO","thread":"ForkJoinPool.commonPool-worker-33","class":"MainnetBlockValidator","message":"Invalid block 3429536 (0x468ae2859df900ef9827b76f0b4ed7c163d055d6d41d44288fe21dd7fb238514): Optional[World State Root does not match expected value, header 0x57858c645fafcefa27005819eae52094ffe9c1a211a29e99ab9ed06c863fbec3 calculated 0x4ca397ab50c7dca592489edfb891086550e3c5866d8bc9e1b936da8fdb073b0e], caused by java.lang.RuntimeException: World State Root does not match expected value, header 0x57858c645fafcefa27005819eae52094ffe9c1a211a29e99ab9ed06c863fbec3 calculated 0x4ca397ab50c7dca592489edfb891086550e3c5866d8bc9e1b936da8fdb073b0e","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,645","level":"INFO","thread":"nioEventLoopGroup-3-1","class":"AbstractBlockProcessor","message":"Block processing error: transaction invalid transaction nonce 289 does not match sender account nonce 279.. Block 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b Transaction 0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,645","level":"INFO","thread":"nioEventLoopGroup-3-1","class":"MainnetBlockValidator","message":"Invalid block 3429547 (0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b): Optional[Block processing error: transaction invalid transaction nonce 289 does not match sender account nonce 279.. Block 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b Transaction 0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422]","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,655","level":"DEBUG","thread":"nioEventLoopGroup-3-1","class":"ForwardSyncStep","message":"Getting 6 blocks from peers failed with reason org.hyperledger.besu.ethereum.eth.sync.backwardsync.BackwardSyncException: Cannot save block 3429547 (0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b) because of Block processing error: transaction invalid transaction nonce 289 does not match sender account nonce 279.. Block 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b Transaction 0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422, reducing batch size to 51","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,736","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"Invalid new payload: number: 3429549, hash: 0x7624cb3036332114dcc2a90460d49c613510955c8da06870d122a9dcb80a49ca, parentHash: 0x716d26613fb844ac07b8e4310aa0622e2bb6a5d72f2a55fac622399186aa4182, latestValidHash: 0x768b18c2ec0498ca0496a7636a8fb13eabcb2f6175494ba384f506c5f466c535, status: INVALID, validationError: Block already present in bad block manager.","throwable":""}

which are https://sepolia.beaconcha.in/block/0x468ae2859df900ef9827b76f0b4ed7c163d055d6d41d44288fe21dd7fb238514, https://sepolia.beaconcha.in/block/0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b (tx is https://sepolia.beaconcha.in/tx/0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422), and https://sepolia.beaconcha.in/block/3429549, all of which are evidently VALID.

INF 2023-05-06 09:12:32.391+00:00 execution payload forkChoiceUpdated status ACCEPTED/SYNCING, but was previously VALID payloadExecutionStatus=SYNCING headBlockHash=a6ce55e8 safeBlockHash=ed9eacbe finalizedBlockHash=77de788d receivedBlock="(blck: (slot: 2302562, proposer_index: 504, parent_root: \"593c3b0f\", state_root: \"f3a89bd9\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 9849c7175de8856952042065f4471c7ac5da043bab18e0d9adedc18438a94382), graffiti: \"teku/v23.3.0+53-ga3df3ae1e\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 2, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 471, block_number: 3429434, fee_recipient: \"0xf24a01ae29dec4629dfb4170647c4ed4efc392cd\"), signature: \"a9d50b22\")"

is allowed by the engine API but kind of strange -- https://sepolia.beaconcha.in/block/3429434 had been by newPayload marked as VALID, but then when Nimbus tried to fcU to it, Besu said it was now ACCEPTED or SYNCING (which Nimbus treats identically).

For ea8ef579:
4.

{"@timestamp":"2023-05-06T09:35:53,687","level":"INFO","thread":"ForkJoinPool.commonPool-worker-31","class":"AbstractBlockProcessor","message":"Block processing error: transaction invalid transaction nonce 76513 below sender account nonce 76514. Block 0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 Transaction 0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5","throwable":""}
{"@timestamp":"2023-05-06T09:35:53,688","level":"INFO","thread":"ForkJoinPool.commonPool-worker-31","class":"MainnetBlockValidator","message":"Invalid block 3429506 (0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1): Optional[Block processing error: transaction invalid transaction nonce 76513 below sender account nonce 76514. Block 0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 Transaction 0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5]","throwable":""}
{"@timestamp":"2023-05-06T09:35:53,691","level":"DEBUG","thread":"ForkJoinPool.commonPool-worker-31","class":"BackwardSyncContext","message":"Not recoverable backward sync exception org.hyperledger.besu.ethereum.eth.sync.backwardsync.BackwardSyncException: Cannot save block 3429506 (0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1) because of Block processing error: transaction invalid transaction nonce 76513 below sender account nonce 76514. Block 0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 Transaction 0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5","throwable":""}

In actuality valid) blocks involved are https://sepolia.beaconcha.in/block is /0xb499a29588fe47f551212a1470c2c93440c81bdaaedd7f5cf075f70e8230e7c1 and transaction is https://sepolia.beaconcha.in/tx/0xd0bc8ce38682992a4989135e805e6c7dad1a853e996517d63f0a00dd058724a5

Nimbus has some more interesting/informative logs here this time about how it responded (apparently via timestamp to this batch of spurious-INVALIDs):

INF 2023-05-06 09:35:48.000+00:00 Slot start                                 topics="beacnde" slot=2302679 epoch=71958 sync="00h00m (100.00%) 109802048057794960.0000slots/s (wwPQwwwwww:2302623)/opt" peers=2 head=420ed492:2302641 finalized=71955:593c3b0f delay=66us
INF 2023-05-06 09:35:48.003+00:00 Slot end                                   topics="beacnde" slot=2302679 nextActionWait=n/a nextAttestationSlot=-1 nextProposalSlot=-1 syncCommitteeDuties=none head=420ed492:2302641
WRN 2023-05-06 09:35:53.693+00:00 updateHeadWithExecution: attempting to recover from invalid payload attempts=1 maxAttempts=5 newHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" initialNewHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" nextHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)"
WRN 2023-05-06 09:35:53.739+00:00 updateHeadWithExecution: attempting to recover from invalid payload attempts=2 maxAttempts=5 newHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" initialNewHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" nextHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)"
WRN 2023-05-06 09:35:53.783+00:00 updateHeadWithExecution: attempting to recover from invalid payload attempts=3 maxAttempts=5 newHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" initialNewHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" nextHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)"
WRN 2023-05-06 09:35:53.827+00:00 updateHeadWithExecution: attempting to recover from invalid payload attempts=4 maxAttempts=5 newHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" initialNewHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)" nextHead="(blck: ..., safeExecutionPayloadHash: 70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a, finalizedExecutionPayloadHash: 95484b7665367c145dd3ff5ea548dd3bf563ffc153e229112f94cf48a2ed12c4)"
NOT 2023-05-06 09:35:53.893+00:00 Updated head block with chain reorg        topics="chaindag" headParent=beaef498:2302636 stateRoot=c763229b justified=71956:3aae05a1 finalized=71955:593c3b0f isOptHead=false newHead=ef73dd83:2302637 lastHead=420ed492:2302641
WRN 2023-05-06 09:35:53.955+00:00 execution payload forkChoiceUpdated status INVALID, but was previously VALID payloadExecutionStatus=INVALID headBlockHash=91bec559 safeBlockHash=70ef8f89 finalizedBlockHash=95484b76 receivedBlock="(blck: (slot: 2302643, proposer_index: 1780, parent_root: \"f619d29f\", state_root: \"fc3f28d7\", eth1data: (deposit_root: 9df92d765b5aa041fd4bbe8d5878eb89290efa78e444c1a603eecfae2ea05fa4, deposit_count: 403, block_hash: 9849c7175de8856952042065f4471c7ac5da043bab18e0d9adedc18438a94382), graffiti: \"charon/v0.14.1-c3fe208\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 1, deposits_len: 0, voluntary_exits_len: 0, sync_committee_participants: 471, block_number: 3429510, fee_recipient: \"0x9a6034c84cd431409ac1a35278c7da36ffda53e5\"), signature: \"857ab9a7\")"
WRN 2023-05-06 09:35:59.320+00:00 Peer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
INF 2023-05-06 09:36:00.000+00:00 Slot start                                 topics="beacnde" slot=2302680 epoch=71958 sync=synced peers=0 head=955caf69:2302655 finalized=71955:593c3b0f delay=131us342ns

Nimbus kept trying to find a valid fork choice head, and to do so, it checked with the EL each time, but each time, Besu told it the block it asked about was invalid, so it simply didn't update the head, and had no viable way to proceed.

Also as a result, this it lost peers: Besu effectively told it its (CL/libp2p) peers were giving it garbage (INVALID) blocks, so it dropped those peers via libp2p gossip scoring. It didn't have a lot of peers before, but before/after the slot in question where Besu provides the spurious INVALIDs, it goes from

INF 2023-05-06 09:35:48.000+00:00 Slot start                                 topics="beacnde" slot=2302679 epoch=71958 sync="00h00m (100.00%) 109802048057794960.0000slots/s (wwPQwwwwww:2302623)/opt" peers=2 head=420ed492:2302641 finalized=71955:593c3b0f delay=66us

with 2 peers to

WRN 2023-05-06 09:35:59.320+00:00 Peer count low, no new peers discovered    topics="networking" discovered_nodes=0 new_peers=@[] current_peers=0 wanted_peers=160
INF 2023-05-06 09:36:00.000+00:00 Slot start                                 topics="beacnde" slot=2302680 epoch=71958 sync=synced peers=0 head=955caf69:2302655 finalized=71955:593c3b0f delay=131us342ns

with 0 peers, which hinders recovery.

This is not required by spec, but it's a reasonable heuristic -- in general, staying connected to a CL peer providing INVALID blocks is a good way to waste resources with spam.

{"@timestamp":"2023-05-06T09:48:54,622","level":"INFO","thread":"ForkJoinPool.commonPool-worker-33","class":"MainnetBlockValidator","message":"Invalid block 3429536 (0x468ae2859df900ef9827b76f0b4ed7c163d055d6d41d44288fe21dd7fb238514): Optional[World State Root does not match expected value, header 0x57858c645fafcefa27005819eae52094ffe9c1a211a29e99ab9ed06c863fbec3 calculated 0x4ca397ab50c7dca592489edfb891086550e3c5866d8bc9e1b936da8fdb073b0e], caused by java.lang.RuntimeException: World State Root does not match expected value, header 0x57858c645fafcefa27005819eae52094ffe9c1a211a29e99ab9ed06c863fbec3 calculated 0x4ca397ab50c7dca592489edfb891086550e3c5866d8bc9e1b936da8fdb073b0e","throwable":""}
{"@timestamp":"2023-05-06T09:48:54,624","level":"DEBUG","thread":"ForkJoinPool.commonPool-worker-33","class":"BackwardSyncContext","message":"Not recoverable backward sync exception org.hyperledger.besu.ethereum.eth.sync.backwardsync.BackwardSyncException: Cannot save block 3429536 (0x468ae2859df900ef9827b76f0b4ed7c163d055d6d41d44288fe21dd7fb238514) because of World State Root does not match expected value, header 0x57858c645fafcefa27005819eae52094ffe9c1a211a29e99ab9ed06c863fbec3 calculated 0x4ca397ab50c7dca592489edfb891086550e3c5866d8bc9e1b936da8fdb073b0e","throwable":""}
{"@timestamp":"2023-05-06T09:48:54,624","level":"INFO","thread":"ForkJoinPool.commonPool-worker-33","class":"BackwardSyncContext","message":"Current backward sync session failed, it will be restarted","throwable":""}

Block is https://sepolia.beaconcha.in/block/3429536 and, slightly interesting specifically because it notes this is why that backwards sync failed.

{"@timestamp":"2023-05-06T09:48:55,643","level":"ERROR","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"Could not persist world for root hash 0x8f3fc09fdd800e9f7cc5b318a3e096b016c106f22b61fce86dc3694ec47aa79f and block hash 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,645","level":"INFO","thread":"nioEventLoopGroup-3-1","class":"AbstractBlockProcessor","message":"Block processing error: transaction invalid transaction nonce 289 does not match sender account nonce 279.. Block 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b Transaction 0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,645","level":"INFO","thread":"nioEventLoopGroup-3-1","class":"MainnetBlockValidator","message":"Invalid block 3429547 (0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b): Optional[Block processing error: transaction invalid transaction nonce 289 does not match sender account nonce 279.. Block 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b Transaction 0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422]","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,655","level":"DEBUG","thread":"nioEventLoopGroup-3-1","class":"ForwardSyncStep","message":"Getting 6 blocks from peers failed with reason org.hyperledger.besu.ethereum.eth.sync.backwardsync.BackwardSyncException: Cannot save block 3429547 (0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b) because of Block processing error: transaction invalid transaction nonce 289 does not match sender account nonce 279.. Block 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b Transaction 0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422, reducing batch size to 51","throwable":""}

https://sepolia.beaconcha.in/block/0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b as usual is block and https://sepolia.beaconcha.in/tx/0x8a909a3ab8c35f1322c1954e47e3d06b1ed4ec6379c138ff8db4067c3af54422 is transaction.

{"@timestamp":"2023-05-06T09:48:55,734","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"New payload: number: 3429548, hash: 0x716d26613fb844ac07b8e4310aa0622e2bb6a5d72f2a55fac622399186aa4182, parentHash: 0x3f9ad73497e9c8e65377ac2401009349ae8765e6699e21aec86573c3a6ec418b, latestValidHash: 0x716d26613fb844ac07b8e4310aa0622e2bb6a5d72f2a55fac622399186aa4182, status: VALID","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,735","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"AbstractEngineForkchoiceUpdated","message":"Forkchoice parameters headBlockHash=0x716d26613fb844ac07b8e4310aa0622e2bb6a5d72f2a55fac622399186aa4182, safeBlockHash=0x735a92e16783076d005fc9f78094cfcdd001ae52db2ac8a3213af0860e8d486f, finalizedBlockHash=0x70ef8f89c1f29f726ff3201455be223f93a83d9b808134ebf23fd01c5c6e410a","throwable":""}
{"@timestamp":"2023-05-06T09:48:55,736","level":"DEBUG","thread":"vert.x-worker-thread-0","class":"AbstractEngineNewPayload","message":"Invalid new payload: number: 3429549, hash: 0x7624cb3036332114dcc2a90460d49c613510955c8da06870d122a9dcb80a49ca, parentHash: 0x716d26613fb844ac07b8e4310aa0622e2bb6a5d72f2a55fac622399186aa4182, latestValidHash: 0x768b18c2ec0498ca0496a7636a8fb13eabcb2f6175494ba384f506c5f466c535, status: INVALID, validationError: Block already present in bad block manager.","throwable":""}

Spurious-invalid block here is https://sepolia.beaconcha.in/block/3429549 even though Besu marked its parent explicitly as VALID. No reason given beyond "block already present in bad block manager", but not sure why.

@siladu
Copy link
Contributor Author

siladu commented Jun 20, 2023

Hi @tersec, I agree with your analysis of besu's INVALID responses.

Besu marking blocks as INVALID due to World State Root does not match or invalid transaction nonce error is most likely a Besu data inconsistency error. In certain cases, it may even indicate that the database was corrupted and may not be recoverable. I think in most of my tests, the situation was recoverable but still caused some havoc in the syncing process. Essentially, once we start incorrectly marking blocks as INVALID all bets are off. We need to address the root cause in Besu for this class of errors.

For the tests linked in this issue, I can't explain the exact cause that led to INVALID, but after performing tests with my PR, I have not seen the "invalid transaction nonce" error and have only seen one instance of the "World State Root does not match" error so an improvement has been made with PR (and should be released this week).

My PR focused on fixing the "reorg" warnings during backwards sync (Observation (3) above), which indicated Besu was moving the head by a large number of blocks. This bug was due to some errant block hash that we were holding onto in our state and trying to move the head to it.

I have created an issue for the remaining Besu error: #5622

@siladu
Copy link
Contributor Author

siladu commented Jun 20, 2023

@tersec While testing my PR, I have been collecting engine api logs via a proxy and have some analysis to share.

Summary

The most important finding shows that nimbus sometimes sends batches of out of order newPayloads before sending a FCU.
The batches are often in reverse order but with the FCU associated with the batch matching the final payload (with lowest blockNumber).
Some other clients send newPayload batches too but in sequential order, so I believe the newPayload ordering is responsible for a lot of the issues I'm still seeing. I can provide some detailed examples from logs if required.

Next steps for the original Observations raised:
❓ (1) I believe Nimbus sending newPayloads in sequential order should resolve this.
❓ (2) I believe not triggering overlapping backwards sync sessions would solve this (maybe besu could do optimisations to detect these scenarios and ignore them too)
✅ (3) I believe my PR fixes the large reorg logs

Besu Backwards Sync

For clarity I'll start with defining "backwards sync" from a besu point of view...
When besu receives an EL block hash from a newPayload or an FCU and it doesn't have that EL block's parent in its database, it starts a backwards sync session.

The backward sync session will attempt to fill the gap in the chain between the latest imported EL block and the hash it has just received from the CL.

It fills this gap by querying other EL peers first for the headers "Backwards sync phase 1 of 2" (the actual backwards part, using the parent hash to traverse up the chain).
Once it has filled the gap with headers, it queries the EL peers for the EL block bodies and executes them in a "ForwardSyncStep", logged as "Backwards sync phase 2 of 2".

Once Besu has received this first hash, Besu does not require intervention from the CL in order to fill this gap, however it may ignore subsequent newPayload/FCUs while it is syncing if it already has that block imported, or it has already started a backwards sync session due to it.

Observation (1) - backwards sync triggered instead of lock-step

As stated in the summary, we receive batches of newPayload (with 1 FCU per batch) in reverse order by block number.
For example, when besu's latest block is 7, nimbus sent in this order:

  • newPayload[blockHash_10],
  • newPayload[blockHash_9],
  • newPayload[blockHash_8],
  • FCU[blockHash_8]

We don't have the parent of block_10 in our chain so we start a backwards sync session upon receipt of that first newPayload.

I had previously stated on Discord that other clients send besu a single hash and wait for it to sync; this appears to only be the case during initial sync so I was mistaken w.r.t. the scenario being discussed here: client restarts/outages/recovery rather than initial sync behaviour.

In fact, when either EL or CL is restarted, other clients have a range of strategies that are some variant of "lock-step".
Sometimes these also cause besu to backwards sync, but newPayloads are sent in order and we don't get the overlapping backwards sync sessions.

Observation (2) - overlapping backwards sync sessions

Sometimes, before besu completes its current batch of backwards sync, nimbus gives it a different hash to sync. I observe that nimbus first sends hash_n then after some time sends hash_n-1, then hash_n-2, etc.
This may also be related to the reverse order newPayloads (although could happen with any out of order "future" hashes sent via newPayload or FCU).

Since besu's backward sync algorithm is multithreaded, we end up with multiple threads of work with overlapping batches of blocks. I think this explains the odd and out of order "percentage completed" logs.

For example, let's say besu is on block 5 out of 10, I've seen behaviour like this...

  • nimbus --- newPayload[blockHash_8] ---> besu starts a backwards sync session to fill the gap from block 5 - 8
  • nimbus --- newPayload[blockHash_7] ---> besu starts a concurrent backwards sync session to fill the gap from block 5 - 7

So nimbus isn't giving besu a chance to complete the backwards sync session before it requests a new one.

Couple of ways besu would behave better in this scenario:

(i) sending newPayloads in order may avoid the majority of these cases.

(ii) If nimbus resends newPayload[blockHash_8] then besu will disregard since it can detect it has already started a backwards sync session to cover all blocks up to this hash. Not sure there's any merit to doing that from a nimbus POV.

(iii) If nimbus checks which block besu is on by using eth_blockNumber/eth_getBlockByHash or similar then maybe it can send a more appropriate hash for besu to sync to. I know you'd mentioned safety issue with this before but if nimbus knows it has sent an FCU to besu then isn't it reasonable to wait until besu has at least reached that hash before sending another?

eth_getLogs

Finally, there's the adjacent issue of lots of eth_getLogs request, which appeared to coincide with the issues observed.

Out of all the CL clients, nimbus sends us the most of these, I think because it rebuilds them from scratch every time either besu or nimbus is restarted (which is maybe a knock-on effect of besu being in "syncing" mode when it receives out of order newPayload).
I believe other clients must be persisting this information across restarts.
I don't think it's a problem in itself, other than the potential to cause besu resource constraints if it's having to do lots of other work at the same time.

@tersec
Copy link

tersec commented Jun 20, 2023

eth_getLogs

Finally, there's the adjacent issue of lots of eth_getLogs request, which appeared to coincide with the issues observed.

Out of all the CL clients, nimbus sends us the most of these, I think because it rebuilds them from scratch every time either besu or nimbus is restarted (which is maybe a knock-on effect of besu being in "syncing" mode when it receives out of order newPayload). I believe other clients must be persisting this information across restarts. I don't think it's a problem in itself, other than the potential to cause besu resource constraints if it's having to do lots of other work at the same time.

This is true for Nimbus, yeah. It's possible that https://eips.ethereum.org/EIPS/eip-4881 will change this at some point, but that's how it currently works.

@siladu
Copy link
Contributor Author

siladu commented Jun 21, 2023

Something else we could try in besu is not triggering backwards sync when we receive a newPayload (only when receiving FCU)

@siladu
Copy link
Contributor Author

siladu commented Jun 26, 2023

nimbus sometimes sends batches of out of order newPayloads before sending a FCU.
The batches are often in reverse order but with the FCU associated with the batch matching the final payload (with lowest blockNumber).

Hi @tersec what do you think about this ^?
Would it be difficult to send the batches newPayloads in ascending order?

I understand the order of newPayloads isn't mandated in the spec, but it seems like a bug to me to send a batch and then the FCU for the oldest rather than the latest one of that batch.

@kiss81
Copy link

kiss81 commented Jul 2, 2023

Is this issue causing what I experience here:
status-im/nimbus-eth2#5145
?

@siladu
Copy link
Contributor Author

siladu commented Jul 3, 2023

Is this issue causing what I experience here: status-im/nimbus-eth2#5145 ?

@kiss81 IIUC your issue is with nimbus reporting corrupted deposit logs? I have not experienced that issue, so not sure it's related.

The "backward syncing" referred to in this issue is about nimbus helping besu get its blocks back into sync after an outage.
Note, this is similar but different to Nimbus' "backfilling" since they refer to two different blockchains' data.

The deposit logs (eth_getLogs) are relevant because nimbus requests them every time either client is restarted, but I don't believe they are the cause of any backward syncing issues.

Feel free to post more of your besu logs in a new issue and we can see if we can spot anything that might help.

@siladu
Copy link
Contributor Author

siladu commented Jul 4, 2023

Something else we could try in besu is not triggering backwards sync when we receive a newPayload (only when receiving FCU)

I tested this out in #5666 and while it did eventually get back into sync, it took 2 hours following a simple restart of besu, so I don't think this is a silver bullet, and arguably worse than before.

Going to close this for now as I think any further progress in the short term may need to come from the nimbus side. We have a medium term goal of improving/rewriting our backward sync procedure, so not syncing on newPayload is certainly something we could explore further as part of that.

Correction, I think my first test was invalid because I have realised it actually wasn't properly in sync before I started the test.
Two subsequent "restart besu" tests have worked a lot more smoothly! I will feed this back to the team to work out next steps.

@siladu
Copy link
Contributor Author

siladu commented Aug 3, 2023

Reopening this since it is still an issue, at least on Sepolia.

This is one solution that will solve it however: #5687

@siladu siladu reopened this Aug 3, 2023
@siladu
Copy link
Contributor Author

siladu commented Aug 3, 2023

There's also been some more recent testing #5699 (comment)
which has shown some changes in how nimbus send newPayloads, but the outcome is the same, for a simple restart at least.

When restarting Besu with Nimbus as a CL, Besu receives new payload calls on chainhead+2 blocks. This triggers a backward sync for each new payload because the parent of the block doesn't exist in the blockchain. Eventually, Besu is able to recover from this loop of backward syncs with one block (see logs below).

@tersec
Copy link

tersec commented Aug 11, 2023

status-im/nimbus-eth2#5248 should help here, in response to status-im/nimbus-eth2#5242

@siladu
Copy link
Contributor Author

siladu commented Aug 14, 2023

status-im/nimbus-eth2#5248 should help here, in response to status-im/nimbus-eth2#5242

cc @ahamlat

@tersec
Copy link

tersec commented Dec 1, 2023

status-im/nimbus-eth2#5635 avoids some duplicate forkchoiceUpdated around proposals.

@macfarla macfarla added this to the 2024 Q1 Besu team milestone Feb 2, 2024
@non-fungible-nelson
Copy link
Contributor

have we been able to reproduce on besu latest?

@siladu
Copy link
Contributor Author

siladu commented Aug 27, 2024

I have not noticed this issue for some time - e.g. testnet nodes seem to recover OK after a restart/update, so suspect it's resolved. Can reopen if necessary

@siladu siladu closed this as completed Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working mainnet P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumtion) TeamChupa GH issues worked on by Chupacabara Team TeamGroot GH issues worked on by Groot Team
Projects
None yet
Development

No branches or pull requests

6 participants