Beacon sync fix stalling while block processing (#3037)
* Correct docu

why:
  `T` is mentioned in the metrics table but not explained

* Update state sync ticker

why:
  Print last named state for debugging unexpected states.

* Rename `nec_consensus_head` => `nec_sync_consensus_head`

why:
  This variable is syncer local, derived from what would vaguely be the
  consensus head. In fact, at some point it is the consensus head, but it
  will often keep that value while the consensus head advances.

* Handle legit system state when block processing is cancelled

why:
  This state context was previously missing. It happens with problematic
  blocks (i.e. corrupt or missing.) Rather than trying to remedy the batch
  queue, everything is cancelled and the batch queue is rebuilt from scratch.

* Update block queue with unexpectedly missing blocks

why:
  Concurrently serving `RPC` requests might cause a reset of the `FC`
  module data area. This in turn might produce a gap between the expected
  `FC` module top and the beginning of the already downloaded blocks list.

  Previously, this led to a deadlock because the missing blocks were never
  downloaded by the syncer, nor installed into the `FC` module via `RPC`
  (see the sketch after this list.)

* Fix copyright year
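
For illustration, here is a minimal, self-contained sketch of the gap handling
described in the *Update block queue with unexpectedly missing blocks* item
above. All names (`BlockNumber`, `UnprocList`, `amend`, `handleGap`) are toy
stand-ins rather than the syncer's real API; the real code merges the missing
range back into its interval set of unprocessed block numbers.

```nim
type
  BlockNumber = uint64
  UnprocList = seq[(BlockNumber, BlockNumber)]  # toy stand-in for the interval set

proc amend(q: var UnprocList; minPt, maxPt: BlockNumber) =
  ## Re-register a range of block numbers that still needs fetching.
  if minPt <= maxPt:
    q.add((minPt, maxPt))

proc handleGap(q: var UnprocList; imported, stagedKey: BlockNumber): bool =
  ## Detect a gap between the imported chain head and the first staged block
  ## and re-queue the missing numbers; returns true when a gap was found.
  if imported + 1 < stagedKey:
    q.amend(imported + 1, stagedKey - 1)
    result = true

when isMainModule:
  var q: UnprocList
  doAssert q.handleGap(imported = 100u64, stagedKey = 105u64)  # 101..104 missing
  doAssert q == @[(101u64, 104u64)]
```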
mjfh authored Jan 29, 2025
1 parent cdd71ca commit 8bc8094
Showing 7 changed files with 119 additions and 39 deletions.
41 changes: 23 additions & 18 deletions nimbus/sync/beacon/README.md
@@ -80,7 +80,7 @@ where

* *C* -- coupler, parent of the left endpoint of the chain of headers or blocks
to be fetched and imported. The block number of *C* is somewhere
between the ones of *B* and *C* inclusive.
between the ones of *B* and *C* inclusive.

* *L* -- **latest**, current value of this entity (with the same name) of the
**FC** module (i.e. the current value when looked up.) *L* need not
@@ -92,8 +92,12 @@ where
state eventually reaching *Y* (for notation *D<<H* see clause *(6)*
below.)

* *H* -- head, sync target header which typically was the value of a *sync to
new head* request (via RPC)
* *H* -- head, sync scrum target header which locks the value of *T* (see
below) while processing.

* *T* -- cached value of the last *consensus head* request (interpreted as
*sync to new head* instruction) sent from the **CL** via RPC (this
symbol is used in code comments of the implementation.)

The internal sync state (as opposed to the general state also including the
state of **FC**) is defined by the triple *(C,D,H)*. Other parameters like *L*
@@ -239,11 +243,13 @@ It will take a while for *nimbus_beacon_node* to catch up (see the
### Starting `nimbus` for syncing

As the syncing process is quite slow, it makes sense to pre-load the database
from an *Era1* archive (if available) before starting the real sync process.
The command for importing an *Era1* reproitory would be something like
from *Era1* and *Era* archives (if available) before starting the real sync
process. The command for importing *Era1* and *Era* repositories would be
something like

./build/nimbus_execution_client import \
--era1-dir:/path/to/main-era1/repo \
--era-dir:/path/to/main-era/repo \
...

which will take its time for the full *MainNet* Era1 repository (but way faster
@@ -254,7 +260,6 @@ started on the same machine where the beacon node runs with the command


./build/nimbus_execution_client \
--network=mainnet \
--engine-api=true \
--engine-api-port=8551 \
--engine-api-ws=true \
@@ -266,10 +271,10 @@ the corresponding options from the *nimbus-eth2* beacon source example.

### Syncing on a low memory machine

On a system with memory with *8GiB* the following additional options proved
On a system with memory around *8GiB* the following additional options proved
useful for *nimbus* to reduce the memory footprint.

For the *Era1* pre-load (if any) the following extra options apply to
For the *Era1*/*Era* pre-load (if any) the following extra options apply to
"*nimbus import*":

--chunk-size=1024
Expand Down Expand Up @@ -307,16 +312,16 @@ The following metrics are defined in *worker/update/metrics.nim* which will
be available if *nimbus* is compiled with the additional make flags
*NIMFLAGS="-d:metrics \-\-threads:on"*:

| *Variable* | *Logic type* | *Short description* |
|:-----------------------------|:------------:|:--------------------|
| | | |
| nec_base | block height | **B**, *increasing* |
| nec_execution_head | block height | **L**, *increasing* |
| nec_sync_coupler | block height | **C**, *increasing* |
| nec_sync_dangling | block height | **D** |
| nec_sync_head | block height | **H**, *increasing* |
| nec_consensus_head | block height | **T**, *increasing* |
| | | |
| *Variable* | *Logic type* | *Short description* |
|:-----------------------------|:------------:|:---------------------|
| | | |
| nec_base | block height | **B**, *increasing* |
| nec_execution_head | block height | **L**, *increasing* |
| nec_sync_coupler | block height | **C**, *0 when idle* |
| nec_sync_dangling | block height | **D**, *0 when idle* |
| nec_sync_head | block height | **H**, *0 when idle* |
| nec_sync_consensus_head | block height | **T**, *increasing* |
| | | |
| nec_sync_header_lists_staged | size | # of staged header list records |
| nec_sync_headers_unprocessed | size | # of accumulated header block numbers|
| nec_sync_block_lists_staged | size | # of staged block list records |
1 change: 1 addition & 0 deletions nimbus/sync/beacon/worker.nim
@@ -157,6 +157,7 @@ proc runPool*(
## Note that this function does not run in `async` mode.
##
buddy.ctx.headersStagedReorg info
buddy.ctx.blocksStagedReorg info
true # stop


52 changes: 48 additions & 4 deletions nimbus/sync/beacon/worker/blocks_staged.nim
@@ -58,14 +58,20 @@ proc fetchAndCheck(
for n in 1u ..< ivReq.len:
let header = ctx.dbHeaderPeek(ivReq.minPt + n).valueOr:
# There is nothing one can do here
raiseAssert info & " stashed header missing: n=" & $n &
" ivReq=" & $ivReq & " nth=" & (ivReq.minPt + n).bnStr
info "Block header missing, requesting reorg", ivReq, n,
nth=(ivReq.minPt + n).bnStr
# So require reorg
ctx.poolMode = true
return false
blockHash[n - 1] = header.parentHash
blk.blocks[offset + n].header = header
blk.blocks[offset].header = ctx.dbHeaderPeek(ivReq.minPt).valueOr:
# There is nothing one can do here
raiseAssert info & " stashed header missing: n=0" &
" ivReq=" & $ivReq & " nth=" & ivReq.minPt.bnStr
info "Block header missing, requesting reorg", ivReq, n=0,
nth=ivReq.minPt.bnStr
# So require reorg
ctx.poolMode = true
return false
blockHash[ivReq.len - 1] =
rlp.encode(blk.blocks[offset + ivReq.len - 1].header).keccak256

@@ -195,6 +201,11 @@ proc blocksStagedCollect*(

# Fetch and extend staging record
if not await buddy.fetchAndCheck(ivReq, blk, info):
if ctx.poolMode:
# Reorg requested?
ctx.blocksUnprocCommit(iv.len, iv)
return false

haveError = true

# Throw away first time block fetch data. Keep other data for a
@@ -271,6 +282,11 @@ proc blocksStagedImport*(
block:
let imported = ctx.chain.latestNumber()
if imported + 1 < qItem.key:
# If there is a gap, the `FC` module data area might have been reset (or
# some problem occurred due to concurrent collection.) In any case, the
# missing block numbers are added to the range of blocks that need to be
# fetched.
ctx.blocksUnprocAmend(imported + 1, qItem.key - 1)
trace info & ": there is a gap L vs. staged",
B=ctx.chain.baseNumber.bnStr, L=imported.bnStr, staged=qItem.key.bnStr,
C=ctx.layout.coupler.bnStr
@@ -352,6 +368,34 @@ proc blocksStagedImport*(
head=ctx.chain.latestNumber.bnStr, target=ctx.layout.final.bnStr
return true


proc blocksStagedReorg*(ctx: BeaconCtxRef; info: static[string]) =
## Some pool mode intervention.
##
## One scenario is that some blocks do not have a matching header available.
## The main reason might be that the queue of block lists had a gap so that
## some blocks could not be imported. This in turn can happen when the `FC`
## module was reset (e.g. by `CL` via RPC.)
##
## A reset by `CL` via RPC would mostly happen if the syncer is near the
## top of the block chain anyway. So the safest way to re-org is to flush
## the block queues as there won't be much data cached, then.
##
if ctx.blk.staged.len == 0 and
ctx.blocksUnprocChunks() == 0:
# nothing to do
return

# Update counter
ctx.pool.nReorg.inc

# Reset block queues
trace info & ": Flushing Block queues", nUnproc=ctx.blocksUnprocTotal(),
nStaged=ctx.blk.staged.len

ctx.blocksUnprocClear()
ctx.blk.staged.clear()

# ------------------------------------------------------------------------------
# End
# ------------------------------------------------------------------------------
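
Below is a toy illustration of the flush-and-rebuild policy implemented by
`blocksStagedReorg` above. The `ToyCtx` type and `blocksReorg` helper are
hypothetical stand-ins for `BeaconCtxRef` and the real queue types; only the
control flow (bail out when idle, bump the reorg counter, flush both queues)
is meant to match.

```nim
import std/tables

type
  ToyCtx = object
    staged: Table[uint64, seq[string]]  # stand-in for the staged block lists
    unprocChunks: int                   # stand-in for the unprocessed ranges
    nReorg: int                         # reorg intervention counter

proc blocksReorg(ctx: var ToyCtx) =
  ## Flush the block queues instead of patching them in place.
  if ctx.staged.len == 0 and ctx.unprocChunks == 0:
    return                              # nothing to do
  inc ctx.nReorg
  ctx.staged.clear()                    # drop cached block lists
  ctx.unprocChunks = 0                  # drop pending ranges; they get re-derived

when isMainModule:
  var ctx = ToyCtx(unprocChunks: 3)
  ctx.staged[101u64] = @["blk-101", "blk-102"]
  ctx.blocksReorg()
  doAssert ctx.nReorg == 1 and ctx.staged.len == 0 and ctx.unprocChunks == 0
```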
12 changes: 8 additions & 4 deletions nimbus/sync/beacon/worker/blocks_unproc.nim
@@ -1,5 +1,5 @@
# Nimbus
# Copyright (c) 2023-2024 Status Research & Development GmbH
# Copyright (c) 2023-2025 Status Research & Development GmbH
# Licensed and distributed under either of
# * MIT license (license terms in the root directory or at
# https://opensource.org/licenses/MIT).
@@ -112,13 +112,17 @@ proc blocksUnprocClear*(ctx: BeaconCtxRef) =
ctx.blk.unprocessed.clear()
ctx.blk.borrowed = 0u

proc blocksUnprocSet*(ctx: BeaconCtxRef; minPt, maxPt: BlockNumber) =
## Set up new unprocessed range
ctx.blocksUnprocClear()
proc blocksUnprocAmend*(ctx: BeaconCtxRef; minPt, maxPt: BlockNumber) =
## Add some unprocessed range
# Argument `maxPt` would be internally adjusted to `max(minPt,maxPt)`
if minPt <= maxPt:
discard ctx.blk.unprocessed.merge(minPt, maxPt)

proc blocksUnprocSet*(ctx: BeaconCtxRef; minPt, maxPt: BlockNumber) =
## Set up new unprocessed range
ctx.blocksUnprocClear()
ctx.blocksUnprocAmend(minPt, maxPt)

# ------------------------------------------------------------------------------
# End
# ------------------------------------------------------------------------------
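
The refactor above splits range registration into `blocksUnprocAmend` (merge a
range) and `blocksUnprocSet` (clear, then amend). A toy sketch of that shape,
with a plain sequence standing in for the real interval set (`ToyRanges`,
`amend` and `resetTo` are illustrative names only):

```nim
type
  ToyRanges = object
    ranges: seq[(uint64, uint64)]       # stand-in for the interval set

proc amend(q: var ToyRanges; minPt, maxPt: uint64) =
  ## Add a range of still-unprocessed block numbers.
  if minPt <= maxPt:                    # skip empty/inverted ranges
    q.ranges.add((minPt, maxPt))

proc resetTo(q: var ToyRanges; minPt, maxPt: uint64) =
  ## Set up a new unprocessed range: clear first, then amend.
  q.ranges.setLen(0)
  q.amend(minPt, maxPt)

when isMainModule:
  var q: ToyRanges
  q.resetTo(1u64, 100u64)
  q.amend(200u64, 150u64)               # ignored: inverted range
  doAssert q.ranges == @[(1u64, 100u64)]
```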
37 changes: 28 additions & 9 deletions nimbus/sync/beacon/worker/update.nim
@@ -264,19 +264,38 @@ proc setupProcessingBlocks(ctx: BeaconCtxRef; info: static[string]) =

proc updateSyncState*(ctx: BeaconCtxRef; info: static[string]) =
## Update internal state when needed
let
prevState = ctx.layout.lastState # previous state
thisState = ctx.syncState info # currently observed state
let prevState = ctx.layout.lastState # previous state
var thisState = ctx.syncState info # currently observed state

if thisState == prevState:
# Check whether the system has been idle and a new header download
# session can be set up
if prevState == idleSyncState and
ctx.target.changed and # and there is a new target from CL
ctx.target.final != 0: # .. ditto
ctx.setupCollectingHeaders info # set up new header sync
return
# Notreached
case prevState:
of idleSyncState:
if ctx.target.changed and # and there is a new target from CL
ctx.target.final != 0: # .. ditto
ctx.setupCollectingHeaders info # set up new header sync
return
of processingBlocks:
if not ctx.blocksStagedQueueIsEmpty() or
0 < ctx.blocksUnprocChunks() or
0 < ctx.blocksUnprocBorrowed:
return
# Set to idle
debug info & ": blocks processing cancelled",
B=(if ctx.chain.baseNumber() == ctx.layout.coupler: "C"
else: ctx.chain.baseNumber().bnStr),
C=(if ctx.layout.coupler == ctx.chain.latestNumber(): "L"
else: ctx.layout.coupler.bnStr),
L=(if ctx.chain.latestNumber() == ctx.layout.dangling: "D"
else: ctx.chain.latestNumber().bnStr),
D=(if ctx.layout.dangling == ctx.layout.head: "H"
else: ctx.layout.dangling.bnStr),
H=ctx.layout.head.bnStr
thisState = idleSyncState
# proceed
else:
return

info "Sync state changed", prevState, thisState,
head=ctx.chain.latestNumber.bnStr,
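
A simplified sketch of the transitions handled above: idle only leaves for
header collection when a new target arrived from the `CL`, and
`processingBlocks` now falls back to idle once the block queues have drained,
e.g. after a cancellation. The `ToySyncState` enum and the boolean predicates
are stand-ins; the real code inspects the staged and unprocessed block queues
on the worker context.

```nim
type
  ToySyncState = enum
    idleSyncState, collectingHeaders, finishedHeaders, processingBlocks

func nextState(prev: ToySyncState; newTarget, queuesEmpty: bool): ToySyncState =
  ## Toy transition helper mirroring the two cases handled above.
  result = prev
  case prev
  of idleSyncState:
    if newTarget:
      result = collectingHeaders        # new sync target from the CL
  of processingBlocks:
    if queuesEmpty:
      result = idleSyncState            # block processing cancelled/finished
  else:
    discard

when isMainModule:
  doAssert nextState(processingBlocks, newTarget = false, queuesEmpty = true) ==
    idleSyncState
  doAssert nextState(idleSyncState, newTarget = true, queuesEmpty = false) ==
    collectingHeaders
```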
4 changes: 2 additions & 2 deletions nimbus/sync/beacon/worker/update/metrics.nim
@@ -33,7 +33,7 @@ declareGauge nec_sync_dangling, "" &
declareGauge nec_sync_head, "" &
"Ending/max block number of higher up headers chain"

declareGauge nec_consensus_head, "" &
declareGauge nec_sync_consensus_head, "" &
"Block number of sync target (would be consensus header)"


@@ -63,7 +63,7 @@ template updateMetricsImpl(ctx: BeaconCtxRef) =

# Show last valid state.
if 0 < ctx.target.consHead.number:
metrics.set(nec_consensus_head, ctx.target.consHead.number.int64)
metrics.set(nec_sync_consensus_head, ctx.target.consHead.number.int64)

metrics.set(nec_sync_header_lists_staged, ctx.headersStagedQueueLen())
metrics.set(nec_sync_headers_unprocessed,
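
For reference, a minimal sketch of the declare/update pattern affected by the
rename, assuming the `metrics` package used by the project (when compiled
without `-d:metrics` the calls are expected to reduce to no-ops):

```nim
import metrics

# Hypothetical standalone gauge mirroring the renamed one above.
declareGauge nec_sync_consensus_head,
  "Block number of sync target (would be consensus header)"

when isMainModule:
  # Update the gauge the same way the worker template does, with an int64 value.
  metrics.set(nec_sync_consensus_head, 21_000_000'i64)
```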
11 changes: 9 additions & 2 deletions nimbus/sync/beacon/worker/update/ticker.nim
@@ -48,6 +48,7 @@ type
nBlkStaged: int
blkStagedBottom: BlockNumber

state: SyncLayoutState
reorg: int
nBuddies: int

@@ -89,6 +90,7 @@ proc updater(ctx: BeaconCtxRef): TickerStats =
nBlkUnprocessed: ctx.blocksUnprocTotal() + ctx.blocksUnprocBorrowed(),
nBlkUnprocFragm: ctx.blocksUnprocChunks(),

state: ctx.layout.lastState,
reorg: ctx.pool.nReorg,
nBuddies: ctx.pool.nBuddies)

@@ -130,8 +132,13 @@ proc tickerLogger(t: TickerRef; ctx: BeaconCtxRef) =
else: data.blkUnprocBottom.bnStr & "(" &
data.nBlkUnprocessed.toSI & "," & $data.nBlkUnprocFragm & ")"

st = case data.state
of idleSyncState: "0"
of collectingHeaders: "h"
of finishedHeaders: "f"
of processingBlocks: "b"
rrg = data.reorg
peers = data.nBuddies
nP = data.nBuddies

# With `int64`, there are more than 29*10^10 years range for seconds
up = (now - t.started).seconds.uint64.toSI
@@ -140,7 +147,7 @@
t.lastStats = data
t.visited = now

debug "Sync state", up, peers, B, L, C, D, H, T, hS, hU, bS, bU, rrg, mem
debug "Sync state", up, nP, st, B, L, C, D, H, T, hS, hU, bS, bU, rrg, mem

# ------------------------------------------------------------------------------
# Public function
