
refactor: made attestation start from consensus genesis instead of the last sealed batch #2539

Merged (8 commits, Aug 1, 2024)

Conversation

@pompon0 (Contributor) commented Jul 30, 2024

Stable starting point prevents race conditions.

@aakoshh (Contributor) left a comment:

Just some conceptual questions.

core/lib/dal/src/consensus_dal.rs (review comment, outdated, resolved)
core/lib/dal/src/consensus/proto/mod.proto (review comment, resolved)
@pompon0 pompon0 requested a review from aakoshh July 31, 2024 11:45
aakoshh added a commit to matter-labs/era-consensus that referenced this pull request Jul 31, 2024
## What ❔

Poll the main node for which batch height to vote on.

- [x] Return an `Option` rather than a `Vec` of `BatchQC` from
`BatchVotes::find_quorum`: we decided not to implement voting on
multiple heights for now, which makes it just confusing that a vector is
returned.
- [x] Create an `AttestationStatusClient` trait with a method to poll
the next `BatchNumber` to vote on, which can be injected as a `dyn`
dependency
- [x] Create an `AttestationStatusWatch` that contains the next
`BatchNumber` to vote on; this is updated by a single background task
and listened to by multiple subscribers
- [x] Create an `AttestationStatusRunner` which is supposed to be
started along with the `BatchStore` in `zksync-era` and polls the
client; the point of this is to push the poll interval configuration
out to the edge and make the `AttestationStatusWatch` the integration
point (see the sketch after this list)
- [x] Change `AttesterRunner` to wait for the `AttestationStatusWatch`
to change, then the payload to appear, then cast the vote; it doesn't
need to worry about reverts because the node will restart if that
happens
- [x] Change `run_batch_qc_finder` to wait for the
`AttestationStatusWatch` to change, then wait for the next available QC,
and save it; this might produce gaps, which is normal on external
nodes but undesirable on the main node. On the main node it only
produces gaps if the attesters vote on higher heights despite the main
node telling them not to; if a majority of them are honest this should
not happen, and if they are not, there is little we could do anyway
- [x] Rename `PersistentBatchStore::earliest_batch_number_to_sign` to
`next_batch_to_attest` in accordance with the new method on
`consensus_dal`.
- [ ] ~~Initialise `BatchNumber::min_batch_number` by asking the main
node where to resume from~~ This initialisation isn't necessary because
1) we decided that we'll only allow 1 vote per attester, so there is no
attack vector of casting votes from batch 0, and 2) `run_batch_qc_finder`
first waits for the API status to appear and then prunes all preceding
data, so an older QC will not be saved. It was also undesirable to have
to initialise from an API that might return nothing (or errors) for an
unknown amount of time.
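
As a rough illustration of the client/watch/runner split described in the list above, here is a minimal sketch in Rust. It is a mock-up using `tokio` and `async-trait`, not the actual `era-consensus` code: the real types use the crate's own context and concurrency primitives, and all names and signatures below, other than those already mentioned in the list, are assumptions.

```rust
use std::{sync::Arc, time::Duration};

use async_trait::async_trait;
use tokio::sync::watch;

/// Stand-in for the real batch number type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BatchNumber(pub u64);

/// Injected as a `dyn` dependency: the main node can answer from its own
/// database, external nodes by calling the main node API.
#[async_trait]
pub trait AttestationStatusClient: Send + Sync {
    /// `None` means the status is not known yet.
    async fn next_batch_to_attest(&self) -> anyhow::Result<Option<BatchNumber>>;
}

/// Holds the latest "next batch to vote on": one writer, many subscribers.
pub struct AttestationStatusWatch {
    tx: watch::Sender<Option<BatchNumber>>,
}

impl AttestationStatusWatch {
    pub fn new() -> Arc<Self> {
        Arc::new(Self { tx: watch::channel(None).0 })
    }

    pub fn subscribe(&self) -> watch::Receiver<Option<BatchNumber>> {
        self.tx.subscribe()
    }

    pub fn update(&self, next: BatchNumber) {
        // Notify subscribers only when the value actually changes.
        self.tx.send_if_modified(|cur| {
            if *cur == Some(next) {
                false
            } else {
                *cur = Some(next);
                true
            }
        });
    }
}

/// Started alongside the batch store; polls the client at a fixed interval
/// and pushes updates into the watch, which is the integration point.
pub struct AttestationStatusRunner {
    pub status: Arc<AttestationStatusWatch>,
    pub client: Box<dyn AttestationStatusClient>,
    pub poll_interval: Duration,
}

impl AttestationStatusRunner {
    pub async fn run(self) -> anyhow::Result<()> {
        loop {
            // Errors are propagated here for brevity; a real runner would
            // more likely log and retry.
            if let Some(next) = self.client.next_batch_to_attest().await? {
                self.status.update(next);
            }
            tokio::time::sleep(self.poll_interval).await;
        }
    }
}
```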

### Poll interval

The `AttesterStatusRunner` polls the API at fixed intervals without any
exponential backoff. I thought 5s would be a reasonable default value.
With 60s batches this seems fine because the *next* batch to sign will
be signalled as soon as the current batch QC is inserted into the
database, which is well ahead of when the next batch and its
commitment will be available. If we think we'll need to catch up with
historical batch signatures it might be a bit slow. We generally still
have to account for gossiping the votes around, though, which in a large
enough network could take on the order of several seconds as well.
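
On the consumer side, both `AttesterRunner` and `run_batch_qc_finder` simply wait for the watched value to change rather than polling themselves. A hedged sketch of that waiting pattern, reusing the hypothetical types from the block above (the function name and signature are assumptions; the real code additionally handles cancellation through the crate's context):

```rust
/// Wait until the watch reports a batch number different from `after`.
async fn wait_for_next_batch(
    status: &AttestationStatusWatch,
    after: Option<BatchNumber>,
) -> anyhow::Result<BatchNumber> {
    let mut rx = status.subscribe();
    // Resolves as soon as the current or any later value satisfies the predicate.
    let value = rx
        .wait_for(|next| matches!(next, Some(n) if Some(*n) != after))
        .await?;
    match *value {
        Some(n) => Ok(n),
        None => unreachable!("the predicate only accepts Some"),
    }
}
```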

### Potential deadlock

There is a pathological sequence of events that could prevent the
feature from starting on a fresh database with no prior batch QCs:
1. Say we start from batch 0, and a QC is gathered, but saving it to the
database (which is an async process) takes a long time on the main node.
2. Say it takes so long that batch 1 and batch 2 are created, but none
of them have a QC yet.
3. Say we have two attesters A and B, and A polled the API when it
showed batch 2, and the main node set its minimum batch number in the
vote registry to batch 2 as well.
4. Now the QC for batch 0 is saved to the database and the API indicates
the next one to vote on is batch 1.
5. Attester B casts its vote on batch 1, but it is ignored by the main
node vote registry because it's looking for a quorum with at least batch
number 2.
6. Having missed an attestation, the main node will never save a QC and
never move on until it's restarted.

Note that even if the registry minimum batch number were adjusted _down_,
that might happen _after_ the vote for batch 1 has already been
discarded, and because new votes are only pushed once through gossip
there is no guarantee that the vote would be received again.

The solution is to fix the starting point of gossip at the consensus
genesis, even if that means filling in a potentially long history of
batches at the beginning. See
matter-labs/zksync-era#2539

## Why ❔

We want to collect batch QCs without gaps, and for now put the main node
in charge of what to vote on. The API to serve this information is in
matter-labs/zksync-era#2480 and this PR is the
follow up to make polling that information part of the signing process.
@pompon0 pompon0 added this pull request to the merge queue Aug 1, 2024
Merged via the queue into main with commit bac1082 Aug 1, 2024
47 checks passed
@pompon0 pompon0 deleted the gprusak-endpoint2 branch August 1, 2024 07:38
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
## What ❔

Injects an `AttestationStatusWatch` into the `Executor`. On the main
node it is backed by an `AttestationStatusRunner` polling
`BatchStore::next_batch_to_attest`, while on external nodes the runner
calls `MainNodeApi::fetch_attestation_status`.
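
A hedged sketch of what that wiring could look like, reusing the illustrative types from the sketch in the PR description above; the function below is hypothetical, and the actual constructors, error handling and `Executor` integration differ:

```rust
use std::{sync::Arc, time::Duration};

// Illustrative wiring only. `client` is whatever implements the hypothetical
// `AttestationStatusClient` trait from the earlier sketch: on the main node
// one backed by `BatchStore::next_batch_to_attest`, on external nodes one
// backed by `MainNodeApi::fetch_attestation_status`.
fn start_attestation_status(
    client: Box<dyn AttestationStatusClient>,
) -> Arc<AttestationStatusWatch> {
    let status = AttestationStatusWatch::new();
    let runner = AttestationStatusRunner {
        status: status.clone(),
        client,
        poll_interval: Duration::from_secs(5),
    };
    // Run the poller in the background; the watch itself is what gets
    // injected into the `Executor`.
    tokio::spawn(runner.run());
    status
}
```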

TODO: 
- [x] Rebase after #2539
is merged
- [x] Update the `era-consensus` dependency once
matter-labs/era-consensus#161 is merged and
`0.1.0-rc.5` is published

### Test failure - pruning the main node

The following test never finished:
```shell
zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture
```
A little extra logging revealed that the `AttesterStatusRunner` never
gets initialised, because the blocks get pruned before it can read the
result, i.e. `ConsensusDal::batch_of_block(genesis.first_block)`
probably returns `None` because the block is not found, and never will
be because it's pruned.

We discussed in #2480 that
the main node is not supposed to be pruned, which is why the SQL that
looks for what to attest on doesn't look for pruning records any more.
Yet this test prunes the main node, and if I just try to prune the
external node instead it panics because it doesn't have that block yet.

Either the SQL should take pruning into account after all, or we have to
figure out a way to wait in the test until the main node executor is
running, or change the test to prune the external node; I did the
latter.
 
## Why ❔

This coordinates attesters to cast their votes on the batch the main
node tells them to vote on, so that a quorum certificate is produced for
all L1 batches without gaps.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.