
refactor: made attestation start from consensus genesis instead of the last sealed batch #2539

Merged (8 commits, Aug 1, 2024)

Conversation

@pompon0 (Contributor) commented Jul 30, 2024

Stable starting point prevents race conditions.

@aakoshh (Contributor) left a comment:

Just some conceptual questions.

core/lib/dal/src/consensus_dal.rs (review comment, outdated, resolved)
core/lib/dal/src/consensus/proto/mod.proto (review comment, resolved)
@pompon0 pompon0 requested a review from aakoshh July 31, 2024 11:45
aakoshh added a commit to matter-labs/era-consensus that referenced this pull request Jul 31, 2024
## What ❔

Poll the main node for which batch height to vote on.

- [x] Return an `Option` rather than a `Vec` of `BatchQC` from
`BatchVotes::find_quorum`: we decided not to implement voting on
multiple heights for now, which makes it just confusing that a vector is
returned.
- [x] Create an `AttestationStatusClient` trait with a method to poll
the next `BatchNumber` to vote on, which can be injected as a `dyn`
dependency
- [x] Create an `AttestationStatusWatch` that contains the next
`BatchNumber` to vote on; this is updated by a single background task
and listened to by multiple subscribers
- [x] Create an `AttestationStatusRunner` which is supposed to be
started along with the `BatchStore` in `zksync-era` and polls the
client; the point of this is to push the poll interval configuration
out to the edge and make the `AttestationStatusWatch` the integration
point (see the sketch after this list)
- [x] Change `AttesterRunner` to wait for the `AttestationStatusWatch`
to change, then the payload to appear, then cast the vote; it doesn't
need to worry about reverts because the node will restart if that
happens
- [x] Change `run_batch_qc_finder` to wait for the
`AttestationStatusWatch` to change, then wait for the next available QC,
and save it; this might produce gaps, which is normal on external
nodes but undesirable on the main node. On the main node it only
produces gaps if the attesters vote on higher heights despite the main
node telling them not to; if a majority of them are honest this should
not happen, and if they are not, there is little we could do anyway
- [x] Rename `PersistentBatchStore::earliest_batch_number_to_sign` to
`next_batch_to_attest` in accordance with the new method on
`consensus_dal`.
- [ ] ~~Initialise `BatchNumber::min_batch_number` by asking the main
node where to resume from~~ This initialisation isn't necessary because
1) we decided that we'll only allow 1 vote per attester, so there is no
attack vector of casting votes from batch 0, and 2) `run_batch_qc_finder`
first waits for the API status to appear and then prunes all preceding
data, so an older QC will not be saved. It was also undesirable to have
to initialise from an API that might return nothing (or errors) for an
unknown amount of time.
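
As a rough illustration of the client/watch/runner split described in the list above, here is a minimal sketch in Rust. It is a mock-up using `tokio` and `async-trait`, not the actual `era-consensus` code: the real types use the crate's own context and concurrency primitives, and all names and signatures below, other than those already mentioned in the list, are assumptions.

```rust
use std::{sync::Arc, time::Duration};

use async_trait::async_trait;
use tokio::sync::watch;

/// Stand-in for the real batch number type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BatchNumber(pub u64);

/// Injected as a `dyn` dependency: the main node can answer from its own
/// database, external nodes by calling the main node API.
#[async_trait]
pub trait AttestationStatusClient: Send + Sync {
    /// `None` means the status is not known yet.
    async fn next_batch_to_attest(&self) -> anyhow::Result<Option<BatchNumber>>;
}

/// Holds the latest "next batch to vote on": one writer, many subscribers.
pub struct AttestationStatusWatch {
    tx: watch::Sender<Option<BatchNumber>>,
}

impl AttestationStatusWatch {
    pub fn new() -> Arc<Self> {
        Arc::new(Self { tx: watch::channel(None).0 })
    }

    pub fn subscribe(&self) -> watch::Receiver<Option<BatchNumber>> {
        self.tx.subscribe()
    }

    pub fn update(&self, next: BatchNumber) {
        // Notify subscribers only when the value actually changes.
        self.tx.send_if_modified(|cur| {
            if *cur == Some(next) {
                false
            } else {
                *cur = Some(next);
                true
            }
        });
    }
}

/// Started alongside the batch store; polls the client at a fixed interval
/// and pushes updates into the watch, which is the integration point.
pub struct AttestationStatusRunner {
    pub status: Arc<AttestationStatusWatch>,
    pub client: Box<dyn AttestationStatusClient>,
    pub poll_interval: Duration,
}

impl AttestationStatusRunner {
    pub async fn run(self) -> anyhow::Result<()> {
        loop {
            // Errors are propagated here for brevity; a real runner would
            // more likely log and retry.
            if let Some(next) = self.client.next_batch_to_attest().await? {
                self.status.update(next);
            }
            tokio::time::sleep(self.poll_interval).await;
        }
    }
}
```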

### Poll interval

The `AttesterStatusRunner` polls the API at fixed intervals without any
exponential backoff. I thought 5s would be a reasonable default value.
With 60s batches this seems fine because the *next* batch to sign will
be signalled as soon as the current batch QC is inserted into the
database, which is well ahead of when the next batch and its
commitment will be available. If we think we'll need to catch up with
historical batch signatures it might be a bit slow. We generally still
have to account for gossiping the votes around, though, which in a large
enough network could take on the order of several seconds as well.
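
On the consumer side, both `AttesterRunner` and `run_batch_qc_finder` simply wait for the watched value to change rather than polling themselves. A hedged sketch of that waiting pattern, reusing the hypothetical types from the block above (the function name and signature are assumptions; the real code additionally handles cancellation through the crate's context):

```rust
/// Wait until the watch reports a batch number different from `after`.
async fn wait_for_next_batch(
    status: &AttestationStatusWatch,
    after: Option<BatchNumber>,
) -> anyhow::Result<BatchNumber> {
    let mut rx = status.subscribe();
    // Resolves as soon as the current or any later value satisfies the predicate.
    let value = rx
        .wait_for(|next| matches!(next, Some(n) if Some(*n) != after))
        .await?;
    match *value {
        Some(n) => Ok(n),
        None => unreachable!("the predicate only accepts Some"),
    }
}
```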

### Potential deadlock

There is a pathological sequence of events that could prevent the
feature from starting on a fresh database with no prior batch QCs:
1. Say we start from batch 0, and a QC is gathered, but saving it to the
database (which is an async process) takes a long time on the main node.
2. Say it takes so long that batch 1 and batch 2 are created, but none
of them have a QC yet.
3. Say we have two attesters A and B, and A polled the API when it
showed batch 2, and the main node set its minimum batch number in the
vote registry to batch 2 as well.
4. Now the QC for batch 0 is saved to the database and the API indicates
the next one to vote on is batch 1.
5. Attester B casts its vote on batch 1, but it is ignored by the main
node vote registry because it's looking for a quorum with at least batch
number 2.
6. Having missed an attestation, the main node will never save a QC and
never move on until it's restarted.

Note that even if the registry minimum batch number were adjusted _down_,
that might happen _after_ the vote for batch 1 has already been
discarded, and because new votes are only pushed once through gossip
there is no guarantee that the vote would be received again.

The solution is to fix the starting point of gossip at the consensus
genesis, even if that means filling in a potentially long history of
batches at the beginning. See
matter-labs/zksync-era#2539

## Why ❔

We want to collect batch QCs without gaps, and for now put the main node
in charge of what to vote on. The API to serve this information is in
matter-labs/zksync-era#2480 and this PR is the
follow up to make polling that information part of the signing process.
@pompon0 pompon0 added this pull request to the merge queue Aug 1, 2024
Merged via the queue into main with commit bac1082 Aug 1, 2024
47 checks passed
@pompon0 pompon0 deleted the gprusak-endpoint2 branch August 1, 2024 07:38
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
## What ❔

Injects an `AttestationStatusWatch` into the `Executor`. On the main
node it is backed by an `AttestationStatusRunner` polling
`BatchStore::next_batch_to_attest`, while on external nodes the runner
calls `MainNodeApi::fetch_attestation_status`.
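
A hedged sketch of what that wiring could look like, reusing the illustrative types from the sketch in the PR description above; the function below is hypothetical, and the actual constructors, error handling and `Executor` integration differ:

```rust
use std::{sync::Arc, time::Duration};

// Illustrative wiring only. `client` is whatever implements the hypothetical
// `AttestationStatusClient` trait from the earlier sketch: on the main node
// one backed by `BatchStore::next_batch_to_attest`, on external nodes one
// backed by `MainNodeApi::fetch_attestation_status`.
fn start_attestation_status(
    client: Box<dyn AttestationStatusClient>,
) -> Arc<AttestationStatusWatch> {
    let status = AttestationStatusWatch::new();
    let runner = AttestationStatusRunner {
        status: status.clone(),
        client,
        poll_interval: Duration::from_secs(5),
    };
    // Run the poller in the background; the watch itself is what gets
    // injected into the `Executor`.
    tokio::spawn(runner.run());
    status
}
```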

TODO: 
- [x] Rebase after #2539
is merged
- [x] Update the `era-consensus` dependency once
matter-labs/era-consensus#161 is merged and
`0.1.0-rc.5` is published

### Test failure - pruning the main node

The following test never finished:
```shell
zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture
```
A little extra logging revealed that the `AttesterStatusRunner` never
gets initialised, because the blocks get pruned before it can read the
result, i.e. `ConsensusDal::batch_of_block(genesis.first_block)`
probably returns `None` because the block is not found, and never will
be because it's pruned.

We discussed in #2480 that
the main node is not supposed to be pruned, which is why the SQL that
looks for what to attest on doesn't look for pruning records any more.
Yet this test prunes the main node, and if I just try to prune the
external node instead it panics because it doesn't have that block yet.

Either the SQL should take pruning into account after all, or we have to
figure out a way to wait in the test until the main node executor is
running, or change the test to prune the external node; I did the
latter.
 
## Why ❔

This coordinates attesters to cast their votes on the batch the main
node tells them to vote on, so that a quorum certificate is produced for
all L1 batches without gaps.

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [x] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.