-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: made attestation start from consensus genesis instead of the last sealed batch #2539
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
8 tasks
aakoshh
reviewed
Jul 30, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some conceptual questions.
6 tasks
aakoshh
added a commit
to matter-labs/era-consensus
that referenced
this pull request
Jul 31, 2024
## What ❔ Poll the main node for which batch height to vote on. - [x] Return an `Option` rather than a `Vec` of `BatchQC` from `BatchVotes::find_quorum`: we decided not to implement voting on multiple heights for now, which makes it just confusing that a vector is returned. - [x] Create an `AttestationStatusClient` trait with a method to poll the next `BatchNumber` to vote on, which can be injected as a `dyn` dependency - [x] Create an `AttestationStatusWatch` that contains the next `BatchNumber` to vote on; this is updated by a single background task and listened to by multiple subscribers - [x] Create an `AttestationStatusRunner` which is supposed to be started along with the `BatchStore` in `zksync-era` and polls the client; the point of this is to push out the poll interval configuration to the edge and make the `AttestationStatusWatch` the integration point - [x] Change `AttesterRunner` to wait for the `AttestationStatusWatch` to change, then the payload to appear, then cast the vote; it doesn't need to worry about reverts because the node will restart if that happens - [x] Change `run_batch_qc_finder` to wait for the `AttestationStatusWatch` to change, then wait for the next available QC, and save it; this might produce gaps which is normal on the external nodes, but undersireable on the main node - on the main node it only produces gaps if the attesters vote on higher heights despite the main node telling them not to, which if they are majority honest should not happen, and if they are not then what can we do anyway - [x] Rename `PersistentBatchStore::earliest_batch_number_to_sign` to `next_batch_to_attest` in accordance with the new method on `consensus_dal`. - [ ] ~~Initialise `BatchNumber::min_batch_number` to by asking the main node where to resume from~~ This initialisation isn't necessary because 1) we decided that we'll only allow 1 vote per attester, so there is no attack vector of casting votes from batch 0 and 2) `run_batch_qc_finder` first waits for the API status to appear and then prunes all preceding data, so an older QC will not be saved. It was also undesirable to have to initialise from an API that might return nothing (or errors) for an unknown amount of time. ### Poll interval The `AttesterStatusRunner` polls the API at fixed intervals without any exponential backoff. I thought 5s would be a reasonable default value. With 60s batches this seems fine because the *next* batch to sign will be signalled as soon as the current batch QC is inserted into the database, which is well ahead of time of when the next batch and its commitment will be available. If we think we'll need to catch up with historical batch signatures it might be a bit slow. We generally still have to account for gossiping around the votes though, which in a large enough network could be on the order of several seconds as well. ### Potential deadlock There is a pathological sequence of events that could prevent the feature from starting on a fresh database with no prior batch QCs: 1. Say we start from batch 0, and a QC is gathered, but saving it to the database (which is an async process) takes a long time on the main node. 2. Say it takes so long that batch 1 and batch 2 are created, but none of them have a QC yet. 3. Say we have two attesters A and B, and A polled the API when it showed batch 2, and the main node set its minimum batch number in the vote registry to batch 2 as well. 4. Now the QC for batch 0 is saved to the database and the API indicates the next one to vote on is batch 1. 5. Attester B casts its vote on batch 1 but it goes ignored by the main node vote registry because its looking for a quorum with at least batch number 2. 6. Having missed an attestation the main node will never save a QC and never move on until it's restarted. Note that even if the registry minimum batch number was adjusted _down_ that might happen _after_ the vote for batch 1 has already been discarded, and because new votes are only pushed once through gossip there is no guarantee that it will get it again. The solution is coming in the form of fixing the starting point of gossip to be where the genesis starts, even if that means filling a potentially long history of batches in the beginning. See matter-labs/zksync-era#2539 ## Why ❔ We want to collect batch QCs without gaps, and for now put the main node in charge of what to vote on. The API to serve this information is in matter-labs/zksync-era#2480 and this PR is the follow up to make polling that information part of the signing process.
aakoshh
approved these changes
Jul 31, 2024
github-merge-queue bot
pushed a commit
that referenced
this pull request
Aug 1, 2024
## What ❔ Injects an `AttestationStatusWatch` into the `Executor` which on the main node is backed by an `AttestationStatusRunner` polling the `BatchStore::next_batch_to_attest` and external nodes call the `MainNodeApi::fetch_attestation_status`. TODO: - [x] Rebase after #2539 is merged - [x] Update the `era-consensus` dependency once matter-labs/era-consensus#161 is merged and `0.1.0-rc.5` is published ### Test failure - pruning the main node The following test never finished: ```shell zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture ``` A little extra logging revealed that the `AttesterStatusRunner` never gets initialised, because the blocks get pruned earlier than it could read the result, ie. `ConsensusDal::batch_of_block(genesis.first_block)` probably returns `None` because it's not found, and never will because it's pruned. We discussed in #2480 that the main node is not supposed to be pruned, which is why the SQL that looks for what to attest on doesn't look for pruning records any more. Yet this test prunes the main node, and if I just try to prune the external node instead it panics because it doesn't have that block yet. Either the SQL should take pruning into account after all, or we have to figure out a way to wait in the test until the main node executor is running, or change the test to prune the external node; I did the latter. ## Why ❔ This is coordinating attesters to cast their votes on the batch which the main node tells them to do to produce a quorum certificate for all L1 batches without gaps. ## Checklist <!-- Check your PR fulfills the following items. --> <!-- For draft PRs check the boxes as you complete them. --> - [x] PR title corresponds to the body of PR (we generate changelog entries from PRs). - [x] Tests for the changes have been added / updated. - [x] Documentation comments have been added / updated. - [x] Code has been formatted via `zk fmt` and `zk lint`.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Stable starting point prevents race conditions.