Consider old finalized data available? #3141

Closed
wants to merge 2 commits into from

dapplion
Member

@dapplion dapplion commented Nov 29, 2022

When a node range syncs, it can't know the head state's finalized checkpoint. During range sync there are two scenarios:

Network is finalizing

According to the current p2p spec, blobs older than MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS are pruned and thus unavailable via p2p. The node must treat is_data_available == true for those blocks in order to sync to the head and not deadlock.

Network is not finalizing for longer than MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS

A full node can't guarantee that a block's blobsSidecar was available for un-finalized blocks. It must request the blobsSidecar from the network, which it can do since the sidecar is available according to the current p2p spec. However, this conflicts with the scenario above.


Optimistic sync again?

A node may take the finalized checkpoint in its peer's status as correct, range sync up to that point, and then request blobs for the remaining blocks since they are un-finalized. If a peer responds with no blobs for an epoch, the node can't differentiate between the peer withholding data (dishonest) and that epoch being finalized (honest).

To be sure that a blob at epoch < clock_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS is available, the node has to check that no epoch has been finalized up until the current chain tip. To assert that condition, all blocks up to the head must be processed and imported. However, the head is unsafe until all un-finalized blobsSidecars are imported.

The node could optimistically import all blocks until a past epoch is finalized, then mark those blocks as is_data_available == true if older than MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS. Otherwise, it attempts to request the blobsSidecar for each block.

Suggested sync sequence:

  1. Sync blocks from start_epoch until clock_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS
  2. Sync blocks + blobs from clock_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS until head
  3. Then, if head's finalized_epoch < clock_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS fill blobsSidecars in that range

To sum up: if the condition in step 3 is true, the head is unsafe until all blobs are proven to be available.
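As a rough illustration of the three steps above, the ranges could be computed as in the sketch below. The function names `plan_range_sync` and `blobs_to_backfill`, the tuple layout, and the window value of 4096 epochs are assumptions made for this example; they are not taken from the spec or from any client.

```python
from typing import List, Optional, Tuple

# Assumed value for illustration only; the thread quotes the window as ~18 days.
MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS = 4096


def plan_range_sync(start_epoch: int, clock_epoch: int, head_epoch: int) -> List[Tuple[str, int, int]]:
    """Steps 1 and 2: return (what_to_fetch, from_epoch, to_epoch) ranges."""
    boundary = max(start_epoch, clock_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS)
    return [
        ("blocks", start_epoch, boundary),           # step 1: blocks only
        ("blocks+blobs", boundary, head_epoch),      # step 2: blocks and blobs
    ]


def blobs_to_backfill(head_finalized_epoch: int, clock_epoch: int) -> Optional[Tuple[int, int]]:
    """Step 3: epoch range whose blobs still need fetching, or None."""
    boundary = max(0, clock_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS)
    if head_finalized_epoch < boundary:
        # The head is unsafe until blobs in this range are proven available.
        return (head_finalized_epoch, boundary)
    return None
```

With `clock_epoch = 10000`, a head that finalized epoch 5000 leaves the range `(5000, 5904)` to backfill, while a head that finalized epoch 9998 leaves nothing.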

Side-point, for nodes to be able to efficiently request that range, should they use by_root requests, or extend the serving range for by_range requests?

@hwwhww hwwhww added the Deneb was called: eip-4844 label Dec 1, 2022
@terencechain
Contributor

Suggested sync sequence:

Wouldn't /eth2/beacon_chain/req/status/ be helpful here? That way we can avoid backfilling blobs in the event FINALIZED_EPOCH is lower than CURRENT_EPOCH - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS, which avoids step 3 of your sequence.

More like:

  1. Get the FINALIZED_EPOCH from the peer. Calculate which is lower, FINALIZED_EPOCH or CURRENT_EPOCH - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS, and call that epoch N
  2. Sync blocks from START_EPOCH to N
  3. Sync blocks and blobs from N to HEAD_EPOCH

Would this work?

@realbigsean
Contributor

Wouldn't /eth2/beacon_chain/req/status/ be helpful here

The issue with this is that you would have to trust what your peers tell you for the finalized epoch. So you can use this during sync as a best guess, but if you get to the head and realize your peers were lying to you, you now need to reprocess the segment of the chain you thought was finalized but now know was not actually finalized, because these blocks now have an additional validity condition (is data available).
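A minimal sketch of the reprocessing cost described here, assuming the node recorded the finalized epoch its peers reported and only learns the real one at the head. `range_to_revalidate` and the 4096-epoch window are hypothetical names and values used for illustration only.

```python
from typing import Optional, Tuple


def range_to_revalidate(
    peer_finalized_epoch: int,
    actual_finalized_epoch: int,
    current_epoch: int,
    window: int = 4096,  # assumed value of MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS
) -> Optional[Tuple[int, int]]:
    """Return the epoch range that was synced without blobs but needed them."""
    window_start = max(0, current_epoch - window)
    # Epoch N as proposed above, computed from the untrusted peer value.
    assumed_n = min(peer_finalized_epoch, window_start)
    # The same boundary, computed from the finalized epoch found at the head.
    actual_n = min(actual_finalized_epoch, window_start)
    if actual_n < assumed_n:
        # Blocks in [actual_n, assumed_n) were imported without the
        # data-availability condition and must be reprocessed.
        return (actual_n, assumed_n)
    return None
```

If the peer was honest the function returns `None`; if it overstated finality, the returned range is exactly the segment that has to be validated again.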

@terencechain
Contributor

The issue with this is that you would have to trust what your peers tell you for the finalized epoch.

This is the same trust assumption as today, no? We are trusting peers to tell us the finalized epoch and to supply us with valid blocks up to it. So the only difference is that we are trusting peers to supply us with valid blocks and blobs.

@realbigsean
Contributor

We are trusting peers to tell us the finalized epoch and to supply us with valid blocks up to it.

Well, today we're able to tell if a block a peer gives us is valid regardless of what they tell us the finalized epoch is, right? The difference now is that if is_data_available is only checked on unfinalized blocks, block validity would become dependent on what peers tell us the finalized epoch is. I think this is something we can't have, hence the need for some sort of optimistic sync strategy like dapplion describes, where we sync to the head to find the finalized epoch ourselves and then use that information to go back and fully validate blocks.

@ethDreamer
Contributor

@realbigsean & I were talking about this proposal earlier and we thought there might be additional considerations that need to be discussed.

BACKGROUND

  1. After the 4844 fork we must sync both blocks and blobs
  2. All existing validity conditions for blocks remain in place, but 4844 introduces additional validity conditions on the blob. The protocol must ensure that all canonical blocks have passed both sets of validity conditions.
  3. Blobs for which blob.epoch() > min(finalized_epoch, current_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS) must be kept around and served to the network. It's probably easier for the purposes of this discussion if we just pretend the blob retention period is the range [finalized_epoch, current_epoch] (we can assume this without loss of generality).
  4. Combining points 2 & 3, a general rule for any post-4844 block emerges:
fn is_block_valid(block) {
    if block.is_finalized() {
        // finalized blocks DO NOT have any data availability considerations
        return block.is_valid_under_pre_4844_rules()
    } else {
        // unfinalized blocks DO have data availability considerations
        return block.is_valid_under_pre_4844_rules() && block.blob_is_available_and_valid()
    }
}

THE PROBLEM

Newly syncing nodes will only need to verify the pre-4844 block conditions until they come to an unfinalized block. At that point, they will need to also verify the blob validity conditions. But these nodes have no way of knowing whether or not a block is finalized until they import it (and other blocks built on top of it presumably).

This is where my understanding gets a bit fuzzy. I recall @paulhauner saying something about fork-choice poisoning attacks in the 4844 breakout session in Bogota when we were discussing whether or not we could decouple importing blocks and blobs. I've been trying to piece it together by looking at the optimistic sync spec. From what I've gathered, I think the following statement is correct @paulhauner?

When the validity of a block is dependent on external data, you cannot optimistically import the block and verify the availability & validity of the data later as you risk having your forkchoice poisoned by an invalid chain constructed by an attacker.

Nodes that are synced to the head of the chain are not susceptible to this attack because they will always enforce the blob validity conditions, but newly syncing nodes are. This seems very analogous to the problem in the original optimistic sync spec where we didn't want nodes to optimistically import the transition block until we were sure it was justified, again something we couldn't know before importing it.

In the merge transition block situation we used SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY to mitigate this. The optimistic sync spec made the following statement:

One can imagine mechanisms to check that a block is justified before importing it. For example, just keep processing blocks without adding them to fork choice. However, there are still edge-cases here (e.g., when to halt and declare there was no justification?) and how to mitigate implementation complexity. At this point, it's important to reflect on the attack and how likely it is to happen. It requires some rather contrived circumstances and it seems very unlikely to occur. Therefore, we need to consider if adding complexity to avoid an unlikely attack increases or decreases our total risk. Presently, it appears that SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY sits in a sweet spot for this trade-off.

But in 4844 it seems as if this situation could occur at any time. So the question becomes whether this situation forces us to develop these mechanisms that will allow us to determine if a block is finalized without importing it and risking poisoning.

@paulhauner
Contributor

A full node can't guarantee that a block's blobsSidecar was available for un-finalized blocks.

Why do we need to provide a guarantee that un-finalized blobs older than MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS are available? The consensus and execution layers don't need blobs to faithfully verify the state transition; the blob_kzg_commitments are enough to verify all transactions and operations in the chain.

My understanding is that we (will) set MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS such that it provides blob-consumers (e.g. L2s) enough time to ensure that all the participants in the network have had enough time to access the blobs. I don't see how non-finality impacts these assumptions.

@dapplion
Member Author

dapplion commented Dec 5, 2022

Why do we need to provide a guarantee that un-finalized blobs older than MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS are available?

If you have to import a new chain not yet seen before, how do you know that those blocks' blobs were ever available at any point? To provide full node safety guarantees you must ensure yourself that blobs have been available at some point. Or that's the assumption I'm working with here. Otherwise you may deadlock L2s into a bad, unavailable chain?

If the assumption above is correct, then you must download those blobs from p2p to prove to yourself that the data is available. So the network must retain and serve those blobs to allow any peer to converge on that chain.

@paulhauner
Contributor

To provide full node safety guarantees you must ensure yourself that blobs have been available at some point.

With this PR, we use the assumption that if >2/3rds of validators have attested to a chain then all payloads must have been available at some point.

Without this PR, we use the assumption that the canonical chain does not contain a consecutive streak of MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS epochs where:

  1. All blocks are produced by malicious/faulty, blob-hiding actors.
  2. All attestations in that range attesting to those blocks were produced by malicious/faulty actors.

In both scenarios we are never actually verifying that the entirety of the chain had available blobs. Rather, we're making assumptions about the availability of old blobs based on the behavior of other validators. I agree that with this PR we're operating under more reliable assumptions than without. However, this PR does bring with it theoretically unbounded blob storage and mandated optimistic sync.

Ultimately, the question here is whether or not L2s are comfortable under the assumption that there will always be a better chain to out-compete a malicious chain of MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS length. Personally, I would be comfortable "betting the farm" that a malicious actor can't control the chain for ~18 days (the current retention period). However, my comfort would be reduced as we reduced the length of that period.
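For reference, the ~18 day figure follows from mainnet slot timing (12-second slots, 32 slots per epoch), assuming MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS is 4096 epochs. That constant value is an assumption here, since the thread only quotes the duration, and the helper name `epochs_to_days` is illustrative.

```python
SECONDS_PER_SLOT = 12   # mainnet value
SLOTS_PER_EPOCH = 32    # mainnet value
SECONDS_PER_DAY = 86_400

# Assumed value, chosen because it matches the "~18 days" quoted in the thread.
MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS = 4096


def epochs_to_days(epochs: int) -> float:
    """Convert a number of epochs to wall-clock days."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / SECONDS_PER_DAY


retention_days = epochs_to_days(MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS)
print(f"{retention_days:.1f} days")
```

Halving the constant halves the period, which is the sense in which the comfort level above depends on the chosen value.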

So, in summary, I think the trade-off here is whether L2s would rather (a) wait for everyone to implement blob-optimistic sync or (b) live with the malicious chain assumption. (We must also consider whether or not opt sync and unbounded blob storage are safe/feasible for the protocol, but it would be useful to know what 4844 users are expecting.)

@dapplion
Member Author

dapplion commented Dec 5, 2022

I think the trade-off here is whether L2s would rather (a) wait for everyone to implement blob-optimistic sync or (b) live with the malicious chain assumption. (We must also consider whether or not opt sync and unbounded blob storage are safe/feasible for the protocol, but it would be useful to know what 4844 users are expecting.)

As an implementer I want optimistic sync as far away from me as possible. @protolambda can you comment on L2 needs?

@djrtwo
Contributor

djrtwo commented Dec 5, 2022

My understanding is that we (will) set MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS such that it provides blob-consumers (e.g. L2s) enough time to ensure that all the participants in the network have had enough time to access the blobs. I don't see how non-finality impacts these assumptions.

I agree with this generally. The worry would be if there is some sort of sustained fork, and you have to re-org to a different chain with unavailable blobs. That said, for there to be a chain of such depth, there have to be validators (and likely users) on that chain and thus the data would have been "made available" and is unlikely to have disappeared.

I think the failure mode is

  • majority of validators collude with a roll-up
  • validator majority has a hidden chain with an unavailable roll-up update at time T
  • hidden chain is revealed after time T + 18 days
  • validators and users jump over to the majority chain and don't validate the availability of the roll-up update at time T because it is outside of the 18 day window
  • validators finalize an unavailable roll-up update (allowing for invalid state transitions in opti-rollups or marooned funds in any type of roll-up)

The thing about 18 day time horizons of a full partition (chain A is hidden from chain B) is two-fold. (1) On this order, the non-hidden chain (depending on the split) is in the realm of finalizing and (2) 18 days is on the order of "we can fix any problem manually in this length of time"

My gut is that, due to these two considerations, having the pruning window simply be 18 days rather than the greater of (18_days, time_since_latest_finalized) provides essentially the same guarantee we expect/require.


As for the consideration here to just consider anything past finalized as available, this strictly puts more power in the hands of a malicious majority validator set by putting a much much tighter bound on the online assumptions for full nodes. In the event that a node is offline for > 2 epochs and a malicious validator set finalizes unavailable data, the node would be able to be tricked into following an unavailable chain.

Similarly this truncates the onlinedness requirement to any sort of L2 policing node or node otherwise trying to get the data due to the p2p also truncating the serving.

The DA pruning period being on the order of the (desired) WS period length, the leak-to-majority period, and the "we can fix anything in this time frame" period, and the max optimistic roll-up fraud proof period (planned today) ensures that we don't introduce tighter onlinedness requirements to fully verify the chain than we have today. I'd be a strong no on making this as tight as latest-finalized

@ethDreamer
Contributor

As for the consideration here to just consider anything past finalized as available

Just want to note, I wasn't actually suggesting that we consider anything finalized available. I should've been more explicit about this, but I was operating under the assumption that the forkchoice poisoning issues I brought up would only occur in the context of a chain that hadn't finalized within the DA pruning period. In such a case, we are already enforcing the blobs are valid and available for the entire DA pruning period.

Hence why I said:

It's probably easier for the purposes of this discussion if we just pretend the blob retention period is the range [finalized_epoch, current_epoch] (we can assume this without loss of generality).

@realbigsean
Contributor

realbigsean commented Dec 5, 2022

Isn't the most realistic failure mode that 51% of nodes are lazily validating blobs? In this case the chain is split by any proposer who withholds a blob, and a lazy validator has no incentive to rejoin the correct chain: they just wait 18 days and their chain becomes correct. Tying blob availability checks to finalization would give lazy validators an incentive to join the correct chain.

@terencechain
Contributor

terencechain commented Dec 5, 2022

Isn't the most realistic failure mode that 51% of nodes are lazily validating blobs?

Validators can also lazily validate the chain; do blobs weaken the assumption that much?

As for the optimistic rollup, I can speak for Arbitrum. What matters most is the blob retention period is greater than the fraud-proof challenge period. The fraud-proof challenge period is designed to be long enough so L2 validators can participate in challenges under the censorship threat model. The challenge period is 7 days and the blobs retention period is 18 days. I think we'll be fine.

The ultimate question is whether we should be taking min(finalized_epoch, max(GENESIS_EPOCH, current_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS)) or just max(GENESIS_EPOCH, current_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS). I agree with @paulhauner and @djrtwo's comments and I think the additional complexity is just not worth it.
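To make the two candidate boundaries concrete, here is a small sketch that evaluates both expressions. The function names and the constant value of 4096 are assumptions for illustration; only the two formulas themselves come from the discussion.

```python
GENESIS_EPOCH = 0
# Assumed value for illustration only.
MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS = 4096


def boundary_fixed_window(current_epoch: int) -> int:
    """Earliest epoch with required blobs when finality is ignored."""
    return max(GENESIS_EPOCH, current_epoch - MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS)


def boundary_with_finality(finalized_epoch: int, current_epoch: int) -> int:
    """Earliest epoch with required blobs when the window also reaches back to finality."""
    return min(finalized_epoch, boundary_fixed_window(current_epoch))
```

While the chain is finalizing, both give the same answer (for `current_epoch = 10000` and `finalized_epoch = 9998`, both return 5904). They only diverge when finality is older than the fixed window, for example `finalized_epoch = 3000`, where the second form returns 3000 and the amount of retained data is no longer bounded by the constant.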

@realbigsean
Contributor

Validators can also lazily validate the chain; do blobs weaken the assumption that much?

If block validity changes after 18 days (which it would by having the is_data_available check expire), then something previously invalid becomes valid, so why would a lazy validator switch to the chain where it's being penalized? If they wait 18 days their chain becomes valid. Whereas today, the invalid chain always remains invalid so lazy validators would have to either switch to correct chain or bleed out.

@djrtwo
Contributor

djrtwo commented Dec 5, 2022

so why would a lazy validator switch to the chain where it's being penalized?

Because users also don't follow unavailable chains. So if a lazy validator is on an unavailable chain because they were lazy, users/explorers/infra/exchanges/etc. that do proper DA checks won't be on such a chain. That is, Ethereum (and the community it supports) would not be in this false reality. Which is quite the incentive for a lazy validator to get back onto the actually available chain -- before social intervention or the available chain finalizing due to inactivity leak.

EDIT: To be clear, places where a validator can be lazy are security issues and can/should be patched. If an attacker % plus lazy % is greater than 50% that's definitely bad -- even worse at 2/3. But fully validating nodes cannot be tricked and would not follow such chains. So the direct incentive to not be lazy is not there, but the second order incentive that you will not be on the actual users' chain is there. Proof-of-custody and proof-of-execution are both very important security upgrades to prioritize in the next few years.

@djrtwo
Contributor

djrtwo commented Dec 6, 2022

An alternative here is to prune at the 18 day depth and to not consider blocks past that depth as available (unless you previously validated it yourself). This would avoid an automatic chain re-org in the attack scenarios we discussed, and instead would require social intervention if you really wanted to jump back.

Such a path

  • avoids the complexity of the variable prune depth
  • forces social intervention in the event of crazy scenarios that aren't (a) resolved via leak or (b) dealt with within 18 days (which in most scenarios we expect them to be)
  • defends by default against the L2+majority-collusion attack, with manual intervention being the only way to jump back
  • [relatively large downside] in the most normal case of this happening -- a small subset of the network is misconfigured/disconnected for 18+ days -- the user would get stuck and have to manually intervene rather than being able to just catch up

This essentially puts us in a de facto local finalization at the prune depth. This is something we kind of accept today with the quadratic leak. One thing to consider is that if we kept this as the logic, greatly reducing the prune depth would change the "de facto finality period", which I wouldn't be comfortable making much larger than the quadratic leak to majority period.
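One way to picture this alternative is the sketch below: a block older than the prune depth counts as available only if this node validated its blobs itself at some earlier point. Everything named here (`is_data_available_local`, `validated_roots`, `fetch_and_verify_blobs`, the 4096-epoch window) is hypothetical and chosen for illustration; it is not the spec's `is_data_available`.

```python
from typing import Callable, Set


def is_data_available_local(
    block_root: bytes,
    block_epoch: int,
    current_epoch: int,
    validated_roots: Set[bytes],
    fetch_and_verify_blobs: Callable[[bytes], bool],
    window: int = 4096,  # assumed prune depth in epochs (~18 days)
) -> bool:
    # Blobs this node already checked stay "available" for it forever.
    if block_root in validated_roots:
        return True
    # Past the prune depth the blobs can no longer be fetched from p2p, so
    # an unseen block is NOT assumed available; jumping to such a chain
    # would require manual (social) intervention.
    if block_epoch < current_epoch - window:
        return False
    # Inside the window the node can still fetch and verify the sidecar.
    if fetch_and_verify_blobs(block_root):
        validated_roots.add(block_root)
        return True
    return False
```

The second branch is what produces the de facto local finalization: a competing chain that forks before the prune depth can never pass the check automatically.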

@dapplion
Member Author

dapplion commented Dec 8, 2022

@djrtwo's suggestion is sensible with the current 18 day depth. Can we accept the approach as sufficient for now, and revisit the topic once there's a need / interest to shorten the depth? Essentially kicking the can down the road, with the goal of not complicating the early-2023 version of eip-4844.

@mkalinin
Contributor

mkalinin commented Dec 8, 2022

I am echoing @djrtwo. In the edge case when the latest finalized checkpoint is more than 18 days old, one can use a light version of social consensus to bootstrap a node, i.e. ask friends, the EF, or whomever one trusts for a state from within the 18 day period and bootstrap a node with it, assuming that the trusted party has observed no DA issues for that [now - x_days; now - 18_days] period. All CL clients currently support bootstrapping with an arbitrary state.

@realbigsean
Contributor

Yes, so the relationship with the quadratic inactivity leak does make me generally much more on board with just having MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS completely independent of finalization; I hadn't really considered its role here. Increasing MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS so that we can more heavily rely on the quadratic leak seems like it might even be reasonable (I know we just decreased it 😅). I think right now if 50% of stake is offline, it'd take something like 22 days for the chain to finalize; can anyone more familiar with the leak check me here?

As far as whether to consider unfinalized blocks older than MIN_EPOCHS_FOR_BLOBS_SIDECARS_REQUESTS available or not, I think forcing a social recovery in this very bad case is a nice feature but maybe isn't totally necessary if this makes social recovery more difficult.

@djrtwo
Contributor

djrtwo commented Dec 12, 2022

I want to echo that a semi-major UX degradation here is that if you are offline for the prune window period, you wouldn't be able to sync to head anymore without manual intervention due to being behind where you can evaluate DA from the p2p network. It also has implications if a node wants to sync from genesis without providing a recent finalized root.

This isn't a crazy departure from the security model -- at those depths you are at risk to long range attacks without bringing in a recent piece of data from the network out of band, but currently, I would imagine most or all clients would still sync to the head (and usually be fine). So it doesn't change the security model but does change the practical UX if DA at such depths is enforced strictly. There are maybe spectrums of enforcement to balance the UX -- e.g. don't reorg to a chain you can't check DA of but you're allowed to extend the chain you already know of until you get into the DA window.
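The softer enforcement mentioned at the end could look roughly like the sketch below. This is only one possible reading of that sentence; `import_decision`, its string results, and the 4096-epoch window are hypothetical and do not come from the spec or any client.

```python
def import_decision(
    block_epoch: int,
    current_epoch: int,
    extends_known_chain: bool,
    window: int = 4096,  # assumed DA window in epochs
) -> str:
    """Decide how to treat a block under a relaxed DA enforcement policy."""
    if block_epoch >= max(0, current_epoch - window):
        # Inside the DA window: blobs can be fetched, so check them.
        return "check_da"
    if extends_known_chain:
        # Too old to check, but it only extends the chain the node
        # already follows, so let it catch up towards the DA window.
        return "import_without_da"
    # Too old to check and it would re-org the node onto another chain.
    return "reject_until_manual_intervention"
```

Under this policy a node that was offline for longer than the window can still catch up on its own chain, while a re-org onto an uncheckable chain still needs a human in the loop.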

@MicahZoltu
Contributor

you wouldn't be able to sync to head anymore without manual intervention due to being behind where you can evaluate DA from the p2p network. It also has implications if a node wants to sync from genesis without providing a recent finalized root.

This sounds like exactly how it should work. Users should not sync from genesis without a recent root of trust, and users should not have their nodes blindly follow the validators after being offline for an extended period of time.

I think if we want to address this, we should do so in a way that doesn't compromise individual node operator security, via things like encouraging trusted root source lists that can automatically be compared against in these situations, with the system failing (until the user intervenes) if they ever disagree.

@arnetheduck
Contributor

Were there any more thoughts on anchoring the retention period in finality instead? The only downside was that a few more weeks of data must be kept in the case of non-finality, i.e. up to 3.

@MicahZoltu
Contributor

the only downside was that a few more weeks of data must be kept in the case of non-finality, i.e. up to 3.

Technically non-finality can occur until all ETH has been burned. We certainly hope that the real world worst case finalization failure gets resolved after the inactivity leak. I don't have a strong argument against attaching data availability to finality, but I think that it should be made clear and be well understood that we are changing the upper bound on disk utilization from a hard limit to a soft/economic limit.

@dankrad
Contributor

dankrad commented Dec 14, 2022

An alternative here is to prune at the 18 day depth and to not consider blocks past that depth as available (unless you previously validated it yourself). This would avoid an automatic chain re-org in the attack scenarios we discussed, and instead would require social intervention if you really wanted to jump back.

I also think this solution is the best.

Were there any more thoughts on anchoring the retention period in finality instead? The only downside was that a few more weeks of data must be kept in the case of non-finality, i.e. up to 3.

The downside to this is that it means that a finalized chain can re-org an unfinalized chain, even if it has unavailable data. Consider the following scenario:

  • Majority (>2/3) attacker starts building a hidden chain on day 0, withholds this chain
  • Minority chain (<1/3) continues being built
  • After 19 days, attacker releases attack chain, without the blobs for day 0
  • All other nodes accept this chain despite the missing data (because it is outside the window for this chain)

@djrtwo
Contributor

djrtwo commented Dec 19, 2022

closing in favor of #3169

@djrtwo djrtwo closed this Dec 19, 2022
@dapplion dapplion deleted the eip4844-prunning branch December 22, 2022 02:20
emhane added a commit to emhane/consensus-specs that referenced this pull request Jan 14, 2023
This issue is closed with conclusion that finalization should be ignored when considering blobs.
ethereum#3141