Handle processing results of non faulty batches (#3439)
## Issue Addressed
Solves #3390 

After checking some logs @pawanjay176 obtained, we concluded that this happened because we blacklisted a chain after retrying it "too much". In every occurrence, "too much" meant too many download failures. These accumulated very slowly, precisely because a batch is allowed to stay alive for a long time when penalties are not counted while the execution engine (EE) is offline. So the real problem was not that batches failed due to offline-EE errors, but that we blacklisted the chain over download errors, which should be pinned on the individual peer rather than on the chain. This PR fixes that.

## Proposed Changes

Adds a missing piece of logic so that a chain is not blacklisted when it fails for errors that can't be attributed to objectively bad behavior from a peer. The issue at hand occurred when new peers arrived claiming a head that had been wrongfully blacklisted, even though the original peers participating in the chain were not penalized.

Another notable change is that we need to consider a batch invalid if it processed correctly but its next non-empty batch fails processing. Since a batch can now fail processing in non-faulty ways, there is no need to mark previous batches as invalid in those cases.

Improves some logging as well.
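The shape of the change is visible in the diff below: the old `BatchProcessResult::Failed { peer_action, mode, .. }` is split into `FaultyFailure` and `NonFaultyFailure` variants, chosen by whether the error carries a peer penalty. A minimal, self-contained sketch of that mapping (`PeerAction` is simplified here to a single variant; the real enum in lighthouse has more):

```rust
// Simplified stand-in for lighthouse's PeerAction.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PeerAction {
    LowToleranceError,
}

// Sketch of the result type introduced by this PR (variant and field
// names taken from the diff below).
#[derive(Debug, PartialEq)]
enum BatchProcessResult {
    Success { was_non_empty: bool },
    FaultyFailure { imported_blocks: bool, penalty: PeerAction },
    NonFaultyFailure,
}

// Mirrors the match in sync_methods.rs: a failure is only "faulty"
// (and thus penalizable) when there is a concrete peer action.
fn classify(imported_blocks: usize, peer_action: Option<PeerAction>) -> BatchProcessResult {
    match peer_action {
        Some(penalty) => BatchProcessResult::FaultyFailure {
            imported_blocks: imported_blocks > 0,
            penalty,
        },
        None => BatchProcessResult::NonFaultyFailure,
    }
}

fn main() {
    // An internal error (e.g. EE offline) carries no peer action -> non-faulty.
    assert_eq!(classify(0, None), BatchProcessResult::NonFaultyFailure);
    // A bad-block error penalizes the peer -> faulty.
    assert!(matches!(
        classify(3, Some(PeerAction::LowToleranceError)),
        BatchProcessResult::FaultyFailure { imported_blocks: true, .. }
    ));
}
```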

## Additional Info

We should do this regardless of whether we pause sync when the EE is offline or unsynced. I think it's almost impossible to guarantee that a processing result arrives in a predictable order relative to a synced notification from the EE, so this change handles what I believe are inevitable data races once we actually pause sync.

This also fixes a return value reporting which batch failed, which had caused us some confusion when checking the logs.
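On the sync side, the consequence is that only faulty failures should count toward removing (blacklisting) a chain. The following is a hypothetical sketch of that decision, not lighthouse's actual API: `SyncingChain`, `on_batch_result`, `faulty_attempts`, and `MAX_BATCH_PROCESSING_ATTEMPTS` are illustrative names, and the penalty is simplified to a score.

```rust
// Hypothetical sketch: how a syncing chain could react to the new result
// type. All names here are illustrative, not lighthouse's actual API.

const MAX_BATCH_PROCESSING_ATTEMPTS: u8 = 3;

#[allow(dead_code)]
enum BatchProcessResult {
    Success { was_non_empty: bool },
    FaultyFailure { imported_blocks: bool, penalty: u8 }, // penalty simplified to a score
    NonFaultyFailure,
}

enum ChainAction {
    Continue,
    PenalizeAndRetry { penalty: u8 },
    RemoveChain, // only reached via faulty failures
}

struct SyncingChain {
    faulty_attempts: u8,
}

impl SyncingChain {
    fn on_batch_result(&mut self, result: BatchProcessResult) -> ChainAction {
        match result {
            BatchProcessResult::Success { .. } => ChainAction::Continue,
            // Non-faulty failures (e.g. EE offline) are retried without
            // counting against the chain, so the chain is never blacklisted
            // for errors that are not the peers' fault.
            BatchProcessResult::NonFaultyFailure => ChainAction::Continue,
            BatchProcessResult::FaultyFailure { penalty, .. } => {
                self.faulty_attempts += 1;
                if self.faulty_attempts >= MAX_BATCH_PROCESSING_ATTEMPTS {
                    ChainAction::RemoveChain
                } else {
                    ChainAction::PenalizeAndRetry { penalty }
                }
            }
        }
    }
}

fn main() {
    let mut chain = SyncingChain { faulty_attempts: 0 };
    // Arbitrarily many non-faulty failures never blacklist the chain.
    for _ in 0..100 {
        assert!(matches!(
            chain.on_batch_result(BatchProcessResult::NonFaultyFailure),
            ChainAction::Continue
        ));
    }
    assert_eq!(chain.faulty_attempts, 0);
}
```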
divagant-martian committed Aug 12, 2022
1 parent a476ae4 commit f4ffa9e
Showing 12 changed files with 298 additions and 274 deletions.
4 changes: 1 addition & 3 deletions beacon_node/network/src/beacon_processor/mod.rs
@@ -76,9 +76,7 @@ mod work_reprocessing_queue;
mod worker;

use crate::beacon_processor::work_reprocessing_queue::QueuedGossipBlock;
pub use worker::{
ChainSegmentProcessId, FailureMode, GossipAggregatePackage, GossipAttestationPackage,
};
pub use worker::{ChainSegmentProcessId, GossipAggregatePackage, GossipAttestationPackage};

/// The maximum size of the channel for work events to the `BeaconProcessor`.
///
2 changes: 1 addition & 1 deletion beacon_node/network/src/beacon_processor/worker/mod.rs
@@ -10,7 +10,7 @@ mod rpc_methods;
mod sync_methods;

pub use gossip_methods::{GossipAggregatePackage, GossipAttestationPackage};
pub use sync_methods::{ChainSegmentProcessId, FailureMode};
pub use sync_methods::ChainSegmentProcessId;

pub(crate) const FUTURE_SLOT_TOLERANCE: u64 = 1;

65 changes: 27 additions & 38 deletions beacon_node/network/src/beacon_processor/worker/sync_methods.rs
@@ -34,15 +34,6 @@ struct ChainSegmentFailed {
message: String,
/// Used to penalize peers.
peer_action: Option<PeerAction>,
/// Failure mode
mode: FailureMode,
}

/// Represents if a block processing failure was on the consensus or execution side.
#[derive(Debug)]
pub enum FailureMode {
ExecutionLayer { pause_sync: bool },
ConsensusLayer,
}

impl<T: BeaconChainTypes> Worker<T> {
@@ -150,7 +141,9 @@ impl<T: BeaconChainTypes> Worker<T> {
"last_block_slot" => end_slot,
"processed_blocks" => sent_blocks,
"service"=> "sync");
BatchProcessResult::Success(sent_blocks > 0)
BatchProcessResult::Success {
was_non_empty: sent_blocks > 0,
}
}
(imported_blocks, Err(e)) => {
debug!(self.log, "Batch processing failed";
@@ -161,11 +154,12 @@ impl<T: BeaconChainTypes> Worker<T> {
"imported_blocks" => imported_blocks,
"error" => %e.message,
"service" => "sync");

BatchProcessResult::Failed {
imported_blocks: imported_blocks > 0,
peer_action: e.peer_action,
mode: e.mode,
match e.peer_action {
Some(penalty) => BatchProcessResult::FaultyFailure {
imported_blocks: imported_blocks > 0,
penalty,
},
None => BatchProcessResult::NonFaultyFailure,
}
}
}
@@ -184,7 +178,9 @@ impl<T: BeaconChainTypes> Worker<T> {
"last_block_slot" => end_slot,
"processed_blocks" => sent_blocks,
"service"=> "sync");
BatchProcessResult::Success(sent_blocks > 0)
BatchProcessResult::Success {
was_non_empty: sent_blocks > 0,
}
}
(_, Err(e)) => {
debug!(self.log, "Backfill batch processing failed";
@@ -193,10 +189,12 @@ impl<T: BeaconChainTypes> Worker<T> {
"last_block_slot" => end_slot,
"error" => %e.message,
"service" => "sync");
BatchProcessResult::Failed {
imported_blocks: false,
peer_action: e.peer_action,
mode: e.mode,
match e.peer_action {
Some(penalty) => BatchProcessResult::FaultyFailure {
imported_blocks: false,
penalty,
},
None => BatchProcessResult::NonFaultyFailure,
}
}
}
@@ -216,15 +214,19 @@ impl<T: BeaconChainTypes> Worker<T> {
{
(imported_blocks, Err(e)) => {
debug!(self.log, "Parent lookup failed"; "error" => %e.message);
BatchProcessResult::Failed {
imported_blocks: imported_blocks > 0,
peer_action: e.peer_action,
mode: e.mode,
match e.peer_action {
Some(penalty) => BatchProcessResult::FaultyFailure {
imported_blocks: imported_blocks > 0,
penalty,
},
None => BatchProcessResult::NonFaultyFailure,
}
}
(imported_blocks, Ok(_)) => {
debug!(self.log, "Parent lookup processed successfully");
BatchProcessResult::Success(imported_blocks > 0)
BatchProcessResult::Success {
was_non_empty: imported_blocks > 0,
}
}
}
}
@@ -307,7 +309,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: String::from("mismatched_block_root"),
// The peer is faulty if they send blocks with bad roots.
peer_action: Some(PeerAction::LowToleranceError),
mode: FailureMode::ConsensusLayer,
}
}
HistoricalBlockError::InvalidSignature
@@ -322,7 +323,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: "invalid_signature".into(),
// The peer is faulty if they send bad signatures.
peer_action: Some(PeerAction::LowToleranceError),
mode: FailureMode::ConsensusLayer,
}
}
HistoricalBlockError::ValidatorPubkeyCacheTimeout => {
@@ -336,7 +336,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: "pubkey_cache_timeout".into(),
// This is an internal error, do not penalize the peer.
peer_action: None,
mode: FailureMode::ConsensusLayer,
}
}
HistoricalBlockError::NoAnchorInfo => {
@@ -347,7 +346,6 @@ impl<T: BeaconChainTypes> Worker<T> {
// There is no need to do a historical sync, this is not a fault of
// the peer.
peer_action: None,
mode: FailureMode::ConsensusLayer,
}
}
HistoricalBlockError::IndexOutOfBounds => {
@@ -360,7 +358,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: String::from("logic_error"),
// This should never occur, don't penalize the peer.
peer_action: None,
mode: FailureMode::ConsensusLayer,
}
}
HistoricalBlockError::BlockOutOfRange { .. } => {
@@ -373,7 +370,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: String::from("unexpected_error"),
// This should never occur, don't penalize the peer.
peer_action: None,
mode: FailureMode::ConsensusLayer,
}
}
},
@@ -383,7 +379,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: format!("{:?}", other),
// This is an internal error, don't penalize the peer.
peer_action: None,
mode: FailureMode::ConsensusLayer,
}
}
};
@@ -404,7 +399,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: format!("Block has an unknown parent: {}", block.parent_root()),
// Peers are faulty if they send non-sequential blocks.
peer_action: Some(PeerAction::LowToleranceError),
mode: FailureMode::ConsensusLayer,
})
}
BlockError::BlockIsAlreadyKnown => {
@@ -442,7 +436,6 @@ impl<T: BeaconChainTypes> Worker<T> {
),
// Peers are faulty if they send blocks from the future.
peer_action: Some(PeerAction::LowToleranceError),
mode: FailureMode::ConsensusLayer,
})
}
BlockError::WouldRevertFinalizedSlot { .. } => {
@@ -464,7 +457,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: format!("Internal error whilst processing block: {:?}", e),
// Do not penalize peers for internal errors.
peer_action: None,
mode: FailureMode::ConsensusLayer,
})
}
ref err @ BlockError::ExecutionPayloadError(ref epe) => {
@@ -480,7 +472,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: format!("Execution layer offline. Reason: {:?}", err),
// Do not penalize peers for internal errors.
peer_action: None,
mode: FailureMode::ExecutionLayer { pause_sync: true },
})
} else {
debug!(self.log,
@@ -493,7 +484,6 @@ impl<T: BeaconChainTypes> Worker<T> {
err
),
peer_action: Some(PeerAction::LowToleranceError),
mode: FailureMode::ExecutionLayer { pause_sync: false },
})
}
}
@@ -508,7 +498,6 @@ impl<T: BeaconChainTypes> Worker<T> {
message: format!("Peer sent invalid block. Reason: {:?}", other),
// Do not penalize peers for internal errors.
peer_action: None,
mode: FailureMode::ConsensusLayer,
})
}
}
