-
Notifications
You must be signed in to change notification settings - Fork 791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501
Conversation
I was also able to reproduce it locally running in non-release mode. I suspected earlier that it could be a race condition between initial block production and deposit submission timing, but I've also seen it happen at different indices, e.g. got an error like |
a5d4c1f
to
e9c6305
Compare
/* | ||
* Deploy the deposit contract, spawn tasks to keep creating new blocks and deposit | ||
* validators. | ||
*/ | ||
let anvil_eth1_instance = AnvilEth1Instance::new(DEFAULT_CHAIN_ID.into()).await?; | ||
let deposit_contract = anvil_eth1_instance.deposit_contract; | ||
let chain_id = anvil_eth1_instance.anvil.chain_id(); | ||
let anvil = anvil_eth1_instance.anvil; | ||
let eth1_endpoint = SensitiveUrl::parse(anvil.endpoint().as_str()) | ||
.expect("Unable to parse anvil endpoint."); | ||
let deposit_contract_address = deposit_contract.address(); | ||
|
||
// Start a timer that produces eth1 blocks on an interval. | ||
tokio::spawn(async move { | ||
let mut interval = tokio::time::interval(eth1_block_time); | ||
loop { | ||
interval.tick().await; | ||
let _ = anvil.evm_mine().await; | ||
} | ||
}); | ||
|
||
// Submit deposits to the deposit contract. | ||
tokio::spawn(async move { | ||
for i in 0..total_validator_count { | ||
println!("Submitting deposit for validator {}...", i); | ||
let _ = deposit_contract | ||
.deposit_deterministic_async::<E>(i, deposit_amount) | ||
.await; | ||
} | ||
}); | ||
|
||
let mut beacon_config = testing_client_config(); | ||
|
||
beacon_config.genesis = ClientGenesis::DepositContract; | ||
beacon_config.eth1.endpoint = Eth1Endpoint::NoAuth(eth1_endpoint); | ||
beacon_config.eth1.deposit_contract_address = deposit_contract_address; | ||
beacon_config.eth1.deposit_contract_deploy_block = 0; | ||
beacon_config.eth1.lowest_cached_block_number = 0; | ||
beacon_config.eth1.follow_distance = 1; | ||
beacon_config.eth1.node_far_behind_seconds = 20; | ||
beacon_config.dummy_eth1_backend = false; | ||
beacon_config.sync_eth1_chain = true; | ||
beacon_config.eth1.auto_update_interval_millis = eth1_block_time.as_millis() as u64; | ||
beacon_config.eth1.chain_id = Eth1Id::from(chain_id); | ||
beacon_config.network.target_peers = node_count + proposer_nodes - 1; | ||
|
||
beacon_config.network.enr_address = (Some(Ipv4Addr::LOCALHOST), None); | ||
|
||
if post_merge_sim { | ||
let el_config = execution_layer::Config { | ||
execution_endpoints: vec![SensitiveUrl::parse(&format!( | ||
"http://localhost:{}", | ||
EXECUTION_PORT | ||
)) | ||
.unwrap()], | ||
..Default::default() | ||
}; | ||
|
||
beacon_config.execution_layer = Some(el_config); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section has been extracted into a create_local_network
function below without any logic change.
https://github.com/sigp/lighthouse/pull/4501/files#diff-21c83610adf1cfb2330cabd0bec079ae7f7c0f3e3d525c7931181fd5592983c3R309
/* | ||
* Create a new `LocalNetwork` with one beacon node. | ||
*/ | ||
let network = LocalNetwork::new(context.clone(), beacon_config.clone()).await?; | ||
let max_retries = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying a small retry number to start with, the deposit logs issue seems to occur about 50/50 of times and usually passes on the 1st retry attempt, so i reckon maybe 3 would be enough, so we don't wait for too long for the job to fail if there's a geniune issue in the beacon node start up.
I initially had a different error type to signify a timeout error, but I don't feel it's worth it so I've removed them to keep this PR small.
This is awesome. Was not able to repro this when I was investigating earlier. Will try to dig around in anvil when I get the chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, the with_retry
function is really neat.
Let's cook 🧑🍳
bors r+ |
## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
This PR was included in a batch that was canceled, it will be automatically retried |
## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
This PR was included in a batch that timed out, it will be automatically retried |
bors r- |
Canceled. |
bors r+ |
## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
Pull request successfully merged into unstable. Build succeeded! The publicly hosted instance of bors-ng is deprecated and will go away soon. If you want to self-host your own instance, instructions are here. If you want to switch to GitHub's built-in merge queue, visit their help page.
|
@pawanjay176 I haven't seen it happen on CI yet after updating to use the pre-built foundry binaries, perhaps it's less likely to occur with |
## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
Issue Addressed
This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil.
This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently.
Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914
Proposed Changes
The quick fix applied here adds a timeout to node startup and restarts the node again.
Additional Info
We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.