[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501

jimmygchen · 2023-07-13T07:33:55Z

Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil.

FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently.

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.

jimmygchen · 2023-07-13T12:12:38Z

I was also able to reproduce it locally running in non-release mode. I suspected earlier that it could be a race condition between initial block production and deposit submission timing, but I've also seen it happen at different indices, e.g. got an error like FailedToInsertDeposit(NonConsecutive { log_index: 19, ... , so it's not always just missing initial deposit event... although this (missing 0 index deposit) seems to be the only scenario I've seen on CI.

jimmygchen · 2023-07-13T12:32:07Z

testing/simulator/src/eth1_sim.rs

-        /*
-         * Deploy the deposit contract, spawn tasks to keep creating new blocks and deposit
-         * validators.
-         */
-        let anvil_eth1_instance = AnvilEth1Instance::new(DEFAULT_CHAIN_ID.into()).await?;
-        let deposit_contract = anvil_eth1_instance.deposit_contract;
-        let chain_id = anvil_eth1_instance.anvil.chain_id();
-        let anvil = anvil_eth1_instance.anvil;
-        let eth1_endpoint = SensitiveUrl::parse(anvil.endpoint().as_str())
-            .expect("Unable to parse anvil endpoint.");
-        let deposit_contract_address = deposit_contract.address();
-
-        // Start a timer that produces eth1 blocks on an interval.
-        tokio::spawn(async move {
-            let mut interval = tokio::time::interval(eth1_block_time);
-            loop {
-                interval.tick().await;
-                let _ = anvil.evm_mine().await;
-            }
-        });
-
-        // Submit deposits to the deposit contract.
-        tokio::spawn(async move {
-            for i in 0..total_validator_count {
-                println!("Submitting deposit for validator {}...", i);
-                let _ = deposit_contract
-                    .deposit_deterministic_async::<E>(i, deposit_amount)
-                    .await;
-            }
-        });
-
-        let mut beacon_config = testing_client_config();
-
-        beacon_config.genesis = ClientGenesis::DepositContract;
-        beacon_config.eth1.endpoint = Eth1Endpoint::NoAuth(eth1_endpoint);
-        beacon_config.eth1.deposit_contract_address = deposit_contract_address;
-        beacon_config.eth1.deposit_contract_deploy_block = 0;
-        beacon_config.eth1.lowest_cached_block_number = 0;
-        beacon_config.eth1.follow_distance = 1;
-        beacon_config.eth1.node_far_behind_seconds = 20;
-        beacon_config.dummy_eth1_backend = false;
-        beacon_config.sync_eth1_chain = true;
-        beacon_config.eth1.auto_update_interval_millis = eth1_block_time.as_millis() as u64;
-        beacon_config.eth1.chain_id = Eth1Id::from(chain_id);
-        beacon_config.network.target_peers = node_count + proposer_nodes - 1;
-
-        beacon_config.network.enr_address = (Some(Ipv4Addr::LOCALHOST), None);
-
-        if post_merge_sim {
-            let el_config = execution_layer::Config {
-                execution_endpoints: vec![SensitiveUrl::parse(&format!(
-                    "http://localhost:{}",
-                    EXECUTION_PORT
-                ))
-                .unwrap()],
-                ..Default::default()
-            };
-
-            beacon_config.execution_layer = Some(el_config);
-        }
-


This section has been extracted into a create_local_network function below without any logic change.
https://github.com/sigp/lighthouse/pull/4501/files#diff-21c83610adf1cfb2330cabd0bec079ae7f7c0f3e3d525c7931181fd5592983c3R309

jimmygchen · 2023-07-13T12:40:03Z

testing/simulator/src/eth1_sim.rs

        /*
         * Create a new `LocalNetwork` with one beacon node.
         */
-        let network = LocalNetwork::new(context.clone(), beacon_config.clone()).await?;
+        let max_retries = 3;


Trying a small retry number to start with, the deposit logs issue seems to occur about 50/50 of times and usually passes on the 1st retry attempt, so i reckon maybe 3 would be enough, so we don't wait for too long for the job to fail if there's a geniune issue in the beacon node start up.

I initially had a different error type to signify a timeout error, but I don't feel it's worth it so I've removed them to keep this PR small.

pawanjay176 · 2023-07-13T21:25:46Z

was also able to reproduce it locally running in non-release mode

This is awesome. Was not able to repro this when I was investigating earlier. Will try to dig around in anvil when I get the chance.

michaelsproul

This is awesome, the with_retry function is really neat.

Let's cook 🧑‍🍳

michaelsproul · 2023-07-13T22:46:24Z

bors r+

## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.

bors · 2023-07-13T22:47:34Z

This PR was included in a batch that was canceled, it will be automatically retried

## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.

bors · 2023-07-14T01:55:51Z

This PR was included in a batch that timed out, it will be automatically retried

michaelsproul · 2023-07-14T01:57:22Z

bors r-

bors · 2023-07-14T01:57:24Z

Canceled.

michaelsproul · 2023-07-17T00:13:51Z

bors r+

## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.

bors · 2023-07-17T02:48:39Z

Pull request successfully merged into unstable.

Build succeeded!

The publicly hosted instance of bors-ng is deprecated and will go away soon.

If you want to self-host your own instance, instructions are here.
For more help, visit the forum.

If you want to switch to GitHub's built-in merge queue, visit their help page.

jimmygchen · 2023-07-17T03:06:00Z

was also able to reproduce it locally running in non-release mode

This is awesome. Was not able to repro this when I was investigating earlier. Will try to dig around in anvil when I get the chance.

@pawanjay176 I haven't seen it happen on CI yet after updating to use the pre-built foundry binaries, perhaps it's less likely to occur with release profile.

## Issue Addressed This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. > FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 }) This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914 ## Proposed Changes The quick fix applied here adds a timeout to node startup and restarts the node again. - Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners. - Wrap the startup code in a retry function, that allows for 3 retries before returning an error. ## Additional Info We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.

Add retry to eth1 sim tests

60a430b

jimmygchen added the infra-ci label Jul 13, 2023

Reorder retry_count increment so it logs correctly

b6601f7

jimmygchen changed the title ~~Add retry to eth1 sim tests~~ Add retries to eth1 sim tests Jul 13, 2023

Remove some unnecessary error types to reduce amount of changes

e9c6305

jimmygchen force-pushed the add-retry-to-sim-tests branch from a5d4c1f to e9c6305 Compare July 13, 2023 12:29

jimmygchen commented Jul 13, 2023

View reviewed changes

jimmygchen added the ready-for-review The code is ready for review label Jul 13, 2023

jimmygchen commented Jul 13, 2023

View reviewed changes

jimmygchen changed the title ~~Add retries to eth1 sim tests~~ CI fix: add retries to eth1 sim tests Jul 13, 2023

michaelsproul approved these changes Jul 13, 2023

View reviewed changes

michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jul 13, 2023

bors bot changed the title ~~CI fix: add retries to eth1 sim tests~~ [Merged by Bors] - CI fix: add retries to eth1 sim tests Jul 17, 2023

bors bot closed this Jul 17, 2023

jimmygchen deleted the add-retry-to-sim-tests branch July 17, 2023 03:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501

[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501

jimmygchen commented Jul 13, 2023

jimmygchen commented Jul 13, 2023

jimmygchen Jul 13, 2023

jimmygchen Jul 13, 2023 •

edited

Loading

pawanjay176 commented Jul 13, 2023

michaelsproul left a comment

michaelsproul commented Jul 13, 2023

bors bot commented Jul 13, 2023

bors bot commented Jul 14, 2023

michaelsproul commented Jul 14, 2023

bors bot commented Jul 14, 2023

michaelsproul commented Jul 17, 2023

bors bot commented Jul 17, 2023

jimmygchen commented Jul 17, 2023

[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501

[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501

Conversation

jimmygchen commented Jul 13, 2023

Issue Addressed

Proposed Changes

Additional Info

jimmygchen commented Jul 13, 2023

jimmygchen Jul 13, 2023

Choose a reason for hiding this comment

jimmygchen Jul 13, 2023 • edited Loading

Choose a reason for hiding this comment

pawanjay176 commented Jul 13, 2023

michaelsproul left a comment

Choose a reason for hiding this comment

michaelsproul commented Jul 13, 2023

bors bot commented Jul 13, 2023

bors bot commented Jul 14, 2023

michaelsproul commented Jul 14, 2023

bors bot commented Jul 14, 2023

michaelsproul commented Jul 17, 2023

bors bot commented Jul 17, 2023

jimmygchen commented Jul 17, 2023

jimmygchen Jul 13, 2023 •

edited

Loading