Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - CI fix: add retries to eth1 sim tests #4501

Closed
wants to merge 3 commits into from

Conversation

jimmygchen
Copy link
Member

Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil.

FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently.

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

  • Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
  • Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.

@jimmygchen jimmygchen changed the title Add retry to eth1 sim tests Add retries to eth1 sim tests Jul 13, 2023
@jimmygchen
Copy link
Member Author

I was also able to reproduce it locally running in non-release mode. I suspected earlier that it could be a race condition between initial block production and deposit submission timing, but I've also seen it happen at different indices, e.g. got an error like FailedToInsertDeposit(NonConsecutive { log_index: 19, ... , so it's not always just missing initial deposit event... although this (missing 0 index deposit) seems to be the only scenario I've seen on CI.

@jimmygchen jimmygchen force-pushed the add-retry-to-sim-tests branch from a5d4c1f to e9c6305 Compare July 13, 2023 12:29
Comment on lines -110 to -170
/*
* Deploy the deposit contract, spawn tasks to keep creating new blocks and deposit
* validators.
*/
let anvil_eth1_instance = AnvilEth1Instance::new(DEFAULT_CHAIN_ID.into()).await?;
let deposit_contract = anvil_eth1_instance.deposit_contract;
let chain_id = anvil_eth1_instance.anvil.chain_id();
let anvil = anvil_eth1_instance.anvil;
let eth1_endpoint = SensitiveUrl::parse(anvil.endpoint().as_str())
.expect("Unable to parse anvil endpoint.");
let deposit_contract_address = deposit_contract.address();

// Start a timer that produces eth1 blocks on an interval.
tokio::spawn(async move {
let mut interval = tokio::time::interval(eth1_block_time);
loop {
interval.tick().await;
let _ = anvil.evm_mine().await;
}
});

// Submit deposits to the deposit contract.
tokio::spawn(async move {
for i in 0..total_validator_count {
println!("Submitting deposit for validator {}...", i);
let _ = deposit_contract
.deposit_deterministic_async::<E>(i, deposit_amount)
.await;
}
});

let mut beacon_config = testing_client_config();

beacon_config.genesis = ClientGenesis::DepositContract;
beacon_config.eth1.endpoint = Eth1Endpoint::NoAuth(eth1_endpoint);
beacon_config.eth1.deposit_contract_address = deposit_contract_address;
beacon_config.eth1.deposit_contract_deploy_block = 0;
beacon_config.eth1.lowest_cached_block_number = 0;
beacon_config.eth1.follow_distance = 1;
beacon_config.eth1.node_far_behind_seconds = 20;
beacon_config.dummy_eth1_backend = false;
beacon_config.sync_eth1_chain = true;
beacon_config.eth1.auto_update_interval_millis = eth1_block_time.as_millis() as u64;
beacon_config.eth1.chain_id = Eth1Id::from(chain_id);
beacon_config.network.target_peers = node_count + proposer_nodes - 1;

beacon_config.network.enr_address = (Some(Ipv4Addr::LOCALHOST), None);

if post_merge_sim {
let el_config = execution_layer::Config {
execution_endpoints: vec![SensitiveUrl::parse(&format!(
"http://localhost:{}",
EXECUTION_PORT
))
.unwrap()],
..Default::default()
};

beacon_config.execution_layer = Some(el_config);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section has been extracted into a create_local_network function below without any logic change.
https://github.com/sigp/lighthouse/pull/4501/files#diff-21c83610adf1cfb2330cabd0bec079ae7f7c0f3e3d525c7931181fd5592983c3R309

@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Jul 13, 2023
/*
* Create a new `LocalNetwork` with one beacon node.
*/
let network = LocalNetwork::new(context.clone(), beacon_config.clone()).await?;
let max_retries = 3;
Copy link
Member Author

@jimmygchen jimmygchen Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying a small retry number to start with, the deposit logs issue seems to occur about 50/50 of times and usually passes on the 1st retry attempt, so i reckon maybe 3 would be enough, so we don't wait for too long for the job to fail if there's a geniune issue in the beacon node start up.

I initially had a different error type to signify a timeout error, but I don't feel it's worth it so I've removed them to keep this PR small.

@jimmygchen jimmygchen changed the title Add retries to eth1 sim tests CI fix: add retries to eth1 sim tests Jul 13, 2023
@pawanjay176
Copy link
Member

was also able to reproduce it locally running in non-release mode

This is awesome. Was not able to repro this when I was investigating earlier. Will try to dig around in anvil when I get the chance.

Copy link
Member

@michaelsproul michaelsproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, the with_retry function is really neat.

Let's cook 🧑‍🍳

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jul 13, 2023
@michaelsproul
Copy link
Member

bors r+

bors bot pushed a commit that referenced this pull request Jul 13, 2023
## Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. 

> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. 

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

## Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

## Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
@bors
Copy link

bors bot commented Jul 13, 2023

This PR was included in a batch that was canceled, it will be automatically retried

bors bot pushed a commit that referenced this pull request Jul 13, 2023
## Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. 

> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. 

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

## Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

## Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
@bors
Copy link

bors bot commented Jul 14, 2023

This PR was included in a batch that timed out, it will be automatically retried

@michaelsproul
Copy link
Member

bors r-

@bors
Copy link

bors bot commented Jul 14, 2023

Canceled.

@michaelsproul
Copy link
Member

bors r+

bors bot pushed a commit that referenced this pull request Jul 17, 2023
## Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. 

> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. 

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

## Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

## Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
@bors
Copy link

bors bot commented Jul 17, 2023

@bors bors bot changed the title CI fix: add retries to eth1 sim tests [Merged by Bors] - CI fix: add retries to eth1 sim tests Jul 17, 2023
@bors bors bot closed this Jul 17, 2023
@jimmygchen
Copy link
Member Author

was also able to reproduce it locally running in non-release mode

This is awesome. Was not able to repro this when I was investigating earlier. Will try to dig around in anvil when I get the chance.

@pawanjay176 I haven't seen it happen on CI yet after updating to use the pre-built foundry binaries, perhaps it's less likely to occur with release profile.

@jimmygchen jimmygchen deleted the add-retry-to-sim-tests branch July 17, 2023 03:06
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
## Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. 

> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. 

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

## Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

## Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
## Issue Addressed

This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil. 

> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })

This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently. 

Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914

## Proposed Changes

The quick fix applied here adds a timeout to node startup and restarts the node again.

- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.

## Additional Info

We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infra-ci ready-for-merge This PR is ready to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants