Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sync and inbound timeouts #1586

Merged
merged 6 commits into from
Jan 14, 2021
Merged

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Jan 13, 2021

Motivation

The sync service sometimes hangs (#1559).

It's also possible that the inbound service sometimes hangs, but we just don't notice.

Solution

Add timeouts to:

  • the sync verifier
  • the inbound downloads
  • the inbound verifier

Add some timeout constraints as a unit test.
Add correctness warning docs to the network and verify timeouts.

Document the services used by sync and inbound.

The code in this pull request has:

Manual Testing

I ran 4 zebrad instances:

  • 3 mainnet:
    • 1 close to tip
    • 2 ephemeral
  • 1 testnet:
    • 1 close to tip

None of them have hung after about 12 hours, so it seems like this PR fixes our sync hangs.

I'm now running 5 zebrad instances:

  • 2 mainnet:
    • 1 close to tip
    • 1 ephemeral
  • 3 testnet:
    • 1 close to tip
    • 2 ephemeral

They also succeeded, but the ephemeral mainnet instance is lagging about 500 blocks behind the tip, and some of the testnet instances have a low number of peers. (These both look like #1435, but with a much lower impact.)

Review

I've been talking to @yaahc about these bugs.

I'd like to get this PR merged by Friday, so I don't have to change my #1435 fixes too much.

Related Issues

Closes #1559
Handles one part of #1389
Removes some potential causes of #1435 and makes it easier to diagnose

Follow Up Work

We won't know if we've completely fixed this issue until we've also fixed:

We need some sync hang tests. I have some ideas, and I'll write them up tomorrow.

  • Open a follow-up task to create a consistent design for:

This timeout stops the sync service hanging when it is missing required
blocks, but the lookahead queue is full of dependent verify tasks, so the
missing blocks never get downloaded.
And adjust the sync restart delay as a consequence.
And add comments about correctness and usage.
@teor2345 teor2345 added C-bug Category: This is a bug A-docs Area: Documentation A-rust Area: Updates to Rust code C-cleanup Category: This is a cleanup labels Jan 13, 2021
@teor2345 teor2345 added this to the 2021 Sprint 1 milestone Jan 13, 2021
@teor2345 teor2345 requested a review from yaahc January 13, 2021 10:42
@teor2345 teor2345 self-assigned this Jan 13, 2021
@dconnolly
Copy link
Contributor

None of them have hung after about 12 hours, so it seems like this PR fixes our sync hangs.

🎉

Copy link
Contributor

@yaahc yaahc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

Comment on lines +217 to +221
let lookahead_limit = if config.sync.lookahead_limit < MIN_LOOKAHEAD_LIMIT {
tracing::warn!(config_lookahead_limit = ?config.sync.lookahead_limit, ?MIN_LOOKAHEAD_LIMIT,
"configured lookahead limit too low: using minimum lookahead limit");
MIN_LOOKAHEAD_LIMIT
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that users might not notice this warning, what do you think about turning this into a hard error at startup, probably by adding some sort of fn validate(&self) -> Result<(), ConfigError>; that we call in ZebraApp when we're initializing the config and what not.

Copy link
Contributor Author

@teor2345 teor2345 Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all our components were abscissa FrameworkComponents, we could do config validation in the after_config callback, and return a FrameworkError:
https://docs.rs/abscissa_core/0.5.2/abscissa_core/application/trait.Application.html#tymethod.after_config

But we had trouble with component dependencies, so some components aren't abscissa FrameworkComponents.

Since this change isn't a high priority, and it isn't really the focus of this PR, I'll open a quick follow-up PR that turns this into a panic.

And then we open a follow-up task to create a consistent design for:

  • defining limits for each config value
  • enforcing those limits
  • handing limit errors

Edit: let's just panic in a quick follow up PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-docs Area: Documentation A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync service hangs, but inbound service continues to answer requests
3 participants