-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sync and inbound timeouts #1586
Conversation
This timeout stops the sync service hanging when it is missing required blocks, but the lookahead queue is full of dependent verify tasks, so the missing blocks never get downloaded.
And adjust the sync restart delay as a consequence.
And add comments about correctness and usage.
And add download comments
3300610
to
a648f1c
Compare
🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥇
let lookahead_limit = if config.sync.lookahead_limit < MIN_LOOKAHEAD_LIMIT { | ||
tracing::warn!(config_lookahead_limit = ?config.sync.lookahead_limit, ?MIN_LOOKAHEAD_LIMIT, | ||
"configured lookahead limit too low: using minimum lookahead limit"); | ||
MIN_LOOKAHEAD_LIMIT | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that users might not notice this warning, what do you think about turning this into a hard error at startup, probably by adding some sort of fn validate(&self) -> Result<(), ConfigError>;
that we call in ZebraApp
when we're initializing the config and what not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all our components were abscissa FrameworkComponents, we could do config validation in the after_config callback, and return a FrameworkError:
https://docs.rs/abscissa_core/0.5.2/abscissa_core/application/trait.Application.html#tymethod.after_config
But we had trouble with component dependencies, so some components aren't abscissa FrameworkComponents.
Since this change isn't a high priority, and it isn't really the focus of this PR, I'll open a quick follow-up PR that turns this into a panic.
And then we open a follow-up task to create a consistent design for:
- defining limits for each config value
- enforcing those limits
- handing limit errors
Edit: let's just panic in a quick follow up PR
Motivation
The sync service sometimes hangs (#1559).
It's also possible that the inbound service sometimes hangs, but we just don't notice.
Solution
Add timeouts to:
Add some timeout constraints as a unit test.
Add correctness warning docs to the network and verify timeouts.
Document the services used by sync and inbound.
The code in this pull request has:
Manual Testing
I ran 4
zebrad
instances:None of them have hung after about 12 hours, so it seems like this PR fixes our sync hangs.
I'm now running 5
zebrad
instances:They also succeeded, but the ephemeral mainnet instance is lagging about 500 blocks behind the tip, and some of the testnet instances have a low number of peers. (These both look like #1435, but with a much lower impact.)
Review
I've been talking to @yaahc about these bugs.
I'd like to get this PR merged by Friday, so I don't have to change my #1435 fixes too much.
Related Issues
Closes #1559
Handles one part of #1389
Removes some potential causes of #1435 and makes it easier to diagnose
Follow Up Work
We won't know if we've completely fixed this issue until we've also fixed:
We need some sync hang tests. I have some ideas, and I'll write them up tomorrow.
This is now Design: config option limits #1591.