[Bifrost] Design improvements for find_tail #2593
Conversation
Force-pushed from b3b0704 to 638c13b.
Five minute test with random partitions looks good! I will re-run it a few more times to see how it behaves. https://github.com/restatedev/jepsen/actions/runs/13080247655/job/36501963775 The gaps during the partitions (grey bars) indicate that no processing seems to be happening, and sometimes we don't recover for a couple of on/off cycles, i.e. 15-25s after the first partition event. I added some client-side timeouts to the test driver to ride out some of the short-term unavailability, but I'd want to double-check that I'm not starving out Jepsen's worker threads with these; I was under the impression that there is dedicated concurrency per Restate worker node. Some more analysis is required, but at first glance it looks like a big improvement over before, with no long-term lockups.
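Not part of the PR and not the actual Jepsen driver (which isn't shown in this thread), just a minimal sketch of the client-side timeout idea mentioned above, assuming a tokio-based driver; the function name and the 5s bound are hypothetical:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::timeout;

/// Wrap a single client operation in a timeout so the test driver's worker
/// thread is released quickly during a partition instead of blocking for the
/// whole partition window. A timeout is reported as `None` ("indeterminate")
/// rather than as a hard failure.
async fn with_client_timeout<T>(op: impl Future<Output = T>) -> Option<T> {
    match timeout(Duration::from_secs(5), op).await {
        Ok(value) => Some(value),
        Err(_elapsed) => None,
    }
}
```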
Force-pushed from 3a7b64b to 98648aa.
Thanks for improving the find_tail operation @AhmedSoliman. Impressive work as always! I left a few questions to clarify my understanding. Apart from that, +1 for merging it.
// already fully sealed, just make sure the sequencer is drained.
handle.drain().await?;
It is important to drain to avoid a race condition where a record has been stored on enough log servers but the sequencer hasn't updated its known_global_tail yet. If we called notify_seal without waiting for the drain, then we would mark an LSN as sealed which is lower than the last ack'ed LSN (assuming that the sequencer eventually acks the LSN that it replicated but has not yet reflected in known_global_tail). Is this roughly correct?
Yes, correct :)
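To make the ordering concrete, here is a minimal sketch of why the drain has to happen before announcing the seal. The types and most names here are hypothetical stand-ins, not the actual Bifrost API; only drain(), notify_seal and known_global_tail echo names from the discussion above:

```rust
// Hypothetical stand-ins for the sequencer handle and the tail/seal watch.
struct SequencerHandle;
struct TailWatch;

impl SequencerHandle {
    /// Resolves once all in-flight appends are acknowledged or failed, so
    /// `known_global_tail` has caught up with every ack'ed record.
    async fn drain(&self) -> Result<(), &'static str> {
        Ok(())
    }

    fn known_global_tail(&self) -> u64 {
        0 // placeholder
    }
}

impl TailWatch {
    /// Publishes the seal point; everything below `tail` is considered sealed.
    fn notify_seal(&self, _tail: u64) {}
}

async fn drain_then_seal(handle: &SequencerHandle, watch: &TailWatch) -> Result<(), &'static str> {
    // A record may already be stored on enough log servers (and will be
    // ack'ed) while `known_global_tail` still points before it, so drain
    // first and let the tail catch up.
    handle.drain().await?;
    // Only now is it safe to announce the seal: the sealed tail can never
    // end up lower than an LSN the sequencer has already acknowledged.
    watch.notify_seal(handle.known_global_tail());
    Ok(())
}
```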
crates/bifrost/src/providers/replicated_loglet/tasks/check_seal.rs (outdated review thread, resolved)
This PR introduces a few changes resulting in significant improvements in failover time and in the performance of common operations like `find_tail()`. The result is a failover time in the hundreds of milliseconds on the happy path and on the order of a couple of seconds on the unhappy path. `find_tail()` is also now significantly cheaper to run when the sequencer is running, which enables parallelizing `find_tail()` runs in the logs controller (and running them more frequently as well). The latter will be addressed in a separate PR.

This also includes a new implementation of the seal task that reuses the `RunOnSingleNode` utility and that doesn't continue attempts once f-majority is sealed, to reduce load on the cluster. This can be reconsidered if we observe issues.

Stack created with Sapling. Best reviewed with ReviewStack.
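For context on the "stop once f-majority is sealed" behaviour, a minimal sketch under a simplified replication model; `NodeId`, `SealTracker` and the field names are assumptions for illustration, not the actual `RunOnSingleNode`-based implementation:

```rust
use std::collections::HashSet;

/// Hypothetical node identifier; the real Bifrost types differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Tracks seal acknowledgements for one loglet's node set. Assumes a simple
/// model where each record is replicated to `replication` nodes.
struct SealTracker {
    nodeset_size: usize,
    replication: usize,
    sealed: HashSet<NodeId>,
}

impl SealTracker {
    /// f-majority under this model: once at least `nodeset_size - replication + 1`
    /// nodes are sealed, no copyset of `replication` unsealed nodes remains, so
    /// no new record can be fully replicated past the seal.
    fn have_f_majority(&self) -> bool {
        self.sealed.len() >= self.nodeset_size.saturating_sub(self.replication) + 1
    }

    /// Record a seal ack; returns true once the task can stop retrying the
    /// remaining nodes instead of adding more load to a degraded cluster.
    fn on_sealed(&mut self, node: NodeId) -> bool {
        self.sealed.insert(node);
        self.have_f_majority()
    }
}
```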