[Bifrost] Design improvements for find_tail #2593

Merged
merged 1 commit into from
Feb 4, 2025

@AhmedSoliman AhmedSoliman commented Jan 30, 2025

This PR introduces a few changes that significantly improve failover time and the performance of common operations like find_tail().
Failover now takes hundreds of milliseconds on the happy path and on the order of a couple of seconds on the unhappy path. find_tail() is
also now significantly cheaper to run if the sequencer is running, which enables running find_tail() in parallel (and more frequently) in the logs controller. That logs controller change will land in a separate PR.

This also includes a new implementation of the seal task that reuses the RunOnSingleNode utility and stops further attempts once an f-majority is sealed, to avoid overloading the cluster. This can be reconsidered if we observe issues.
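
For readers unfamiliar with the "stop once f-majority is sealed" idea, here is a minimal sketch. It is not the code from this PR: `NodeId`, `f_majority_size`, and `seal_until_f_majority` are hypothetical names, the loop is sequential for simplicity, and the f-majority rule (`n - r + 1` for a flat nodeset of `n` nodes with replication factor `r`) is an assumption about the simplest replication setup.

```rust
use std::collections::BTreeSet;

/// Hypothetical node identifier; not the real Restate type.
type NodeId = u32;

/// Assumed f-majority rule for a flat nodeset: once `n - r + 1` nodes are
/// sealed, fewer than `r` unsealed nodes remain, so no new write can reach
/// `r` copies.
fn f_majority_size(nodeset_size: usize, replication: usize) -> usize {
    nodeset_size.saturating_sub(replication) + 1
}

/// Tries to seal nodes one by one and stops as soon as f-majority is reached.
/// `try_seal` stands in for the per-node seal request and returns true on
/// success. Returns the sealed set, or None if f-majority was not reached.
fn seal_until_f_majority(
    nodeset: &[NodeId],
    replication: usize,
    mut try_seal: impl FnMut(NodeId) -> bool,
) -> Option<BTreeSet<NodeId>> {
    let needed = f_majority_size(nodeset.len(), replication);
    let mut sealed = BTreeSet::new();
    for &node in nodeset {
        if try_seal(node) {
            sealed.insert(node);
        }
        if sealed.len() >= needed {
            // Enough nodes are sealed; the remaining nodes are not contacted.
            return Some(sealed);
        }
    }
    None
}
```

With five nodes and replication factor 3, the sketch needs three sealed nodes, so a single failed seal still lets it finish without contacting the last node.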

Stack created with Sapling. Best reviewed with ReviewStack.

github-actions bot commented Jan 30, 2025

Test Results

  7 files  ±0    7 suites  ±0   4m 35s ⏱️ -2s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit bfda31e. ± Comparison against base commit b830b16.

♻️ This comment has been updated with latest results.

@AhmedSoliman AhmedSoliman changed the title from "More logging changes" to "[Bifrost] Design improvements for find_tail" on Jan 31, 2025
@AhmedSoliman AhmedSoliman force-pushed the pr2593 branch 2 times, most recently from b3b0704 to 638c13b Compare January 31, 2025 18:18
@pcholakov

Five minute test with random partitions looks good! I will re-run it a few more times to see how it behaves.

https://github.com/restatedev/jepsen/actions/runs/13080247655/job/36501963775

[Figure: latency-raw]

The gaps during the partitions (grey bars) indicate that no processing seems to be happening, and sometimes we don't recover for a couple of on/off cycles, so 15-25s after the first partition event. I added some client-side timeouts to the test driver to ride out some of the short-term unavailability, but I want to double-check that I'm not starving Jepsen's worker threads with these. I was under the impression that there is dedicated concurrency per Restate worker node.

Some more analysis required but at first glance it looks like a big improvement over before with no long-term lockups.

@tillrohrmann tillrohrmann left a comment

Thanks for improving the find_tail operation @AhmedSoliman. Impressive work as always! I left a few questions for clarifying my understanding. Apart from this +1 for merging it.

crates/bifrost/src/providers/replicated_loglet/loglet.rs (outdated)
Comment on lines +221 to +222
// already fully sealed, just make sure the sequencer is drained.
handle.drain().await?;

It is important to drain to avoid a race condition where a record has been stored on enough log servers but the sequencer hasn't updated its known_global_tail yet. If we called notify_seal without waiting for the drain, we would mark an LSN as sealed that is lower than the last acked LSN (assuming the sequencer eventually acks the LSN that it replicated but has not yet reflected in known_global_tail). Is this roughly correct?

Contributor Author

Yes, correct :)
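
The race discussed in this thread can be illustrated with a toy model. This is a sketch under stated assumptions, not Bifrost's sequencer: `ToySequencer`, `in_flight`, `eventual_tail`, and `seal_tail` are hypothetical names, and the real `drain()` is asynchronous. The tail is modelled as the next offset to be written.

```rust
/// Toy sequencer; field and method names are illustrative, not Bifrost's.
struct ToySequencer {
    /// Tail the sequencer has published so far (next offset to be written).
    known_global_tail: u64,
    /// Offsets already stored on enough log servers but not yet reflected
    /// in `known_global_tail`.
    in_flight: Vec<u64>,
}

impl ToySequencer {
    /// Finishes every in-flight append, advancing the published tail.
    fn drain(&mut self) {
        for offset in self.in_flight.drain(..) {
            self.known_global_tail = self.known_global_tail.max(offset + 1);
        }
    }

    /// Tail that writers will observe once all in-flight appends are acked.
    fn eventual_tail(&self) -> u64 {
        self.in_flight
            .iter()
            .map(|o| o + 1)
            .fold(self.known_global_tail, u64::max)
    }
}

/// Returns the tail that would be recorded when the seal is announced.
/// Without draining first, this can be lower than `eventual_tail()`.
fn seal_tail(seq: &mut ToySequencer, drain_first: bool) -> u64 {
    if drain_first {
        seq.drain();
    }
    seq.known_global_tail
}
```

In this model, a sequencer with a published tail of 10 and in-flight offsets 10 and 11 reports a sealed tail of 10 without the drain, although offsets up to 11 will still be acked. Draining first yields 12.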

crates/bifrost/src/providers/replicated_loglet/loglet.rs (outdated)
crates/bifrost/src/providers/replicated_loglet/provider.rs (outdated)
crates/types/src/config/bifrost.rs (outdated)
@AhmedSoliman AhmedSoliman merged commit bfda31e into main Feb 4, 2025
49 checks passed
@AhmedSoliman AhmedSoliman deleted the pr2593 branch February 4, 2025 12:34