[Bifrost] Design improvements for find_tail #2593
Conversation
Force-pushed from b3b0704 to 638c13b.
Five minute test with random partitions looks good! I will re-run it a few more times to see how it behaves. https://github.com/restatedev/jepsen/actions/runs/13080247655/job/36501963775 The gaps during the partitions (grey bars) indicate that no processing seems to be happening, and sometimes we don't recover for a couple of on/off cycles, i.e. 15-25s after the first partition event. I added some client-side timeouts to the test driver to ride out some of the short-term unavailability, but I'd want to double-check that I'm not starving out Jepsen's worker threads with these; I was under the impression that there is dedicated concurrency per Restate worker node. Some more analysis is required, but at first glance it looks like a big improvement over before, with no long-term lockups.
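Not part of the PR and not the actual Jepsen driver (which isn't shown in this thread), just a minimal sketch of the client-side timeout idea mentioned above, assuming a tokio-based driver; the function name and the 5s bound are hypothetical:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::timeout;

/// Wrap a single client operation in a timeout so the test driver's worker
/// thread is released quickly during a partition instead of blocking for the
/// whole partition window. A timeout is reported as `None` ("indeterminate")
/// rather than as a hard failure.
async fn with_client_timeout<T>(op: impl Future<Output = T>) -> Option<T> {
    match timeout(Duration::from_secs(5), op).await {
        Ok(value) => Some(value),
        Err(_elapsed) => None,
    }
}
```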
Force-pushed from 3a7b64b to 98648aa.
Thanks for improving the find_tail operation @AhmedSoliman. Impressive work as always! I left a few questions to clarify my understanding. Apart from that, +1 for merging it.
// already fully sealed, just make sure the sequencer is drained.
handle.drain().await?;
It is important to drain to avoid a race condition where a record has been stored on enough log servers but the sequencer hasn't updated its known_global_tail yet. If we called notify_seal without waiting for the drain, then we would mark an LSN as sealed which is lower than the last ack'ed LSN (assuming that the sequencer eventually acks the LSN that it replicated but has not yet reflected in known_global_tail). Is this roughly correct?
Yes, correct :)
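To make the ordering concrete, here is a minimal sketch of why the drain has to happen before announcing the seal. The types and most names here are hypothetical stand-ins, not the actual Bifrost API; only drain(), notify_seal and known_global_tail echo names from the discussion above:

```rust
// Hypothetical stand-ins for the sequencer handle and the tail/seal watch.
struct SequencerHandle;
struct TailWatch;

impl SequencerHandle {
    /// Resolves once all in-flight appends are acknowledged or failed, so
    /// `known_global_tail` has caught up with every ack'ed record.
    async fn drain(&self) -> Result<(), &'static str> {
        Ok(())
    }

    fn known_global_tail(&self) -> u64 {
        0 // placeholder
    }
}

impl TailWatch {
    /// Publishes the seal point; everything below `tail` is considered sealed.
    fn notify_seal(&self, _tail: u64) {}
}

async fn drain_then_seal(handle: &SequencerHandle, watch: &TailWatch) -> Result<(), &'static str> {
    // A record may already be stored on enough log servers (and will be
    // ack'ed) while `known_global_tail` still points before it, so drain
    // first and let the tail catch up.
    handle.drain().await?;
    // Only now is it safe to announce the seal: the sealed tail can never
    // end up lower than an LSN the sequencer has already acknowledged.
    watch.notify_seal(handle.known_global_tail());
    Ok(())
}
```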
crates/bifrost/src/providers/replicated_loglet/tasks/check_seal.rs (outdated review thread, resolved)
This PR introduces a few changes resulting in significant improvements in failover time and in the performance of common operations like `find_tail()`. The result is a failover time in the hundreds of milliseconds on the happy path and on the order of a couple of seconds on the unhappy path. `find_tail()` is also now significantly cheaper to run when the sequencer is running, which enables parallelizing `find_tail()` runs in the logs controller (and running them more frequently as well). The latter will be addressed in a separate PR.

This also includes a new implementation of the seal task that reuses the `RunOnSingleNode` utility and that doesn't continue attempts once f-majority is sealed, to reduce load on the cluster. This can be reconsidered if we observe issues.

Stack created with Sapling. Best reviewed with ReviewStack.
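For context on the "stop once f-majority is sealed" behaviour, a minimal sketch under a simplified replication model; `NodeId`, `SealTracker` and the field names are assumptions for illustration, not the actual `RunOnSingleNode`-based implementation:

```rust
use std::collections::HashSet;

/// Hypothetical node identifier; the real Bifrost types differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Tracks seal acknowledgements for one loglet's node set. Assumes a simple
/// model where each record is replicated to `replication` nodes.
struct SealTracker {
    nodeset_size: usize,
    replication: usize,
    sealed: HashSet<NodeId>,
}

impl SealTracker {
    /// f-majority under this model: once at least `nodeset_size - replication + 1`
    /// nodes are sealed, no copyset of `replication` unsealed nodes remains, so
    /// no new record can be fully replicated past the seal.
    fn have_f_majority(&self) -> bool {
        self.sealed.len() >= self.nodeset_size.saturating_sub(self.replication) + 1
    }

    /// Record a seal ack; returns true once the task can stop retrying the
    /// remaining nodes instead of adding more load to a degraded cluster.
    fn on_sealed(&mut self, node: NodeId) -> bool {
        self.sealed.insert(node);
        self.have_f_majority()
    }
}
```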