
LSM: fixing Compaction interactions with iterators #177

Merged: 14 commits, Oct 20, 2022

Conversation

@kprotty (Contributor) commented Oct 4, 2022:

MergeIterator stream_peek()

There are times when peek() on the TableIterator (and, transitively, the LevelIterator) returns null without the iterator being empty. This happens when it runs out of values buffered in memory and needs to tick() again. Unfortunately, the MergeIterator currently assumes that a null peek() means the iterator has completed, so it stops accessing it and removes it from its heap.

MergeIterator.stream_peek() can now return a third state, Pending, which means that peek() on the iterator being merged would return null but buffered_all_values() would return false. This ensures iterators are accessed until they are truly empty.
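The three-state distinction can be sketched as follows. This is a hypothetical Python model, not the actual Zig code: the Stream class, its buffered/on_disk fields, and the string sentinels are illustrative stand-ins for the TableIterator and its block buffers.

```python
# Sketch of the three-state peek(): Pending and Empty are distinguished by
# whether the iterator still has unbuffered values (buffered_all_values()).
PENDING = "pending"  # no value available now, but more exist on disk
EMPTY = "empty"      # the iterator is truly exhausted

class Stream:
    """Illustrative stand-in for a TableIterator."""
    def __init__(self, buffered, on_disk):
        self.buffered = list(buffered)  # values already in memory
        self.on_disk = list(on_disk)    # values that need another tick()

    def buffered_all_values(self):
        return not self.on_disk

    def peek(self):
        if self.buffered:
            return self.buffered[0]
        # No value in memory: Pending if IO is still needed, Empty otherwise.
        return EMPTY if self.buffered_all_values() else PENDING

    def tick(self):
        # Simulate IO refilling the in-memory buffer.
        self.buffered += self.on_disk
        self.on_disk = []
```

A merge loop that treats PENDING as "stop and retry later", rather than dropping the stream from its heap, keeps accessing the iterator until it is truly EMPTY.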

Compaction iterator IO tracking

There was a bug in the order in which IO was tracked during Compaction. tick() would be called on an iterator, which would initiate IO on the Grid. The Grid would service the IO from its cache and complete it inline, which bubbled up and called Compaction.io_finish() early. tick() then returned true to signal that IO had started, and Compaction.io_start() was called, but by then it was too late.

The fix is to call Compaction.io_start() before calling tick(). If tick() returns true, io_finish() will be called either inline or eventually; if it returns false, Compaction calls io_finish() itself to cancel it out.
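The accounting fix can be sketched like this. Hypothetical Python, not the real code: Compaction is reduced to a counter, and tick_iterator with its cache_hit flag stands in for iterator.tick() and the Grid cache.

```python
class Compaction:
    """Minimal IO accounting: the merge runs when all IO has settled."""
    def __init__(self):
        self.io_pending = 0
        self.merged = False

    def io_start(self):
        self.io_pending += 1

    def io_finish(self):
        self.io_pending -= 1
        assert self.io_pending >= 0  # would fire with the buggy ordering
        if self.io_pending == 0:
            self.merged = True       # stand-in for cpu_merge_start()

def tick_iterator(compaction, cache_hit):
    # Stand-in for iterator.tick(): on a cache hit the IO completes inline,
    # calling io_finish() before tick() even returns.
    if cache_hit:
        compaction.io_finish()
        return True   # IO "started" (and already finished)
    return False      # no IO was needed

def compact_tick(compaction, cache_hit):
    # The fix: account for the IO *before* tick(), so an inline completion
    # cannot call io_finish() ahead of the matching io_start().
    compaction.io_start()
    started = tick_iterator(compaction, cache_hit)
    if not started:
        compaction.io_finish()  # cancel out the speculative io_start()
```

With the old ordering (tick() first, io_start() on true), the inline cache-hit path would decrement the counter before it was ever incremented.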

Manifest invisible table iterator issues

One of the last remaining issues is that assert_no_invisible_tables() is hit at checkpoint(). When the even/odd compactions complete, they call remove_invisible_tables for their level, and all of this happens before the checkpoint. Yet, according to the ManifestLevels, some invisible tables still remain.

Changing remove_invisible_tables to iterate over all levels rather than using the KeyRange as a hint appears to fix this for lsm/test.zig, but the issue persists in the rafiki tests. Other attempts (with no luck in rafiki) include re-creating the iterator in a loop to fetch the next table, to mitigate possible invalidation by removal, and removing from all levels at the end of the fourth beat instead of when Compactions finish.

This bug is left unresolved in this PR until a good fix can be found, but the other changes are still worth merging in the meantime.

@jorangreef (Member):

Thanks for the awesome write-up @kprotty.

Comment on lines 105 to 121
if (stream_peek(it.context, root)) |key| {
    it.keys[0] = key;
    it.down_heap();
-} else {
-    it.swap(0, it.k - 1);
-    it.k -= 1;
-    it.down_heap();
+} else |err| switch (err) {
+    error.Pending => return null,
+    error.Empty => {
+        it.swap(0, it.k - 1);
+        it.k -= 1;
+        it.down_heap();
+    },
}
@sentientwaffle (Member) commented Oct 5, 2022:

This is related to a bug that we discussed. The value may not be available right now (in the Pending case), but when it becomes available, the stream may be in the wrong position in the heap. I think we need to move this whole heap-update block from after the stream_pop() to before the stream_pop(). (Note that this means the heap invariant of keys would need to be modified accordingly.)

@kprotty (Contributor, Author):

So Pending should do the same as the successful if |key| branch, to ensure it's peeked again?

Member:

It can't because it doesn't have a key to use.

I think we need to move the whole heap update immediately before we remove a value. e.g.

if (stream_peek(it.context, root)) |key| {
    it.keys[0] = key;
    it.down_heap();
-} else {
-    it.swap(0, it.k - 1);
-    it.k -= 1;
-    it.down_heap();
+} else |err| switch (err) {
+    error.Pending => return null,
+    error.Empty => {
+        it.swap(0, it.k - 1);
+        it.k -= 1;
+        it.down_heap();
+    },
}

must all move before

            const root = it.streams[0];
            // We know that each input iterator is sorted, so we don't need to compare the next
            // key on that iterator with the current min/max.
            const value = stream_pop(it.context, root);
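The suggested reordering can be sketched in Python. This is a simplified, hypothetical model of the k-way merge, not the actual Zig implementation: the root's key is refreshed with a fresh peek() and the heap invariant restored *before* a value is consumed, so a stream that was Pending on an earlier call is re-examined and sits at the correct heap position by the time it is popped.

```python
PENDING, EMPTY = object(), object()

class Stream:
    """Stand-in iterator: `on_disk` values need a tick() before they appear."""
    def __init__(self, buffered, on_disk):
        self.buffered, self.on_disk = list(buffered), list(on_disk)

    def peek(self):
        if self.buffered:
            return self.buffered[0]
        return EMPTY if not self.on_disk else PENDING

    def tick(self):  # simulate IO refilling the buffer
        self.buffered, self.on_disk = self.buffered + self.on_disk, []

class KWayMerge:
    def __init__(self, streams):
        # Assumes every stream starts with at least one buffered value
        # (Compaction guarantees this by ticking iterators before init()).
        self.streams = list(streams)
        self.keys = [s.peek() for s in self.streams]
        self.k = len(self.streams)
        for i in reversed(range(self.k)):  # heapify
            self.down_heap(i)

    def swap(self, i, j):
        self.streams[i], self.streams[j] = self.streams[j], self.streams[i]
        self.keys[i], self.keys[j] = self.keys[j], self.keys[i]

    def down_heap(self, i=0):
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < self.k and self.keys[child] < self.keys[smallest]:
                    smallest = child
            if smallest == i:
                return
            self.swap(i, smallest)
            i = smallest

    def pop(self):
        # Heap maintenance happens BEFORE the value is consumed: the root's
        # cached key may be stale from the previous pop, and the root stream
        # may meanwhile have become Pending or Empty.
        while self.k > 0:
            result = self.streams[0].peek()
            if result is PENDING:
                return None              # wait for IO; heap is untouched
            if result is EMPTY:
                self.swap(0, self.k - 1)
                self.k -= 1
                self.down_heap()
                continue
            self.keys[0] = result        # refresh the stale key...
            self.down_heap()             # ...and restore the invariant
            return self.streams[0].buffered.pop(0)  # now safe to consume
        return None
```

Because each input stream is sorted, a refreshed key can only be larger than the cached one, so a single down-heap from the root restores the invariant.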

@sentientwaffle (Member):

In addition, can you verify that this fixes #183, #186, and #188? When I checked those last week it looked like all 3 would be addressed by this PR.

@kprotty (Contributor, Author) commented Oct 17, 2022:

- #183 should be fixed by making Grid.read_block async, to avoid the reordered io_finish().
- #186 should be fixed by reporting error.Drained from the iterators and handling it throughout.
- #188 should be fixed by using buffered_all_values() in peek() (which performs the same assert), with that usage then fixed by #186.

The merge iterator tests are currently failing, I believe due to the heap-update changes (I think I implemented them incorrectly).

@sentientwaffle (Member) left a comment:

It is great to have these bugs fixed! Just a few notes, but it is most of the way there 🔥

grid.read_queue = .{};
while (copy.pop()) |pending_read| {
if (pending_read.address == read.address) {
assert(pending_read.checksum == read.checksum);
Member:

🤔 This is true right now, but I think once we have grid recovery it won't be: another replica that is far behind might ask us for an old block we no longer have, from an address that we are now using for some new data. Both of those reads might be queued up simultaneously.

I think we can still queue the reads up together, we just need to be careful how we handle them when the read completes to make sure that the correct read gets a response. (The incorrect read can be unreachable for now — implementing that will be part of grid recovery).
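The "queue the reads up together" idea might look like this. Hypothetical Python, not the Grid's actual API: the dict-based read records and the queue_read helper are illustrative.

```python
def queue_read(read_queue, read):
    """Queue `read`, coalescing it with an in-flight read of the same address.

    Returns True if the caller must start new IO, False if the read joined
    an existing one.
    """
    for pending in read_queue:
        if pending["address"] == read["address"]:
            # Until grid recovery lands, two queued reads of one address
            # must want the same block version.
            assert pending["checksum"] == read["checksum"]
            pending["waiters"].append(read["callback"])
            return False
    read["waiters"] = [read["callback"]]
    read_queue.append(read)
    return True
```

Once grid recovery allows two versions of one address to be requested concurrently, the assert would be replaced by routing each waiter the block matching its checksum when the IO completes.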

@kprotty (Contributor, Author):

Is the case of "an address that we are using for some new data" in grid.read_queue already handled by release_at_checkpoint(addr) calling assert_not_reading(addr)?

Member:

We might have to update release_at_checkpoint for grid repair as well. When we release the block, we know that we aren't reading it due to (e.g.) compaction or serving a prefetch(). But we can't predict when another replica might request a block from us.

@@ -212,7 +212,8 @@ pub fn LevelIteratorType(comptime Table: type, comptime Storage: type) type {
     fn next_table_iterator(it: *LevelIterator) *TableIterator {
         if (it.tables.full()) {
             const table = &it.tables.head_ptr().?.table_iterator;
-            while (table.peek() != null) {
+            while (true) {
+                _ = table.peek() catch break;
                 it.values.push(table.pop()) catch unreachable;
Member:
I know it wasn't touched by your PR, but could this use push_assume_capacity?

 };
-return scope.table_iterator.peek().?;
+return scope.table_iterator.peek() catch unreachable;
Member:

How are we guaranteed that peek() always has a value ready?

 };
-return scope.table_iterator.peek().?;
+return scope.table_iterator.peek();
 }

 /// This may only be called after peek() has returned non-null.
Member:

Please update this comment (and any others similarly out of date).

@sentientwaffle (Member) commented Oct 20, 2022:

Edit: Nevermind, I fixed the heap update issue and this seems to be resolved.

This branch is still running into crashes on lsm_forest_fuzz that appear related to iterator buffering, e.g.:

/home/djg/Code/tb/tb-lsm-iterators/src/lsm/level_iterator.zig:304:57: 0x2841bd in lsm.level_iterator.LevelIteratorType(lsm.table.TableType(u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,18446744073709551615,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone_from_key),test.storage.Storage).pop (lsm_forest_fuzz)
            const table_iterator = &it.tables.head_ptr().?.table_iterator;
                                                        ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/compaction.zig:99:51: 0x283e24 in lsm.compaction.MergeStreamSelector.pop (lsm_forest_fuzz)
                    1 => compaction.iterator_b.pop(),
                                                  ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/k_way_merge.zig:109:37: 0x480b29 in lsm.k_way_merge.KWayMergeIterator(lsm.compaction.CompactionType(lsm.table.TableType(u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,18446744073709551615,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone_from_key),test.storage.Storage,lsm.table_immutable.TableImmutableIteratorType),u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,2,lsm.compaction.MergeStreamSelector.peek,lsm.compaction.MergeStreamSelector.pop,lsm.compaction.MergeStreamSelector.precedence).pop_internal (lsm_forest_fuzz)
            const value = stream_pop(it.context, root);
                                    ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/k_way_merge.zig:88:35: 0x43dd72 in lsm.k_way_merge.KWayMergeIterator(lsm.compaction.CompactionType(lsm.table.TableType(u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,18446744073709551615,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone_from_key),test.storage.Storage,lsm.table_immutable.TableImmutableIteratorType),u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,2,lsm.compaction.MergeStreamSelector.peek,lsm.compaction.MergeStreamSelector.pop,lsm.compaction.MergeStreamSelector.precedence).pop (lsm_forest_fuzz)
            while (it.pop_internal()) |value| {
                                  ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/compaction.zig:432:49: 0x3f7406 in lsm.compaction.CompactionType(lsm.table.TableType(u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,18446744073709551615,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone_from_key),test.storage.Storage,lsm.table_immutable.TableImmutableIteratorType).cpu_merge (lsm_forest_fuzz)
                const value = merge_iterator.pop() orelse break;
                                                ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/compaction.zig:402:37: 0x3cf91e in lsm.compaction.CompactionType(lsm.table.TableType(u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,18446744073709551615,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone_from_key),test.storage.Storage,lsm.table_immutable.TableImmutableIteratorType).cpu_merge_start (lsm_forest_fuzz)
                compaction.cpu_merge();
                                    ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/compaction.zig:381:71: 0x3a55e5 in lsm.compaction.CompactionType(lsm.table.TableType(u64,tigerbeetle.Account,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).compare_keys,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).key_from_value,18446744073709551615,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone,lsm.groove.ObjectTreeHelpers(tigerbeetle.Account).tombstone_from_key),test.storage.Storage,lsm.table_immutable.TableImmutableIteratorType).io_finish (lsm_forest_fuzz)
            if (compaction.io_pending == 0) compaction.cpu_merge_start();
                                                                      ^
/home/djg/Code/tb/tb-lsm-iterators/src/lsm/compaction.zig:348:42: 0x3cf690 in lsm.compaction.write_callback.callback (lsm_forest_fuzz)
                    _compaction.io_finish();
                                         ^

Can you check it out? (The above is seed 12421653945172149209).

kprotty added 13 commits October 20, 2022 10:02
The TableIterator (and consequently the LevelIterator) could have `peek()` return null even if all values weren't yet buffered in memory. The issue is that a null `peek()` is treated as the iterator being empty.

The merge iterator's `stream_peek()` can now return a Pending state to differentiate this. Pending iterators can be discovered by checking `buffered_all_values()` when `peek()` returns null.
The iterator's tick() would complete inline and call io_finish() before returning true for Compaction to call io_start().

Now, io_start() is called before-hand. If tick() returns true, the io_finish() will be (or was) called by the iterator. If tick() returns false, the Compaction will resolve the io_finish() instead.
Now that Grid.read_block always completes asynchronously and never inline, the callers (compaction iterators) don't have to worry about out-of-order io_finish() calls.
Previously, you would use `buffered_all_values()` to know whether `peek()` returning null meant the iterator was empty, or still buffering values with nothing immediately to offer. Now this distinction is returned directly by `peek()`, reducing the chance of the footgun of assuming null means empty.

`buffered_all_values()` still exists and is pub, as the buffering aspect (rather than the peeking aspect) is still used by `tick()`.
Ensures that a stream which isn't ready (but also isn't empty) is not in the wrong heap position after pop().

This also ensures that streams which are still buffering are not passed into MergeIterator.init(). Compaction already guarantees this by only calling init() after ticking the relevant iterators for the merge stream.
"Buffering" isn't quite correct, as it implies there is currently IO happening to refill the values (that happens on a subsequent tick() instead).
The stream is now peeked again after a pop, but instead of returning null on Drained, it correctly returns the popped value.
Only check for the existence of a grid address when pushing to the read_cache, to be resolved on the next tick(). This avoids unnecessarily increasing the likelihood of keeping the block in cache.
@sentientwaffle (Member) left a comment:

🚀 Great work!
