Introduce MergingRunData and MergingTreeData #592

mheinzel · 2025-02-25T22:16:29Z

Description

Similarly to the existing RunData, these are simplified versions of MergingRun and MergingTree, which can be generated, shrunk and turned into their full on-disk versions.

I also added labelling and tests for the generators and shrinkers, as well as the code creating the MergingTree etc. from the data.

The tests are quite slow even after halving the maximum size. The main issue is that RunData has a lot of possible shrinks (sometimes over 100k), which gets even worse when there are dozens of them in the tree.

test output

  Test.Database.LSMTree.Generators
    RunData
      Arbitrary satisfies invariant:          OK (0.60s)
        +++ OK, passed 100 tests.
        
        run has large k/ops (100 in total):
        52% has large k/op
        48% no large k/op
        
        run length (100 in total):
        68% 10 <= n < 100
        26% 1 <= n < 10
         6% n == 0
        
        value size (1974 in total):
        53.80% 10 <= n < 100
        17.88% 100 <= n < 1000
        15.75% 1000 <= n < 10000
        11.65% 1 <= n < 10
         0.91% n == 0
      Shrinking satisfies invariant:          OK (6.67s)
        +++ OK, passed 100 tests (3% no shrinks).
        
        number of shrinks (97 in total):
        45% 10000 <= n < 100000
        32% 1000 <= n < 10000
        20% 100 <= n < 1000
         3% 10 <= n < 100
      withRun doesn't leak resources:         OK (0.65s)
        +++ OK, passed 100 tests.
    NonEmptyRunData
      Arbitrary satisfies invariant:          OK (0.69s)
        +++ OK, passed 100 tests.
        
        run has large k/ops (100 in total):
        57% has large k/op
        43% no large k/op
        
        run length (100 in total):
        67% 10 <= n < 100
        33% 1 <= n < 10
        
        value size (2475 in total):
        54.71% 10 <= n < 100
        18.83% 100 <= n < 1000
        15.27% 1000 <= n < 10000
        10.79% 1 <= n < 10
         0.40% n == 0
      Shrinking satisfies invariant:          OK (14.08s)
        +++ OK, passed 100 tests (1% no shrinks).
        
        number of shrinks (99 in total):
        45% 10000 <= n < 100000
        38% 1000 <= n < 10000
        12% 100 <= n < 1000
         2% 1 <= n < 10
         2% 10 <= n < 100
    MergingRunData
      Arbitrary satisfies invariant:          OK (0.56s)
        +++ OK, passed 100 tests.
        
        merge type (100 in total):
        50% MergeLastLevel
        50% MergeMidLevel
        
        merging run inputs (53 in total):
        55% 4 <= n < 8
        36% 2 <= n < 4
         9% 8 <= n < 16
        
        merging run state (100 in total):
        53% OngoingMerge
        47% CompletedMerge
        
        run has large k/ops (295 in total):
        75.3% no large k/op
        24.7% has large k/op
        
        run length (295 in total):
        74.6% 1 <= n < 10
        23.7% 10 <= n < 100
         1.7% n == 0
        
        value size (2138 in total):
        33.77% 10 <= n < 100
        30.17% 1 <= n < 10
        17.87% 100 <= n < 1000
        16.09% 1000 <= n < 10000
         2.10% n == 0
      Shrinking satisfies invariant:          OK (4.02s)
        +++ OK, passed 100 tests.
        
        number of shrinks (100 in total):
        47% 1000 <= n < 10000
        24% 100 <= n < 1000
        24% 10000 <= n < 100000
         3% 1 <= n < 10
         2% 10 <= n < 100
      withMergingRun doesn't leak resources:  OK (0.80s)
        +++ OK, passed 100 tests.
    MergingTreeData
      Arbitrary satisfies invariant:          OK (4.25s)
        +++ OK, passed 100 tests.
        
        merge type (593 in total):
        40.6% MergeMidLevel
        37.1% MergeLastLevel
        12.5% MergeLevel
         9.8% MergeUnion
        
        merging run inputs (278 in total):
        52.5% 4 <= n < 8
        33.1% 2 <= n < 4
        14.4% 8 <= n < 16
        
        merging run state (593 in total):
        53.1% CompletedMerge
        46.9% OngoingMerge
        
        merging tree state (604 in total):
        29.8% PendingLevelMerge
        25.0% PendingUnionMerge
        23.3% CompletedTreeMerge
        21.9% OngoingTreeMerge
        
        run has large k/ops (2214 in total):
        72.36% no large k/op
        27.64% has large k/op
        
        run length (2214 in total):
        70.28% 1 <= n < 10
        28.27% 10 <= n < 100
         1.45% n == 0
        
        tree depth (100 in total):
        51% 2 <= n < 4
        38% 4 <= n < 8
        11% 1 <= n < 2
        
        value size (16784 in total):
        37.786% 10 <= n < 100
        27.359% 1 <= n < 10
        17.165% 100 <= n < 1000
        15.473% 1000 <= n < 10000
         2.216% n == 0
      Shrinking satisfies invariant:          OK (22.70s)
        +++ OK, passed 100 tests (2% no shrinks).
        
        number of shrinks (98 in total):
        39% 10000 <= n < 100000
        24% 1000 <= n < 10000
        21% 100000 <= n < 1000000
        13% 100 <= n < 1000
         1% 1 <= n < 10
         1% 10 <= n < 100
      withMergingTree doesn't leak resources: OK (6.39s)
        +++ OK, passed 100 tests.

jorisdral · 2025-02-27T09:39:35Z

The tests are quite slow ...

"the tests" meaning the generator invariant tests, I take it?

mheinzel · 2025-02-27T10:42:01Z

"the tests" meaning the generator invariant tests, I take it?

Mainly all the shrinker invariant tests, but for MergingTreeData in particular, even just generating them is kind of slow (see the test output in the PR description).

Instead of decreasing the size of these particular tests, I should probably change the instance to generate smaller trees, since any property on trees will otherwise be quite slow.

jorisdral

Looks useful!

The tests are quite slow even after halving the maximum size. The main issue is that RunData has a lot of possible shrinks (sometimes over 100k), which gets even worse when there are dozens of them in the tree.

Normally with shrinking test cases it's not extremely important to not generate too many shrinks. Rather, the ordering of shrunk values is more important, so that we don't evaluate too many before we find the next, smaller, failing test case. But yeah, it would be nice to reduce the number of shrink values for these "arbitrary/shrinking preserve invariant" tests. I've commented elsewhere that we could maybe reduce the tree size by generating fewer children in the merging tree

test/Test/Util/Arbitrary.hs

src-extras/Database/LSMTree/Extras/RunData.hs

jorisdral · 2025-02-27T10:39:54Z

src-extras/Database/LSMTree/Extras/MergingTreeData.hs

+      = QC.frequency
+          [ (1, do
+              -- pending level merge without child
+              preExisting <- QC.vectorOf (n - 1) $  -- 1 for constructor itself
+                genPreExistingRunData genKey genVal genBlob
+              return (PendingLevelMergeData preExisting Nothing))
+          , (1, do
+              -- pending level merge with child
+              numPreExisting <- QC.chooseInt (0, min 20 (n - 2))
+              preExisting <- QC.vectorOf numPreExisting $
+                genPreExistingRunData genKey genVal genBlob
+              tree <- go (n - numPreExisting - 1)
+              return (PendingLevelMergeData preExisting (Just tree)))
+          , (2, do
+              -- pending union merge
+              ns <- QC.shuffle =<< arbitraryPartition2 n
+              PendingUnionMergeData <$> traverse go ns)
+          ]


Same comment about the number of children: would we test most interesting cases with 2 or 3 children?

Here it does indeed matter and I landed on distribution between 0 and 6 pre-existing run, but skewing towards smaller numbers.

How do you feel about the chooseIntSkewed (0, 6) I went with?

No objections.

Out of interest, do you feel there are interesting test cases where we'd need more than 2 or 3 runs? In my mind, 4 or more runs wouldn't do anything more interesting than 2 or 3 runs would do. It's more data mostly, and having fewer runs could speed tests down the line

I don't think there's anything interesting about having more than 3 runs, but it can make some cases more likely, like having the same key in more than 2 runs. I can play around with it once more to see how much it affects test runtime.

It has a noticeable, but relatively small effect. Changing the range to (0, 3) lead to about 25% faster tests. I decided to generate 50% shorter runs instead (see last commit), more than halving the runtime. Not sure it was necessary, but ~17s for all MergingTreeData tests seems more tolerable to me than ~35s.

src-extras/Database/LSMTree/Extras/MergingTreeData.hs

test/Test/Database/LSMTree/Generators.hs

jorisdral

LGTM!

jorisdral · 2025-03-07T09:27:10Z

src-extras/Database/LSMTree/Extras/MergingTreeData.hs

+unsafeCreateMergingTree hfs hbio resolve indexType path counter = go
+  where
+    go = \case


You can inline the go

I can inline it, but then I need to pass all the arguments to each recursive call, which is kind of noisy and often requires a line break.

jorisdral · 2025-03-07T09:31:50Z

src-extras/Database/LSMTree/Extras/MergingTreeData.hs

+      = QC.frequency
+          [ (1, do
+              -- pending level merge without child
+              preExisting <- QC.vectorOf (n - 1) $  -- 1 for constructor itself
+                genPreExistingRunData genKey genVal genBlob
+              return (PendingLevelMergeData preExisting Nothing))
+          , (1, do
+              -- pending level merge with child
+              numPreExisting <- QC.chooseInt (0, min 20 (n - 2))
+              preExisting <- QC.vectorOf numPreExisting $
+                genPreExistingRunData genKey genVal genBlob
+              tree <- go (n - numPreExisting - 1)
+              return (PendingLevelMergeData preExisting (Just tree)))
+          , (2, do
+              -- pending union merge
+              ns <- QC.shuffle =<< arbitraryPartition2 n
+              PendingUnionMergeData <$> traverse go ns)
+          ]


No objections.

Out of interest, do you feel there are interesting test cases where we'd need more than 2 or 3 runs? In my mind, 4 or more runs wouldn't do anything more interesting than 2 or 3 runs would do. It's more data mostly, and having fewer runs could speed tests down the line

src-extras/Database/LSMTree/Extras/RunData.hs

test/Test/Database/LSMTree/Generators.hs

src-extras/Database/LSMTree/Extras/RunData.hs

This keeps the runtime of the shrinker tests around 10s with only a small impact on the distribution of generated tests. There is still a decent amount of somewhat long runs (max size is now 50, not 100). Before: run length (1249 in total): 63.23% 10 <= n < 100 36.17% 1 <= n < 10 0.59% n == 0 After: run length (1181 in total): 51.82% 10 <= n < 100 45.39% 1 <= n < 10 2.79% n == 0

mheinzel requested review from dcoutts, jorisdral, recursion-ninja and wenkokke as code owners February 25, 2025 22:16

mheinzel force-pushed the mheinzel/merging-tree-data branch 3 times, most recently from bff20b9 to 7188e2a Compare February 26, 2025 13:12

jorisdral requested changes Feb 27, 2025

View reviewed changes

mheinzel mentioned this pull request Mar 3, 2025

Improvements to generators and shrinkers #604

Merged

mheinzel force-pushed the mheinzel/merging-tree-data branch 3 times, most recently from 352840f to bff6984 Compare March 6, 2025 11:22

mheinzel mentioned this pull request Mar 6, 2025

Lookups in merging tree #609

Draft

mheinzel force-pushed the mheinzel/merging-tree-data branch from bff6984 to 442e8a8 Compare March 6, 2025 16:52

jorisdral approved these changes Mar 7, 2025

View reviewed changes

mheinzel added 4 commits March 7, 2025 14:39

create MergingRunData and MergingTreeData

b6f81fd

test that withRun/withMergingRun/... don't leak resources

1cec2a1

also generate structurally empty trees

7a24768

mheinzel force-pushed the mheinzel/merging-tree-data branch from 442e8a8 to d78f451 Compare March 7, 2025 13:45

mheinzel enabled auto-merge March 7, 2025 13:49

mheinzel added this pull request to the merge queue Mar 7, 2025

Merged via the queue into main with commit 8f3ca3c Mar 7, 2025
27 checks passed

mheinzel deleted the mheinzel/merging-tree-data branch March 7, 2025 14:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce MergingRunData and MergingTreeData #592

Introduce MergingRunData and MergingTreeData #592

mheinzel commented Feb 25, 2025 •

edited

Loading

jorisdral commented Feb 27, 2025

mheinzel commented Feb 27, 2025

jorisdral left a comment •

edited

Loading

jorisdral Feb 27, 2025 •

edited

Loading

mheinzel Mar 3, 2025

mheinzel Mar 6, 2025 •

edited

Loading

jorisdral Mar 7, 2025

mheinzel Mar 7, 2025

mheinzel Mar 7, 2025 •

edited

Loading

jorisdral left a comment

jorisdral Mar 7, 2025

mheinzel Mar 7, 2025

jorisdral Mar 7, 2025

Introduce MergingRunData and MergingTreeData #592

Introduce MergingRunData and MergingTreeData #592

Conversation

mheinzel commented Feb 25, 2025 • edited Loading

Description

jorisdral commented Feb 27, 2025

mheinzel commented Feb 27, 2025

jorisdral left a comment • edited Loading

Choose a reason for hiding this comment

jorisdral Feb 27, 2025 • edited Loading

Choose a reason for hiding this comment

mheinzel Mar 3, 2025

Choose a reason for hiding this comment

mheinzel Mar 6, 2025 • edited Loading

Choose a reason for hiding this comment

jorisdral Mar 7, 2025

Choose a reason for hiding this comment

mheinzel Mar 7, 2025

Choose a reason for hiding this comment

mheinzel Mar 7, 2025 • edited Loading

Choose a reason for hiding this comment

jorisdral left a comment

Choose a reason for hiding this comment

jorisdral Mar 7, 2025

Choose a reason for hiding this comment

mheinzel Mar 7, 2025

Choose a reason for hiding this comment

jorisdral Mar 7, 2025

Choose a reason for hiding this comment

mheinzel commented Feb 25, 2025 •

edited

Loading

jorisdral left a comment •

edited

Loading

jorisdral Feb 27, 2025 •

edited

Loading

mheinzel Mar 6, 2025 •

edited

Loading

mheinzel Mar 7, 2025 •

edited

Loading