Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce MergingRunData and MergingTreeData #592

Merged
merged 4 commits into from
Mar 7, 2025

Conversation

mheinzel
Copy link
Collaborator

@mheinzel mheinzel commented Feb 25, 2025

Description

Similarly to the existing RunData, these are simplified versions of MergingRun and MergingTree, which can be generated, shrunk and turned into their full on-disk versions.

I also added labelling and tests for the generators and shrinkers, as well as the code creating the MergingTree etc. from the data.

The tests are quite slow even after halving the maximum size. The main issue is that RunData has a lot of possible shrinks (sometimes over 100k), which gets even worse when there are dozens of them in the tree.

test output
  Test.Database.LSMTree.Generators
    RunData
      Arbitrary satisfies invariant:          OK (0.60s)
        +++ OK, passed 100 tests.
        
        run has large k/ops (100 in total):
        52% has large k/op
        48% no large k/op
        
        run length (100 in total):
        68% 10 <= n < 100
        26% 1 <= n < 10
         6% n == 0
        
        value size (1974 in total):
        53.80% 10 <= n < 100
        17.88% 100 <= n < 1000
        15.75% 1000 <= n < 10000
        11.65% 1 <= n < 10
         0.91% n == 0
      Shrinking satisfies invariant:          OK (6.67s)
        +++ OK, passed 100 tests (3% no shrinks).
        
        number of shrinks (97 in total):
        45% 10000 <= n < 100000
        32% 1000 <= n < 10000
        20% 100 <= n < 1000
         3% 10 <= n < 100
      withRun doesn't leak resources:         OK (0.65s)
        +++ OK, passed 100 tests.
    NonEmptyRunData
      Arbitrary satisfies invariant:          OK (0.69s)
        +++ OK, passed 100 tests.
        
        run has large k/ops (100 in total):
        57% has large k/op
        43% no large k/op
        
        run length (100 in total):
        67% 10 <= n < 100
        33% 1 <= n < 10
        
        value size (2475 in total):
        54.71% 10 <= n < 100
        18.83% 100 <= n < 1000
        15.27% 1000 <= n < 10000
        10.79% 1 <= n < 10
         0.40% n == 0
      Shrinking satisfies invariant:          OK (14.08s)
        +++ OK, passed 100 tests (1% no shrinks).
        
        number of shrinks (99 in total):
        45% 10000 <= n < 100000
        38% 1000 <= n < 10000
        12% 100 <= n < 1000
         2% 1 <= n < 10
         2% 10 <= n < 100
    MergingRunData
      Arbitrary satisfies invariant:          OK (0.56s)
        +++ OK, passed 100 tests.
        
        merge type (100 in total):
        50% MergeLastLevel
        50% MergeMidLevel
        
        merging run inputs (53 in total):
        55% 4 <= n < 8
        36% 2 <= n < 4
         9% 8 <= n < 16
        
        merging run state (100 in total):
        53% OngoingMerge
        47% CompletedMerge
        
        run has large k/ops (295 in total):
        75.3% no large k/op
        24.7% has large k/op
        
        run length (295 in total):
        74.6% 1 <= n < 10
        23.7% 10 <= n < 100
         1.7% n == 0
        
        value size (2138 in total):
        33.77% 10 <= n < 100
        30.17% 1 <= n < 10
        17.87% 100 <= n < 1000
        16.09% 1000 <= n < 10000
         2.10% n == 0
      Shrinking satisfies invariant:          OK (4.02s)
        +++ OK, passed 100 tests.
        
        number of shrinks (100 in total):
        47% 1000 <= n < 10000
        24% 100 <= n < 1000
        24% 10000 <= n < 100000
         3% 1 <= n < 10
         2% 10 <= n < 100
      withMergingRun doesn't leak resources:  OK (0.80s)
        +++ OK, passed 100 tests.
    MergingTreeData
      Arbitrary satisfies invariant:          OK (4.25s)
        +++ OK, passed 100 tests.
        
        merge type (593 in total):
        40.6% MergeMidLevel
        37.1% MergeLastLevel
        12.5% MergeLevel
         9.8% MergeUnion
        
        merging run inputs (278 in total):
        52.5% 4 <= n < 8
        33.1% 2 <= n < 4
        14.4% 8 <= n < 16
        
        merging run state (593 in total):
        53.1% CompletedMerge
        46.9% OngoingMerge
        
        merging tree state (604 in total):
        29.8% PendingLevelMerge
        25.0% PendingUnionMerge
        23.3% CompletedTreeMerge
        21.9% OngoingTreeMerge
        
        run has large k/ops (2214 in total):
        72.36% no large k/op
        27.64% has large k/op
        
        run length (2214 in total):
        70.28% 1 <= n < 10
        28.27% 10 <= n < 100
         1.45% n == 0
        
        tree depth (100 in total):
        51% 2 <= n < 4
        38% 4 <= n < 8
        11% 1 <= n < 2
        
        value size (16784 in total):
        37.786% 10 <= n < 100
        27.359% 1 <= n < 10
        17.165% 100 <= n < 1000
        15.473% 1000 <= n < 10000
         2.216% n == 0
      Shrinking satisfies invariant:          OK (22.70s)
        +++ OK, passed 100 tests (2% no shrinks).
        
        number of shrinks (98 in total):
        39% 10000 <= n < 100000
        24% 1000 <= n < 10000
        21% 100000 <= n < 1000000
        13% 100 <= n < 1000
         1% 1 <= n < 10
         1% 10 <= n < 100
      withMergingTree doesn't leak resources: OK (6.39s)
        +++ OK, passed 100 tests.

@mheinzel mheinzel force-pushed the mheinzel/merging-tree-data branch 3 times, most recently from bff20b9 to 7188e2a Compare February 26, 2025 13:12
@jorisdral
Copy link
Collaborator

The tests are quite slow ...

"the tests" meaning the generator invariant tests, I take it?

@mheinzel
Copy link
Collaborator Author

"the tests" meaning the generator invariant tests, I take it?

Mainly all the shrinker invariant tests, but for MergingTreeData in particular, even just generating them is kind of slow (see the test output in the PR description).

Instead of decreasing the size of these particular tests, I should probably change the instance to generate smaller trees, since any property on trees will otherwise be quite slow.

Copy link
Collaborator

@jorisdral jorisdral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks useful!

The tests are quite slow even after halving the maximum size. The main issue is that RunData has a lot of possible shrinks (sometimes over 100k), which gets even worse when there are dozens of them in the tree.

Normally with shrinking test cases it's not extremely important to not generate too many shrinks. Rather, the ordering of shrunk values is more important, so that we don't evaluate too many before we find the next, smaller, failing test case. But yeah, it would be nice to reduce the number of shrink values for these "arbitrary/shrinking preserve invariant" tests. I've commented elsewhere that we could maybe reduce the tree size by generating fewer children in the merging tree

Comment on lines 252 to 269
= QC.frequency
[ (1, do
-- pending level merge without child
preExisting <- QC.vectorOf (n - 1) $ -- 1 for constructor itself
genPreExistingRunData genKey genVal genBlob
return (PendingLevelMergeData preExisting Nothing))
, (1, do
-- pending level merge with child
numPreExisting <- QC.chooseInt (0, min 20 (n - 2))
preExisting <- QC.vectorOf numPreExisting $
genPreExistingRunData genKey genVal genBlob
tree <- go (n - numPreExisting - 1)
return (PendingLevelMergeData preExisting (Just tree)))
, (2, do
-- pending union merge
ns <- QC.shuffle =<< arbitraryPartition2 n
PendingUnionMergeData <$> traverse go ns)
]
Copy link
Collaborator

@jorisdral jorisdral Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the number of children: would we test most interesting cases with 2 or 3 children?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it does indeed matter and I landed on distribution between 0 and 6 pre-existing run, but skewing towards smaller numbers.

Copy link
Collaborator Author

@mheinzel mheinzel Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about the chooseIntSkewed (0, 6) I went with?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections.

Out of interest, do you feel there are interesting test cases where we'd need more than 2 or 3 runs? In my mind, 4 or more runs wouldn't do anything more interesting than 2 or 3 runs would do. It's more data mostly, and having fewer runs could speed tests down the line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's anything interesting about having more than 3 runs, but it can make some cases more likely, like having the same key in more than 2 runs. I can play around with it once more to see how much it affects test runtime.

Copy link
Collaborator Author

@mheinzel mheinzel Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has a noticeable, but relatively small effect. Changing the range to (0, 3) lead to about 25% faster tests. I decided to generate 50% shorter runs instead (see last commit), more than halving the runtime. Not sure it was necessary, but ~17s for all MergingTreeData tests seems more tolerable to me than ~35s.

@mheinzel mheinzel force-pushed the mheinzel/merging-tree-data branch 3 times, most recently from 352840f to bff6984 Compare March 6, 2025 11:22
@mheinzel mheinzel mentioned this pull request Mar 6, 2025
@mheinzel mheinzel force-pushed the mheinzel/merging-tree-data branch from bff6984 to 442e8a8 Compare March 6, 2025 16:52
Copy link
Collaborator

@jorisdral jorisdral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines +74 to +76
unsafeCreateMergingTree hfs hbio resolve indexType path counter = go
where
go = \case
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can inline the go

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can inline it, but then I need to pass all the arguments to each recursive call, which is kind of noisy and often requires a line break.

Comment on lines 252 to 269
= QC.frequency
[ (1, do
-- pending level merge without child
preExisting <- QC.vectorOf (n - 1) $ -- 1 for constructor itself
genPreExistingRunData genKey genVal genBlob
return (PendingLevelMergeData preExisting Nothing))
, (1, do
-- pending level merge with child
numPreExisting <- QC.chooseInt (0, min 20 (n - 2))
preExisting <- QC.vectorOf numPreExisting $
genPreExistingRunData genKey genVal genBlob
tree <- go (n - numPreExisting - 1)
return (PendingLevelMergeData preExisting (Just tree)))
, (2, do
-- pending union merge
ns <- QC.shuffle =<< arbitraryPartition2 n
PendingUnionMergeData <$> traverse go ns)
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections.

Out of interest, do you feel there are interesting test cases where we'd need more than 2 or 3 runs? In my mind, 4 or more runs wouldn't do anything more interesting than 2 or 3 runs would do. It's more data mostly, and having fewer runs could speed tests down the line

mheinzel added 4 commits March 7, 2025 14:39
This keeps the runtime of the shrinker tests around 10s with only a
small impact on the distribution of generated tests. There is still a
decent amount of somewhat long runs (max size is now 50, not 100).

Before:

        run length (1249 in total):
        63.23% 10 <= n < 100
        36.17% 1 <= n < 10
         0.59% n == 0

After:

        run length (1181 in total):
        51.82% 10 <= n < 100
        45.39% 1 <= n < 10
         2.79% n == 0
@mheinzel mheinzel force-pushed the mheinzel/merging-tree-data branch from 442e8a8 to d78f451 Compare March 7, 2025 13:45
@mheinzel mheinzel enabled auto-merge March 7, 2025 13:49
@mheinzel mheinzel added this pull request to the merge queue Mar 7, 2025
Merged via the queue into main with commit 8f3ca3c Mar 7, 2025
27 checks passed
@mheinzel mheinzel deleted the mheinzel/merging-tree-data branch March 7, 2025 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants