fix(level): change split key range right key to use ts=0 #1932
Conversation
Hi @hugy718, thank you for the PR. Do you think it is possible to add a test for this?
@hugy718 I am wondering whether it is easy to add a test that shows that things could go wrong without this change and that this change fixes the issue. I can also look into it myself if you think it's not easy to come up with a test.
Actually, nothing would go wrong without this change. It only improves the implementation such that keys in the same bottom table will not span two splits, which is what the old code comments there suggested.
Yes, you are right. I will let @harshil-goel review it and merge it. Thanks for the PR again.
@hugy718 Thanks a lot for the PR. I went through the logic and the old PR: https://github.com/dgraph-io/badger/pull/1587/files. From what I understand, what's happening is that […]
Is my understanding correct?
C3 is placed before C2, since the timestamp is encoded as `math.MaxUint64` minus ts. The old approach is very unlikely to give meaningful extra parallelism, since it only affects the records of the same key. (Actually, there is no extra parallelism at all because of the if statement below: the placement of the records of the last user key of a table does not affect the number of splits. For the last table, the condition is true and the whole table is placed in the last split, whose right boundary is empty. For the tables in between, those tail records are simply processed in the next split, which contains the next table.) However, the old approach places a bottom table into two splits. The old example is actually not a good illustration of the resulting splits, since the if statement below would return just one split. (Lines 1071 to 1074 in 9afd0a0)
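To make this ordering concrete, here is a minimal, self-contained sketch (a toy re-implementation for illustration, not badger's actual `y` package code) of the `KeyWithTs`-style encoding: the timestamp is appended as `math.MaxUint64 - ts`, so a version with a larger ts compares as smaller.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"math"
)

// keyWithTs mimics the idea of y.KeyWithTs: the timestamp is appended as
// math.MaxUint64-ts in big-endian, so higher timestamps sort first.
func keyWithTs(key []byte, ts uint64) []byte {
	out := make([]byte, len(key)+8)
	copy(out, key)
	binary.BigEndian.PutUint64(out[len(key):], math.MaxUint64-ts)
	return out
}

// compareKeys mimics the idea of y.CompareKeys: compare user keys first,
// then the encoded timestamp suffix.
func compareKeys(a, b []byte) int {
	if c := bytes.Compare(a[:len(a)-8], b[:len(b)-8]); c != 0 {
		return c
	}
	return bytes.Compare(a[len(a)-8:], b[len(b)-8:])
}

func main() {
	c2 := keyWithTs([]byte("C"), 2)
	c3 := keyWithTs([]byte("C"), 3)
	fmt.Println(compareKeys(c3, c2) < 0) // true: C3 sorts before C2

	// A right boundary built with ts=MaxUint64 sorts before every stored
	// version of C, while ts=0 sorts after all of them.
	boundaryOld := keyWithTs([]byte("C"), math.MaxUint64)
	boundaryNew := keyWithTs([]byte("C"), 0)
	fmt.Println(compareKeys(c2, boundaryOld) > 0) // true: C2 falls outside a range ending at boundaryOld
	fmt.Println(compareKeys(c2, boundaryNew) < 0) // true: C2 falls inside a range ending at boundaryNew
}
```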
If we consider a compaction that truly uses multi-split subcompaction, there are many more bottom tables. Consider the examples in the comment at L1049. (Lines 1049 to 1054 in 9afd0a0)
Again, the change is unlikely to affect performance and does not affect correctness. It is just that placing a bottom table in only one split is what the logic was designed to do.
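To illustrate the placement point with a runnable toy model (my own sketch under the assumptions discussed above: the right boundary of a split is exclusive and is built from the biggest user key of a bottom table; the names `rec`, `less`, and `firstSplit` are hypothetical, not badger identifiers):

```go
package main

import (
	"fmt"
	"math"
)

// rec is a toy record: a user key plus a version timestamp. Since badger
// encodes the timestamp as math.MaxUint64-ts, versions of the same user key
// with a HIGHER ts sort first; less models that ordering directly.
type rec struct {
	key string
	ts  uint64
}

func less(a, b rec) bool {
	if a.key != b.key {
		return a.key < b.key
	}
	return math.MaxUint64-a.ts < math.MaxUint64-b.ts
}

// firstSplit counts how many leading records fall strictly below the right
// boundary, i.e. how many are processed by the first split; the remainder is
// left for the next split.
func firstSplit(sorted []rec, right rec) int {
	n := 0
	for _, r := range sorted {
		if !less(r, right) {
			break
		}
		n++
	}
	return n
}

func main() {
	// Iteration order of the records: B1, then C3 before C2, then D1.
	sorted := []rec{{"B", 1}, {"C", 3}, {"C", 2}, {"D", 1}}

	// The split boundary is derived from the biggest user key ("C") of a bottom table.
	oldRight := rec{"C", math.MaxUint64} // old: sorts before every stored version of C
	newRight := rec{"C", 0}              // new: sorts after every stored version of C

	fmt.Println(firstSplit(sorted, oldRight)) // 1: C3 and C2 spill into the next split
	fmt.Println(firstSplit(sorted, newRight)) // 3: C3 and C2 stay in the first split
}
```

With the old boundary only B1 lands in the first split and both versions of C move to the next one; with ts=0 all versions of C stay with the split that covers the bottom table containing them.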
From what I understand, you are saying that a bottom table is only in one subcompaction job. However, with this change, if the key is split across multiple tables (C2 in table 3, C3 in table 4, C4 in the top level), then table 4 would be in two different subcompaction jobs, right? If we take the 2nd example, t2 might be present in table 3, so your splits would also look like [t0, t1, t2, t3], [t3, t4, t5], [t6, t7, t8], [t8, t9]. It doesn't make sense to merge the change if it doesn't provide any performance / correctness benefit. However, this diff made us go deeper into some topics. You have done really great work in understanding badger. We would be happy if you just want to change the comment to make it more understandable / correct.
All the tables mentioned in the second example are bottom tables. Performance-wise, there is unlikely to be an improvement, because the block cache is used most of the time and the last block of a bottom table referenced by two splits (processed by two separate goroutines) is likely in-cache when the slower split needs it. But when the block cache is disabled, placing that block into one split avoids both goroutines fetching it from storage, saving one IO; that is not significant, since a compaction accesses hundreds of blocks. If you still feel that it is easier to keep the old way, I will revert and only correct the code comments describing the old way.
It makes sense that we don't have keys spilling over to new tables. In that case it would make sense to do the changes. Thanks a lot for the diff, I have accepted it.
Problem
This old implementation is not incorrect, but it does not follow the logic in its description. When a compaction is split into subcompaction jobs, each of them builds an iterator over all tables touched by the compaction and iterates over a certain `keyRange` defined in `addSplits`. Together, these ranges cover the key range of all the tables in the compaction. The right key of a key range is intended to be the right key of a bottom-level table (`cd.bot`) and, as described in the comments, to include all the different versions of that key. However, using `math.MaxUint64` excludes those versions from this `keyRange` and pushes them into the next split. The reason is that the timestamp is encoded as `math.MaxUint64-ts` in `y.KeyWithTs`, so a key with a larger ts is actually smaller under `y.CompareKeys`. This can be corrected by using ts=0. Note that even with ts=math.MaxUint64 no keys are dropped, unlike what the comments above suggest, because those keys are covered by the subsequent split, and the last split has an empty right key, iterating to the end of the tables being compacted.
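For the boundary semantics relied on above, a minimal sketch (my paraphrase of the behaviour described in this section, with a hypothetical `splitKeys` helper, not badger code; it assumes an exclusive right key and treats an empty right key as "iterate to the end"):

```go
package main

import (
	"bytes"
	"fmt"
)

// splitKeys sketches how a split consumes keys: it takes keys while they
// compare strictly below its right key; an empty right key (the last split)
// means it runs to the end. Keys are shown in already-encoded, sorted form;
// the timestamp suffix is omitted here for brevity.
func splitKeys(sorted [][]byte, right []byte) (in, next [][]byte) {
	for i, k := range sorted {
		if len(right) > 0 && bytes.Compare(k, right) >= 0 {
			return sorted[:i], sorted[i:]
		}
	}
	return sorted, nil
}

func main() {
	sorted := [][]byte{[]byte("a"), []byte("b"), []byte("c")}

	in, next := splitKeys(sorted, []byte("c")) // right key is exclusive
	fmt.Println(len(in), len(next))            // 2 1

	in, next = splitKeys(sorted, nil) // empty right key: iterate to the end
	fmt.Println(len(in), len(next))   // 3 0
}
```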
Solution
Changed the timestamp used for the split key range's right key from `math.MaxUint64` to `0`, and updated the comments above it.