-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add portable Roaring bitmap for row positions #11372
Conversation
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Show resolved
Hide resolved
f168c3a
to
dfbba84
Compare
* @param posStartInclusive the start position of the range (inclusive) | ||
* @param posEndExclusive the end position of the range (exclusive) | ||
*/ | ||
public void addRange(long posStartInclusive, long posEndExclusive) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be optimized in the future but we have no use case in the repo beyond tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this package-private then? This has implications for run encoding in the bitmaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our public PositionDeleteIndex
where we will eventually consume this class actually supports marking ranges as removed. Therefore, it is needed, but we shouldn't be concerned optimizing this right away.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
import org.roaringbitmap.RoaringBitmap; | ||
|
||
/** | ||
* A bitmap that supports positive 64-bit positions, but is optimized for cases where most positions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Credit to @rdblue for most of this description copied from the spec PRs for DVs.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
* @see <a href="https://github.com/RoaringBitmap/RoaringFormatSpe">Roaring bitmap spec</a> | ||
*/ | ||
public void serialize(ByteBuffer buffer) { | ||
validateByteOrder(buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not duplicate the ByteBuffer, set its endianness, and then update the buffer's position to match the duplicate after serialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do similar stuff in the caller, it is more like a sanity check at this point, which also follows the Roaring lib API.
} | ||
|
||
// extracts high 32 bits from a 64-bit position | ||
private static int highBytes(long pos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keyBytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree here, since you need to read the javadoc to know that these are the key bytes, so might be worth renaming the method to indicate that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just key
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that input in many methods is called position
and represents 64-bit positions. If we want to rename this method to key
(which I agree makes sense), we will have to consider renaming lowBytes
too, which I don't have a very good alternative for.
Let me think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed to "key" and "value". Please, take a look, @RussellSpitzer @nastra @amogh-jahagirdar.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
/** | ||
* Adds a position to the bitmap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should use "add", because in my mind this is more of a "set"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used add
/contains
from the Roaring lib but open to any names.
What would be contains
equivalent if we go with set
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have super strong opinions on this, but I always kind of think of "add" for Sets and "set" for Bitmaps. I know we are kind of doing the same things but at least mentally in my head it's not as if we are actually "adding" more like we are just flipping a bit.
I think contains would still probably be fine?
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Show resolved
Hide resolved
* | ||
* @param that the other bitmap | ||
*/ | ||
public void or(RoaringPositionBitmap that) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return this for chaining? I'm not sure what our standard is for this sort of API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To highlight it modifies the bitmap in place, not a strong opinion. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way is ok with me, I know in most of our public APIs we use the functional approach even for modify in place.
* | ||
* @return whether the bitmap was compressed | ||
*/ | ||
public boolean runOptimize() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we want to expose Roaring bitmap apis like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use this while serializing BitmapPositionDeleteIndex
in the main PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to just call this compress
or runLengthEncode
? I'm OK either way, i see roaring bit map calls it runOptimize
, just feels like too broad of a name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like compress
, I'll rename.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/deletes/TestRoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/deletes/TestRoaringPositionBitmap.java
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/deletes/TestRoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/deletes/TestRoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
position(1 /* bitmap */, 2 /* container */, 1L), | ||
position(1 /* bitmap */, 2 /* container */, CONTAINER_OFFSET - 1)); | ||
|
||
assertThat(bitmap.runOptimize()).as("Bitmap must be RLE encoded").isTrue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should optimize be run as part of serialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I call that in PositionDeleteIndex
, not sure about the bitmap itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we can't do that. The caller may have called serializedSizeInBytes()
to estimate the size prior to the serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks correct overall to me. I have a few suggestions for minor things, but it looks like this is about ready.
|
||
private void allocateBitmapsIfNeeded(int requiredLength) { | ||
if (bitmaps.length < requiredLength) { | ||
if (bitmaps.length == 0 && requiredLength == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a special case for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply optimizes the most frequent use case.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
import org.apache.iceberg.relocated.com.google.common.io.Resources; | ||
import org.junit.jupiter.api.Test; | ||
|
||
public class TestRoaringPositionBitmap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a good place to do some property based testing since we already have a class that matches the behavior we want ("HashSet"). This way we can hit more than a select few positions and verify the behaviors pretty simply I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add some fuzz testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some fuzz tests and included seed information in the test arguments. If anything fails, we will have a way to reproduce the issue reliably.
core/src/test/java/org/apache/iceberg/deletes/TestRoaringPositionBitmap.java
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/deletes/TestRoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
private static int readKey(ByteBuffer buffer, int lastKey) { | ||
int key = buffer.getInt(); | ||
Preconditions.checkArgument(key >= 0, "Invalid unsigned key: %s", key); | ||
Preconditions.checkArgument(key <= Integer.MAX_VALUE - 1, "Key is too large: %s", key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this and the precondition right below it isn't examined in the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Covering this would be fairly involved as this logic is part of deserialization and we don't produce such files.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a couple good outstanding ideas for clarifying method names, but I'm ok with whatever is decided.
} | ||
|
||
@TestTemplate | ||
public void testRandomSparseBitmap() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer, here we have 3 tests with random values that are compared against Set<Long>
.
core/src/main/java/org/apache/iceberg/deletes/RoaringPositionBitmap.java
Outdated
Show resolved
Hide resolved
return Resources.toByteArray(resource); | ||
} | ||
|
||
private static void assertEqual(RoaringPositionBitmap bitmap, Set<Long> positions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really 3 asserts
assertEqual()
assertEqualAfterSerDE()
assertEqualAfterCompress()
, but it's probably fine to combine them since the debugger will point to which block we are in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a common method for the checks.
* | ||
* @param pos the position | ||
*/ | ||
public void set(long pos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer, I switched to set
instead of add
. It matches BitSet
in Java.
I am going to merge this one as it was thoroughly reviewed. There is no need to include it in 1.7, though (no harm too). Thanks for reviewing, @rdblue @RussellSpitzer @amogh-jahagirdar @nastra @danielcweeks! |
This PR adds a Roaring bitmap implementation that follows the official "portable" serialization format from the library for row positions in preparation for V3 position deletes. We are still discussing the blob layout for DVs on the dev list but we agree we should serialize the bitmap according to the portable serialization spec.
We cannot rely on the
Roaring64Bitmap
from the library as it has undefined serialization behavior:Here are benchmarks comparing this implementation to the one from the library we use today.
This is not a spec change but part of #11122.