Various fixes for large columns. #17691
Conversation
This patch fixes a class of bugs where various primitive column readers were not providing a SmooshedFileMapper to GenericIndexed, even though the corresponding writer could potentially write multi-file columns. For example, apache#7943 is an instance of this bug.

This patch also includes a fix for an issue on the writer for compressed multi-value string columns, V3CompressedVSizeColumnarMultiIntsSerializer, where it would use the same base filename for both the offset and values sections. This bug would only be triggered for segments in excess of 500 million rows. When a segment has fewer rows than that, it could potentially have a values section that needs to be split over multiple files, but the offset is never more than 4 bytes per row. This bug was triggered by the new tests, which use a smaller fileSizeLimit.
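The filename-collision part of the description can be simulated with a small sketch. This is not Druid code: the `_part_<n>` naming scheme and the `collides` helper are assumptions made up purely for illustration of why two sections sharing one base filename clash once both split into multiple files.

```java
import java.util.HashSet;
import java.util.Set;

public class FilenameCollisionSketch
{
  // Assumed part-file naming scheme, for illustration only.
  static String partName(String base, int n)
  {
    return base + "_part_" + n;
  }

  // True if any part file of the offsets section clashes with one of the values section.
  static boolean collides(String offsetsBase, String valuesBase, int partsPerSection)
  {
    Set<String> names = new HashSet<>();
    for (int i = 0; i < partsPerSection; i++) {
      if (!names.add(partName(offsetsBase, i)) || !names.add(partName(valuesBase, i))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args)
  {
    // Pre-fix situation: both sections derive names from the same base.
    System.out.println(collides("column", "column", 2));
    // Post-fix situation: the offsets section gets a distinct ".offsets" base.
    System.out.println(collides("column.offsets", "column", 2));
  }
}
```

With a shared base the two sections emit identical part names, which is exactly the failure mode the `.offsets` suffix in this patch avoids.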
👍
@@ -66,7 +67,8 @@ public static CompressedBigDecimalLongColumnSerializer create(
         segmentWriteOutMedium,
         String.format(Locale.ROOT, "%s.magnitude", filenameBase),
         Integer.MAX_VALUE,
-        CompressionStrategy.LZ4
+        CompressionStrategy.LZ4,
+        GenericIndexedWriter.MAX_FILE_SIZE
Is there any reason to call this in non-test code with any value other than GenericIndexedWriter.MAX_FILE_SIZE? Like, wondering if we should make a version of the creators that automatically passes this argument in, and mark the one that takes the size argument as for tests?
I thought about this, but this seemed fine enough and makes the GenericIndexedWriter API less cluttered. Having two overloads of the method with slightly different arguments seemed like it might be confusing.
Yea, I guess new callers of these methods aren't very frequent, so maybe it's OK. I was just afraid it seems more confusing for the production code to be accepting an argument where there is basically one reasonable value to ever pass in, so any new callers should hopefully check around to see what the other callers are doing and pass in the same constant. Maybe we should add javadocs on these create methods to indicate that most callers should specify GenericIndexedWriter.MAX_FILE_SIZE as the argument, except for testing?
I added a javadoc to the GenericIndexedWriter#ofCompressedByteBuffers and V3CompressedVSizeColumnarMultiIntsSerializer#create methods. There are some other methods that now accept fileSizeLimit, but they are package-private, so they are really only used by tests and by closely related code. I didn't add javadocs to those since it didn't seem necessary.
@@ -72,7 +79,7 @@
   private static <T> GenericIndexed<T> read(ByteBuffer buf, ComplexMetricSerde serde)
   {
-    return GenericIndexed.read(buf, serde.getObjectStrategy());
+    return GenericIndexed.read(buf, serde.getObjectStrategy(), null);
Check notice from Code scanning / CodeQL: deprecated method or constructor invocation of ComplexMetricSerde.getObjectStrategy.
   )
   {
     return new V3CompressedVSizeColumnarMultiIntsSerializer(
         columnName,
         new CompressedColumnarIntsSerializer(
             columnName,
             segmentWriteOutMedium,
-            filenameBase,
+            filenameBase + ".offsets",
Could you please explain why the .offsets and value suffixes need to be added? Is it a QOL improvement not directly related to this patch?
How does backward compatibility look after this change, i.e., can older segment-reading code still read these files?
This was fixing another bug, mentioned in the PR description:

"This patch also includes a fix for an issue on the writer for compressed multi-value string columns, V3CompressedVSizeColumnarMultiIntsSerializer, where it would use the same base filename for both the offset and values sections. This bug would only be triggered for segments in excess of 500 million rows. When a segment has fewer rows than that, it could potentially have a values section that needs to be split over multiple files, but the offset is never more than 4 bytes per row. This bug was triggered by the new tests, which use a smaller fileSizeLimit."

Backward compatibility would be OK, because the secondary filenames are written into the primary file. They aren't hard-coded on the reader side. So, older readers will still be able to read these differently-named files.
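The 500-million-row threshold mentioned above can be made concrete with a back-of-the-envelope sketch. Assumptions for illustration: the per-file limit is Integer.MAX_VALUE bytes (the actual GenericIndexedWriter.MAX_FILE_SIZE may differ), and each row contributes one 4-byte offset.

```java
public class OffsetsThresholdSketch
{
  static final long ASSUMED_FILE_SIZE_LIMIT = Integer.MAX_VALUE; // bytes; an assumption, not Druid's actual constant
  static final long BYTES_PER_OFFSET = 4;                        // one int offset per row

  // Rows the offsets section can hold before it would have to split into a second file.
  static long rowsBeforeOffsetsSplit()
  {
    return ASSUMED_FILE_SIZE_LIMIT / BYTES_PER_OFFSET;
  }

  public static void main(String[] args)
  {
    // 2147483647 / 4 = 536870911, consistent with "in excess of 500 million rows"
    System.out.println(rowsBeforeOffsetsSplit());
  }
}
```

Below roughly 537 million rows the offsets section always fits in one file, which is why only the values section could previously split, and why the shared-base-filename bug stayed hidden until segments grew past that size (or until tests lowered fileSizeLimit).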
       final Closer closer
   )
   {
     GenericIndexedWriter<ByteBuffer> writer = new GenericIndexedWriter<>(
         segmentWriteOutMedium,
         filenameBase,
-        compressedByteBuffersWriteObjectStrategy(compressionStrategy, bufferSize, closer)
+        compressedByteBuffersWriteObjectStrategy(compressionStrategy, bufferSize, closer),
+        fileSizeLimit
@gianm It looks like this is the crux of the change?
The crux of the fix was passing in a SmooshedFileMapper to GenericIndexed#read at various call sites. The new fileSizeLimit parameter is really just there for testing.
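The bug class fixed at those call sites can be simulated with a simplified sketch. None of these classes are Druid's: FileMapper is a stand-in for SmooshedFileMapper, readPart stands in for a reader like GenericIndexed#read, and the "column_part_0" naming is an assumption for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class SmooshSketch
{
  // Stand-in for SmooshedFileMapper: resolves secondary file names to their bytes.
  static class FileMapper
  {
    final Map<String, byte[]> files = new HashMap<>();
  }

  // Stand-in for a reader: a multi-file column can only be resolved through the
  // mapper; with null (the pre-patch situation at various call sites) the
  // secondary part files are unreachable.
  static byte[] readPart(String base, FileMapper mapper)
  {
    if (mapper == null) {
      return null; // pre-patch readers effectively ended up here for multi-file columns
    }
    return mapper.files.get(base + "_part_0"); // assumed part-naming scheme
  }

  public static void main(String[] args)
  {
    FileMapper mapper = new FileMapper();
    mapper.files.put("column_part_0", new byte[] {1, 2, 3});

    System.out.println(readPart("column", mapper).length); // part file found via the mapper
    System.out.println(readPart("column", null));          // null: no mapper, no secondary files
  }
}
```

Single-file columns never notice the missing mapper, which is why these readers worked until a column actually spilled over the file size limit.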
* Various fixes for large columns.
* Use a Random seed.
* Remove erroneous test code.
* Fix two compilation problems.
* Add javadocs.
* Another javadoc.
This patch also removes the 2-arg overload of GenericIndexed#read (which doesn't accept a SmooshedFileMapper). The 3-arg one can be used with the SmooshedFileMapper arg set to null if the caller really doesn't want to use a mapper. However, production callers generally do.