-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reinitialize the column writers when flushing a row group #10418
Conversation
3d8d389
to
89667af
Compare
89667af
to
7cadf0f
Compare
@@ -164,7 +159,7 @@ private void writeChunk(Page page) | |||
if (bufferedBytes >= writerOption.getMaxRowGroupSize()) { | |||
columnWriters.forEach(ColumnWriter::close); | |||
flush(); | |||
columnWriters.forEach(ColumnWriter::reset); | |||
initColumnWriters(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bugfix follows the same strategy as in parquet-mr
project:
After the row group has been flushed, the column writers get reinitialized from scratch.
The FallbackValuesWriter
doesn't reset both of its writers within the #reset()
method and this behavior leads to the unwanted behavior documented in #5518 (comment)
One concern which I have with this approach is that the reinitialisation of the writers may affect the performance. I'd need in this regard feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing https://github.com/trinodb/trino/runs/4650149952?check_suite_focus=true (a test run of this branch) and
https://github.com/trinodb/trino/runs/4649720577?check_suite_focus=true (a build on master
) on TestFullParquetReader
i see differences of about 20% in favor of the master
implementation compared to the current bugfix implementation (for thetrino-hive
, trino-parquet
tests).
I ran again the tests https://github.com/trinodb/trino/runs/4652707273?check_suite_focus=true and found out no major difference between the runs. Actually in most of the cases, the tests were running faster on the branch bugfix/parquet-9749
than on master
The test TestFullParquetReader.testNestedStructs
took 1m / 2m on the bugfix branch and 8 seconds on master. I'm not sure whether this is relevant or is it linked to the github build runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FallbackValuesWriter doesn't reset both of its writers within the #reset() method and this behavior leads to the unwanted behavior documented in #5518 (comment)
When does the FallbackValuesWriter get used?
One concern which I have with this approach is that the reinitialisation of the writers may affect the performance. I'd need in this regard feedback.
That should not be much of a problem. It happens once per row group and it doesn't seem to be a very expensive operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does the FallbackValuesWriter get used?
It gets used as a fallback for the situation where the dictionary encoding grows too big.
7cadf0f
to
7b4e5e0
Compare
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergParquetConnectorTest.java
Show resolved
Hide resolved
23bf3a8
to
e1b6c09
Compare
What about changing the top three The rest of the fields should reset/clear fine, it's just the ValueWriters that seem to have a problem. If that's not worth it, I think the |
@alexjo2144 I'm not sure that I follow what you mean. |
Sorry, my thought was can we either:
or if not
|
e1b6c09
to
682eb69
Compare
@alexjo2144 Indeed, this method is now no longer needed. I'm still investigating also the refactoring possibility:
Taken from the Parquet documentation https://parquet.apache.org/documentation/latest/
I'm thinking that in such cases there is no really a big overhead of creating again from scratch the writers in case of dealing with a new row group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ok.
However, my knowledge about writing Parquet files is non-existent so I can't be the one to approve anything here.
Thanks @skrzypo987 for taking a look |
Reinitialize the column writers when flushing a row group
The previous strategy involved in doing the bugfix was:
However, this strategy collides with #5518 (comment) reason why now the column writers get reinitialized from scratch.
Fixes #9749