-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent value for data_page_max_rows
setting in DataFusion ParquetOptions
and in ArrowWriterOptions
#11367
Comments
take |
#11444 is similar |
Edited My goal was to delineate the differences btwn the APIs more, but to leave the actual fixing of the defaults (& tests to enforce that it remains in sync) as this issue assigned to @Rachelint . I was hoping #11444 could actually help make this issue/PR (setting defaults) be easier to test. I could also add in the changes to the defaults as well, if @Rachelint is ok with that? |
Ok, feel free to make it. |
take |
@wiedld 's pr #11524 has a very nice table of the current state of affairs: Here are the places where the current defaults differ:
† For these settings, datafusion has no default (None). However, once datafusion's ParquetOptions are used by the extern parquet (a.k.a. converted to parquet's ArrowWriterOptions) then it uses the extern parquet's defaults. Refer to the newly added tests. . Additionally, there are differences in the bloom filter configurations based upon partial definition (a.k.a. leaving some as default, and some as defined):
|
Describe the bug
@wiedld pointed out that the default value of
data_page_max_rows
is different in DataFusion than in arrow-rshttps://docs.rs/datafusion/latest/datafusion/common/config/struct.ParquetOptions.html sets it to
usize::max
As of apache/arrow-rs#5957 was released in 51.1.0 , arrow-rs sets it to
20,000
To Reproduce
N/A
Expected behavior
The DataFusion defaults should be the same as the arrow-rs https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html, unless there is a good reason to deviate
In this case, I think we should go with the arrow-rs value
Ideally a fix for this ticket would both
ParquetOptions
to match the default inArrowWriterOptions
Additional context
No response
The text was updated successfully, but these errors were encountered: