parquet min_max statistics cannot read by pyarrow #799

Closed
tfiasco opened this issue Sep 23, 2021 · 1 comment · Fixed by #3527

tfiasco commented Sep 23, 2021

Describe the bug
A Parquet file written by arrow-rs shows no min/max statistics when read with pyarrow.

To Reproduce

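// rust code (imports added; the statements below belong inside fn main)
use std::{fs, sync::Arc};

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowWriter, ParquetFileArrowReader};
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use parquet::file::serialized_reader::SerializedFileReader;
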
let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
let id_array2 = Int32Array::from(vec![2, 3, 4, 5, 6]);
let schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("id2", DataType::Int32, false),
]));

let batch = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(id_array), Arc::new(id_array2)],
)
.unwrap();

let writer_properties = WriterProperties::builder()
    .set_compression(Compression::ZSTD)
    .set_statistics_enabled(true)
    .build();

let path = "/.../test.parquet";
let file = fs::File::create(&path).unwrap();

let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(writer_properties)).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

let file2 = fs::File::open(&path).unwrap();

let file_reader = SerializedFileReader::new(file2).unwrap();
let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));

println!(
    "statistics: {:?}",
    arrow_reader
        .get_metadata()
        .row_group(0)
        .column(0)
        .statistics()
);
println!(
    "statistics: {:?}",
    arrow_reader
        .get_metadata()
        .row_group(0)
        .column(1)
        .statistics()
);

// output: 
// statistics: Some(Int32({min: Some(1), max: Some(5), distinct_count: None, null_count: 0, min_max_deprecated: false}))
// statistics: Some(Int32({min: Some(2), max: Some(6), distinct_count: None, null_count: 0, min_max_deprecated: false}))

# python code

import pyarrow.parquet as pq
f = pq.ParquetFile('./test.parquet')
print(f.metadata.row_group(0).column(0).statistics)

# output:
"""
<pyarrow._parquet.Statistics object at 0x7fbf8d409dd0>
  has_min_max: False
  min: None
  max: None
  null_count: 0
  distinct_count: 0
  num_values: 5
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE
"""

Expected behavior
pyarrow should report statistics such as:

  has_min_max: True
  min: 1
  max: 5

Additional context
rust lib version:

parquet = "5.4.0"
arrow = "5.4.0"

python lib version:

pyarrow==5.0.0

tfiasco added the bug label Sep 23, 2021

pjmore commented Jan 24, 2022

@tfiasco I think this is because the C++ implementation only uses min_value and max_value if a column order is set: https://github.com/apache/arrow/blob/54460d96ba1d613e472d8d9a96c072147e736b4d/cpp/src/parquet/metadata.cc#L82
For your example, with the current implementation (I used version 7), a modified version of your snippet prints None for column_orders:

println!(
    "statistics: {:?}",
    &arrow_reader
        .get_metadata()
        .row_group(0)
        .to_thrift()
        .columns[0]
        .meta_data.as_ref()
);
println!(
    "statistics: {:?}",
    &arrow_reader
        .get_metadata()
        .row_group(0)
        .to_thrift()
        .columns[1]
        .meta_data.as_ref()
);
println!(
    "column_orders: {:?}",
    &arrow_reader
        .get_metadata()
        .file_metadata()
        .column_orders()
);

This prints:

statistics: Some(ColumnMetaData { type_: Int32, encodings: [Plain, RleDictionary, Rle], path_in_schema: ["id"], codec: Zstd, num_values: 5, total_uncompressed_size: 70, total_compressed_size: 88, key_value_metadata: None, data_page_offset: 47, index_page_offset: None, dictionary_page_offset: Some(4), statistics: Some(Statistics { max: None, min: None, null_count: None, distinct_count: None, max_value: Some([5, 0, 0, 0]), min_value: Some([1, 0, 0, 0]) }), encoding_stats: None, bloom_filter_offset: None })
statistics: Some(ColumnMetaData { type_: Int32, encodings: [Plain, RleDictionary, Rle], path_in_schema: ["id2"], codec: Zstd, num_values: 5, total_uncompressed_size: 70, total_compressed_size: 88, key_value_metadata: None, data_page_offset: 181, index_page_offset: None, dictionary_page_offset: Some(138), statistics: Some(Statistics { max: None, min: None, null_count: None, distinct_count: None, max_value: Some([6, 0, 0, 0]), min_value: Some([2, 0, 0, 0]) }), encoding_stats: None, bloom_filter_offset: None })
column_orders: None
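
For reference, the min_value and max_value byte arrays in this dump are the plain-encoded values, which for an INT32 column means 4 bytes, little-endian, signed. A minimal Python sketch of decoding them follows; the helper name `decode_int32_stat` is made up for illustration and is not part of pyarrow or arrow-rs:

```python
import struct


def decode_int32_stat(raw):
    """Decode a plain-encoded INT32 statistic (4 bytes, little-endian, signed).

    Hypothetical helper for illustration only.
    """
    if len(raw) != 4:
        raise ValueError(f"expected 4 bytes for INT32, got {len(raw)}")
    return struct.unpack("<i", bytes(raw))[0]


# Byte arrays copied from the Thrift dump above.
id_min = decode_int32_stat([1, 0, 0, 0])
id_max = decode_int32_stat([5, 0, 0, 0])
id2_min = decode_int32_stat([2, 0, 0, 0])
id2_max = decode_int32_stat([6, 0, 0, 0])

print(id_min, id_max, id2_min, id2_max)  # 1 5 2 6
```

So the values are present in the file and match the expected 1..5 and 2..6 ranges; only the deprecated min and max fields are empty.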

So it looks like the C++ implementation simply ignores the min_value and max_value statistics when no column order is set. While digging through the code to figure this out, I found a comment in parquet_format saying that without column_orders the meaning of min_value and max_value is undefined. If the comment is accurate, it seems like a bug in the current implementation that min_value and max_value are used this way, that is, without column_orders. The comment in question is:

  /// Sort order used for the min_value and max_value fields of each column in
  /// this file. Sort orders are listed in the order matching the columns in the
  /// schema. The indexes are not necessary the same though, because only leaf
  /// nodes of the schema are represented in the list of sort orders.
  /// 
  /// Without column_orders, the meaning of the min_value and max_value fields is
  /// undefined. To ensure well-defined behaviour, if min_value and max_value are
  /// written to a Parquet file, column_orders must be written as well.
  /// 
  /// The obsolete min and max fields are always sorted by signed comparison
  /// regardless of column_orders.

https://github.com/sunchao/parquet-format-rs/blob/b0d5bcb51a919837310c7dccd5141ea956346357/src/parquet_format.rs#L4919
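
To make the consequence of that comment concrete, here is a simplified Python model of how a reader that follows it might choose which statistics fields to trust. This is a sketch of the rule as described in this thread, not the actual C++ code; `ThriftStats` and `select_min_max` are hypothetical names, and requiring both bounds to be present is an assumption of this sketch:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ThriftStats:
    """Hypothetical stand-in for the Thrift Statistics struct in the dump above."""
    min: Optional[bytes] = None        # deprecated field
    max: Optional[bytes] = None        # deprecated field
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None


def select_min_max(
    stats: ThriftStats, column_order_defined: bool
) -> Tuple[bool, Optional[bytes], Optional[bytes]]:
    """Return (has_min_max, min_bytes, max_bytes).

    min_value/max_value are trusted only when the file declares a column
    order; otherwise the reader falls back to the deprecated min/max.
    """
    if column_order_defined:
        lo, hi = stats.min_value, stats.max_value
    else:
        lo, hi = stats.min, stats.max
    has_min_max = lo is not None and hi is not None
    return has_min_max, lo, hi


# What the file in this issue contains for column "id":
# min_value/max_value are set, deprecated min/max are not, column_orders is None.
written = ThriftStats(min_value=bytes([1, 0, 0, 0]), max_value=bytes([5, 0, 0, 0]))

print(select_min_max(written, column_order_defined=False))  # (False, None, None)
print(select_min_max(written, column_order_defined=True))
```

Under this model the first call reproduces the `has_min_max: False` that pyarrow reports, and the second shows that the same statistics become visible once a column order is declared.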
