Parquet: Interval columns shouldn't write min/max stats #5145

Closed
Jefffrey opened this issue Nov 29, 2023 · 1 comment · Fixed by #5147
Labels
bug parquet Changes to the parquet crate

Comments

@Jefffrey
Contributor

Describe the bug

Per the Parquet spec:

The sort order used for INTERVAL is undefined. When writing data, no min/max statistics should be saved for this type and if such non-compliant statistics are found during reading, they must be ignored.
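As a rough illustration of what the spec asks of a writer, here is a minimal self-contained sketch. It does not use the parquet crate: `ConvertedKind`, `MinMax`, `has_defined_sort_order` and `compute_min_max` are hypothetical names made up for this example, and the real writer internals may look quite different.

```rust
/// Hypothetical, simplified stand-in for a column's converted type.
/// This is not the parquet crate's `ConvertedType`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ConvertedKind {
    /// Plain fixed-length byte array, compared as unsigned bytes.
    None,
    /// INTERVAL: the spec leaves its sort order undefined.
    Interval,
}

/// Illustrative min/max holder for byte-array values.
#[derive(Debug, Default, PartialEq, Eq)]
struct MinMax {
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
}

/// Returns false for types whose sort order the spec leaves undefined.
fn has_defined_sort_order(kind: ConvertedKind) -> bool {
    !matches!(kind, ConvertedKind::Interval)
}

/// Computes min/max only when the type has a defined sort order;
/// otherwise returns empty statistics so nothing is written.
fn compute_min_max(kind: ConvertedKind, values: &[Vec<u8>]) -> MinMax {
    if !has_defined_sort_order(kind) {
        return MinMax::default();
    }
    MinMax {
        // `Vec<u8>` orders lexicographically by unsigned byte value.
        min: values.iter().min().cloned(),
        max: values.iter().max().cloned(),
    }
}
```

With this guard, calling `compute_min_max(ConvertedKind::Interval, &values)` yields a `MinMax` with both fields unset, while the same values under `ConvertedKind::None` still get a min and a max.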

To Reproduce

Add the following test to https://github.com/apache/arrow-rs/blob/ef6932f31e243d8545e097569653c8d3f1365b4d/parquet/src/column/writer/mod.rs:

    #[test]
    fn test_interval_should_not_have_min_max() {
        let input = [
            vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
            vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
        ]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

        let page_writer = get_test_page_writer();
        let mut writer = get_test_interval_column_writer(page_writer);
        writer.write_batch(&input, None, None).unwrap();

        let metadata = writer.close().unwrap().metadata;
        let stats = if let Some(Statistics::FixedLenByteArray(stats)) = metadata.statistics() {
            stats.clone()
        } else {
            panic!("metadata missing statistics");
        };
        assert!(!stats.has_min_max_set());
    }

    fn get_test_interval_column_writer(
        page_writer: Box<dyn PageWriter>,
    ) -> ColumnWriterImpl<'static, FixedLenByteArrayType> {
        let descr = Arc::new(get_test_interval_column_descr());
        let column_writer = get_column_writer(descr, Default::default(), page_writer);
        get_typed_column_writer::<FixedLenByteArrayType>(column_writer)
    }

    fn get_test_interval_column_descr() -> ColumnDescriptor {
        let path = ColumnPath::from("col");
        let tpe =
            SchemaType::primitive_type_builder("col", FixedLenByteArrayType::get_physical_type())
                .with_length(12)
                .with_converted_type(ConvertedType::INTERVAL)
                .build()
                .unwrap();
        ColumnDescriptor::new(Arc::new(tpe), 0, 0, path)
    }

Expected behavior

The test should pass. It currently fails, apparently because min/max statistics are being set for the INTERVAL column.

@tustvold
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'parquet'} from #5147

@tustvold tustvold added the parquet Changes to the parquet crate label Jan 5, 2024