Parquet v8.0.0 panics when reading all null column to NullArray #1245

Closed
bjchambers opened this issue Jan 28, 2022 · 5 comments · Fixed by #1246
Labels
bug parquet Changes to the parquet crate

Comments

@bjchambers
Contributor

Describe the bug
When reading a Parquet file with a single row and a single optional int32 column whose only value is null, the Parquet reader panics.

This was found when trying to upgrade to Arrow/Parquet v8.0.0. The original file contained additional columns but was reduced to this.

The referenced branch contains a Parquet file with a single nullable Int32 value, which I produced using the following pandas code:

import pandas as pd

# A single row whose only value is None
d = {'int32': [None]}
df = pd.DataFrame(data=d)
df.to_parquet("minimal.parquet")

The file loads correctly using 6.3.0. With 8.0.0 it panics.

running 1 test
thread 'arrow::arrow_reader::tests::test_single_null_i32' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`', parquet/src/arrow/record_reader/definition_levels.rs:102:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:100:14
   2: core::panicking::assert_failed_inner
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:181:17
   3: core::panicking::assert_failed
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:138:5
   4: parquet::arrow::record_reader::definition_levels::DefinitionLevelBuffer::set_len
             at ./src/arrow/record_reader/definition_levels.rs:102:9
   5: parquet::arrow::record_reader::GenericRecordReader<V,CV>::set_values_written
             at ./src/arrow/record_reader.rs:347:13
   6: parquet::arrow::record_reader::GenericRecordReader<V,CV>::read_one_batch
             at ./src/arrow/record_reader.rs:299:9
   7: parquet::arrow::record_reader::GenericRecordReader<V,CV>::read_records
             at ./src/arrow/record_reader.rs:196:31
   8: parquet::arrow::array_reader::read_records
             at ./src/arrow/array_reader.rs:137:33
   9: <parquet::arrow::array_reader::NullArrayReader<T> as parquet::arrow::array_reader::ArrayReader>::next_batch
             at ./src/arrow/array_reader.rs:209:13
  10: <parquet::arrow::array_reader::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::next_batch::{{closure}}
             at ./src/arrow/array_reader.rs:1202:27
  11: core::iter::adapters::map::map_try_fold::{{closure}}
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/iter/adapters/map.rs:91:28
  12: core::iter::traits::iterator::Iterator::try_fold
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/iter/traits/iterator.rs:1995:21
  13: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/iter/adapters/map.rs:117:9
  14: <parquet::arrow::array_reader::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::next_batch
             at ./src/arrow/array_reader.rs:1199:30
  15: <parquet::arrow::arrow_reader::ParquetRecordBatchReader as core::iter::traits::iterator::Iterator>::next
             at ./src/arrow/arrow_reader.rs:177:15
  16: parquet::arrow::arrow_reader::tests::test_single_null_i32
             at ./src/arrow/arrow_reader.rs:1000:22
  17: parquet::arrow::arrow_reader::tests::test_single_null_i32::{{closure}}
             at ./src/arrow/arrow_reader.rs:990:5
  18: core::ops::function::FnOnce::call_once
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/ops/function.rs:227:5
  19: core::ops::function::FnOnce::call_once
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test arrow::arrow_reader::tests::test_single_null_i32 ... FAILED

To Reproduce

Steps to reproduce the behavior:

Run the test from this branch.

https://github.com/bjchambers/arrow-rs/tree/repro-parquet-panic

https://github.com/bjchambers/arrow-rs/blob/repro-parquet-panic/parquet/src/arrow/arrow_reader.rs#L990

Expected behavior

Not a panic. The behavior in 6.3.0 was expected.


@bjchambers bjchambers added the bug label Jan 28, 2022
@bjchambers
Contributor Author

@tustvold I'm not sure, but it looks like #1054 may be related?

@tustvold
Contributor

tustvold commented Jan 28, 2022

Yup that definitely sounds plausible, I'll take a look shortly, thanks for reporting 👍

FWIW I notice it is using the NullArrayReader. I don't know if you can encourage pandas to give it a different column type (e.g. Int32), but that might be an interesting data point.

@bjchambers
Copy link
Contributor Author

Ah, so it is. I gave it the encouragement, and it looks like that works. Interestingly, trying pqrs indicates that the schema in both cases is the same (OPTIONAL int32).

version: 1
num of rows: 1
created by: parquet-cpp-arrow version 6.0.1
metadata:
  pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "int32", "field_name": "int32", "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "6.0.1"}, "pandas_version": "1.4.0"}
  ARROW:schema: /////1gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAOQBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAK4BAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiAiaW50MzIiLCAiZmllbGRfbmFtZSI6ICJpbnQzMiIsICJwYW5kYXNfdHlwZSI6ICJlbXB0eSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjYuMC4xIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIxLjQuMCJ9AAABAAAAFAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABARAAAAAcAAAABAAAAAAAAAAFAAAAaW50MzIAAAAEAAQABAAAAAAAAAA=
message schema {
  OPTIONAL INT32 int32 (UNKNOWN);
}

vs.

version: 1
num of rows: 1
created by: parquet-cpp-arrow version 6.0.1
metadata:
  pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "int32", "field_name": "int32", "pandas_type": "int32", "numpy_type": "Int32", "metadata": null}], "creator": {"library": "pyarrow", "version": "6.0.1"}, "pandas_version": "1.4.0"}
  ARROW:schema: /////2ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAOQBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAK0BAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiAiaW50MzIiLCAiZmllbGRfbmFtZSI6ICJpbnQzMiIsICJwYW5kYXNfdHlwZSI6ICJpbnQzMiIsICJudW1weV90eXBlIjogIkludDMyIiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiNi4wLjEifSwgInBhbmRhc192ZXJzaW9uIjogIjEuNC4wIn0AAAABAAAAFAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAhAAAAAgAAAABAAAAAAAAAAFAAAAaW50MzIAAAAIAAwACAAHAAgAAAAAAAABIAAAAA==
message schema {
  OPTIONAL INT32 int32;
}
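
The Parquet schemas printed by pqrs agree, but the embedded pandas metadata does not. A minimal sketch of that comparison follows, using only the `columns` entries copied from the two dumps above; the remaining JSON keys are trimmed for brevity, and the variable names are illustrative rather than taken from any tool.

```python
import json

# "columns" entries copied from the pandas metadata in the two pqrs dumps.
# The other top-level keys are omitted here for brevity.
all_null = json.loads(
    '{"name": "int32", "field_name": "int32", "pandas_type": "empty", '
    '"numpy_type": "object", "metadata": null}'
)
typed = json.loads(
    '{"name": "int32", "field_name": "int32", "pandas_type": "int32", '
    '"numpy_type": "Int32", "metadata": null}'
)

# Collect the keys whose values differ between the two files.
diff = {k: (all_null[k], typed[k]) for k in all_null if all_null[k] != typed[k]}

for key, (before, after) in sorted(diff.items()):
    print(f"{key}: {before!r} -> {after!r}")
```

Only `pandas_type` and `numpy_type` differ, which lines up with the untyped, all-null column being the one that takes the NullArrayReader path.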

@tustvold
Contributor

tustvold commented Jan 28, 2022

I've found the bug (the derp is strong), and it will only impact NullArray. Will post a PR shortly 😄

> indicates that the schema in both cases is the same

Yeah the ARROW:schema blob is a base64 encoded flatbuffer with the arrow schema, and that is where the null array-ness is coming from.

And yes, there is a base64 encoded flatbuffer, inside a thrift metadata payload, inside a parquet file - it's wild 😆
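
To make the layering a little more concrete, here is a minimal sketch that decodes only the first 12 base64 characters of each `ARROW:schema` value from the dumps above. It assumes the blob uses the Arrow IPC encapsulated message framing (a 4-byte `0xFFFFFFFF` continuation marker followed by a little-endian int32 metadata length); the helper name `ipc_prefix` is hypothetical. Parsing the flatbuffer itself would need an Arrow library and is not attempted here.

```python
import base64
import struct

# First 12 base64 characters of each ARROW:schema value in the pqrs dumps.
prefixes = {
    "all-null file": "/////1gCAAAQ",
    "Int32 file": "/////2ACAAAQ",
}


def ipc_prefix(b64_prefix):
    """Return (continuation marker, metadata length) from a base64 prefix.

    Assumes Arrow IPC framing: uint32 marker, then little-endian int32 length.
    """
    raw = base64.b64decode(b64_prefix)  # 12 base64 chars -> 9 bytes
    continuation, metadata_len = struct.unpack("<Ii", raw[:8])
    return continuation, metadata_len


for label, prefix in prefixes.items():
    marker, length = ipc_prefix(prefix)
    print(f"{label}: marker=0x{marker:08X}, flatbuffer length={length} bytes")
```

Both values start with the same marker and differ in the declared flatbuffer length, so the Arrow-level difference between the two files lives inside that payload rather than in the Parquet schema.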

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 28, 2022
@tustvold
Contributor

Thanks again for reporting, fix in #1246

@alamb alamb added the parquet Changes to the parquet crate label Jan 29, 2022
@alamb alamb changed the title from "Parquet v8.0.0 panics when reading a simple file" to "Parquet v8.0.0 panics when reading all null column to NullArray" Jan 29, 2022
alamb pushed a commit that referenced this issue Jan 29, 2022