Empty or null list of struct cannot be written to parquet #703

Closed
mosyp opened this issue Aug 20, 2021 · 4 comments · Fixed by #1166

mosyp commented Aug 20, 2021

Describe the bug
When writing an Arrow record batch that contains an empty or null list of structs, the write fails with General("Inconsistent length of definition and repetition levels: 0 != 1")

To Reproduce
Write a record batch that contains an empty or null list of structs. PR #704 contains a full reproduction.

Expected behavior
The batch is successfully written

Additional context
From debugging, the failure seems to originate at this line: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L646. At some point array_mask becomes an empty slice, which leads to the inconsistent lengths of definition and repetition levels.

I suspect this might be connected to #594.
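
For context on the error message: Parquet flattens a nested column into leaf values plus two parallel arrays, the definition levels and the repetition levels, with one entry in each per slot. A null list or an empty list still occupies one slot, so the two arrays should always have the same length. Below is a minimal pure-Python sketch of that bookkeeping, not the arrow-rs implementation. It assumes a column shaped as an optional list of optional structs with a single optional integer field; the function name levels_for_column and the field name x are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed schema (standard 3-level Parquet list encoding):
#   optional group a (LIST) { repeated group list {
#       optional group element { optional int32 x } } }
# Definition levels under that schema:
#   0 = list is null, 1 = list is empty, 2 = struct is null,
#   3 = struct present but x is null, 4 = x is present.
# Repetition levels: 0 starts a new row, 1 continues the current list.

def levels_for_column(
    rows: List[Optional[List[Optional[Dict[str, Optional[int]]]]]],
) -> Tuple[List[int], List[int], List[int]]:
    def_levels: List[int] = []
    rep_levels: List[int] = []
    values: List[int] = []
    for row in rows:
        if row is None:
            # A null list still emits one (def, rep) pair.
            def_levels.append(0)
            rep_levels.append(0)
        elif len(row) == 0:
            # So does an empty list.
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, item in enumerate(row):
                rep_levels.append(0 if i == 0 else 1)
                if item is None:
                    def_levels.append(2)
                elif item.get("x") is None:
                    def_levels.append(3)
                else:
                    def_levels.append(4)
                    values.append(item["x"])
    return def_levels, rep_levels, values


rows = [None, [], [{"x": 1}, None, {"x": None}]]
defs, reps, vals = levels_for_column(rows)
# The writer's consistency check amounts to this invariant.
assert len(defs) == len(reps)
```

If one of the two branches for null or empty lists emitted a level into only one array, the lengths would diverge by one per such row, which would look like the "0 != 1" in the error.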

@chadbrewbaker

I'm getting the same issue on darwin-aarch64 using the json2parquet CLI.

wget https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json
# Convert it into newline delimited JSON or else Arrow complains
cat large-file.json | jq -c '.[]' > lf.json
json2parquet lf.json bork.pq

# Error: General("Inconsistent length of definition and repetition levels: 891 != 1388")
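
For anyone without jq installed, the conversion step above (top-level JSON array to newline-delimited JSON) can be approximated with the Python standard library. This is a rough sketch; the function name json_array_to_ndjson is hypothetical, and it loads the whole file into memory, which is fine for a test file but not for very large inputs.

```python
import json


def json_array_to_ndjson(array_text: str) -> str:
    """Turn the text of a top-level JSON array into one compact record per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators mimic jq's -c output; ensure_ascii=False keeps
    # non-ASCII characters unescaped.
    return "".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False) + "\n"
        for record in records
    )


# Example with a tiny inline array instead of a file on disk.
print(json_array_to_ndjson('[{"a": 1}, {"b": []}]'), end="")
```

To use it on a file, read the input with open(...).read() and write the returned string to the output path.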

chadbrewbaker commented Dec 12, 2021

The line of JSON shown below the panic message makes json2parquet panic with:

thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)'

{"ts":1331901001.88,"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],"rx_hosts":["192.168.202.79"],"conn_uids":["CaJMZy195M8cuXfxn4"],"source":"HTTP","depth":0,"analyzers":[],"mime_type":"text/html","duration":0.0,"is_orig":false,"seen_bytes":1433,"total_bytes":1433,"missing_bytes":0,"overflow_bytes":0,"timedout":false}

The Python bindings handle this just fine.

from pyarrow import json
fn = 'mini.json'
table = json.read_json(fn)
print(table)
pyarrow.Table
ts: double
fuid: string
tx_hosts: list<item: string>
  child 0, item: string
rx_hosts: list<item: string>
  child 0, item: string
conn_uids: list<item: string>
  child 0, item: string
source: string
depth: int64
analyzers: list<item: null>
  child 0, item: null
mime_type: string
duration: double
is_orig: bool
seen_bytes: int64
total_bytes: int64
missing_bytes: int64
overflow_bytes: int64
timedout: bool
----
ts: [[1331901001.88]]
fuid: [["Fd3cGk2agqUftBeFx4"]]
tx_hosts: [[["192.168.229.251"]]]
rx_hosts: [[["192.168.202.79"]]]
conn_uids: [[["CaJMZy195M8cuXfxn4"]]]
source: [["HTTP"]]
depth: [[0]]
analyzers: [[0 nulls]]
mime_type: [["text/html"]]
duration: [[0]]
...
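
Note that pyarrow infers analyzers as list<item: null>, since the only value it sees for that field is an empty array. Assuming (unconfirmed in this thread) that such fields are what trips the Rust writer, here is a small standard-library sketch for finding them in newline-delimited JSON before conversion. The function name fields_with_only_empty_lists is hypothetical, and the sample record is a trimmed copy of the one above.

```python
import json
from typing import Iterable, List


def fields_with_only_empty_lists(ndjson_lines: Iterable[str]) -> List[str]:
    """Return top-level keys whose array values are empty in every record."""
    seen_empty = set()
    seen_nonempty = set()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            if isinstance(value, list):
                (seen_nonempty if value else seen_empty).add(key)
    # Keys that were never seen with a non-empty array have no inferable item type.
    return sorted(seen_empty - seen_nonempty)


# Trimmed version of the record from this thread.
sample = [
    '{"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],'
    '"analyzers":[],"is_orig":false}'
]
print(fields_with_only_empty_lists(sample))  # ['analyzers']
```

This only inspects top-level keys; nested arrays would need a recursive walk.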

@chadbrewbaker

Looks like we might have to translate https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_json.py

Better yet, add a JSON directory that all the Arrow clients can share:
https://github.com/apache/arrow-testing/tree/master/data

alamb commented Jan 13, 2022

Here is a PR with a proposed fix: #1166

Does anyone have time to check whether it fixes their use case?
