Empty or null list of struct cannot be written to parquet #703
Comments
I'm getting the same issue on darwin-aarch64 using the json2parquet CLI.

wget https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json
# Convert it into newline delimited JSON or else Arrow complains
cat large-file.json | jq -c '.[]' > lf.json
json2parquet lf.json bork.pq
# Error: General("Inconsistent length of definition and repetition levels: 891 != 1388")
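For anyone reproducing this without jq, the conversion step above can be approximated with the Python standard library. This is a minimal sketch, assuming the input is a single top-level JSON array as in large-file.json; the function name `array_to_ndjson` is made up for illustration and is not part of json2parquet or Arrow.

```python
import json


def array_to_ndjson(text: str) -> str:
    """Turn a top-level JSON array into newline-delimited JSON.

    Roughly mirrors `jq -c '.[]'`: one compact JSON value per line.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators and ensure_ascii=False keep the output close to jq -c.
    return "".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False) + "\n"
        for record in records
    )
```

To use it, read large-file.json as text, pass it to `array_to_ndjson`, and write the result to lf.json before running json2parquet.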
This line of JSON is barfing in json2parquet with:

thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)'
(arrow-rs/parquet/src/arrow/levels.rs, line 757 in e0abda2)

{"ts":1331901001.88,"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],"rx_hosts":["192.168.202.79"],"conn_uids":["CaJMZy195M8cuXfxn4"],"source":"HTTP","depth":0,"analyzers":[],"mime_type":"text/html","duration":0.0,"is_orig":false,"seen_bytes":1433,"total_bytes":1433,"missing_bytes":0,"overflow_bytes":0,"timedout":false}

The Python bindings handle this just fine.

from pyarrow import json
fn = 'mini.json'
table = json.read_json(fn)
print(table)

pyarrow.Table
ts: double
fuid: string
tx_hosts: list<item: string>
child 0, item: string
rx_hosts: list<item: string>
child 0, item: string
conn_uids: list<item: string>
child 0, item: string
source: string
depth: int64
analyzers: list<item: null>
child 0, item: null
mime_type: string
duration: double
is_orig: bool
seen_bytes: int64
total_bytes: int64
missing_bytes: int64
overflow_bytes: int64
timedout: bool
----
ts: [[1331901001.88]]
fuid: [["Fd3cGk2agqUftBeFx4"]]
tx_hosts: [[["192.168.229.251"]]]
rx_hosts: [[["192.168.202.79"]]]
conn_uids: [[["CaJMZy195M8cuXfxn4"]]]
source: [["HTTP"]]
depth: [[0]]
analyzers: [[0 nulls]]
mime_type: [["text/html"]]
duration: [[0]]
...
Looks like we might have to translate https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_json.py. Better yet, add a json directory for all the Arrow clients:
Here is a PR with a proposed fix: #1166. Does anyone have time to check whether it fixes their use case?
Describe the bug
Writing an Arrow record batch that contains an empty or null list of structs fails with:
General("Inconsistent length of definition and repetition levels: 0 != 1")
To Reproduce
Write a record batch that contains an empty or null list of structs. PR #704 has a full reproduction of the behaviour.
Expected behavior
The batch is successfully written
Additional context
What I could debug is that the failure seems to come from this line: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L646. At some point array_mask becomes an empty slice, which leads to the "Inconsistent length of definition and repetition levels" error. I feel like this might be somehow connected to #594.