Goal:
Trying to read a mongo document with an embedded object containing an empty array into a pyarrow table, then write it out as a parquet file.
Expected result:
Parquet file created
Actual Result:
pymongoarrow raises an error when creating the pyarrow.Table. Interestingly, reading the same document from mongo directly and using pyarrow.json to create the table works fine. Embedded objects with non-empty arrays also work fine with pymongoarrow.
Steps to reproduce:
from pymongo import MongoClient
import pymongoarrow.api as pmaapi
import pyarrow.parquet as papq
import pyarrow.json as pajson
import io
import bson.json_util
client = MongoClient()
collection = client.testdb.data
collection.drop()
client.testdb.data.insert_many([
    { '_id': 1, 'foo': { 'bar': ['1','2'] } },
    { '_id': 2, 'foo': { 'bar': [] } },
])
# get document out of mongo, put it in a file and read it with pyarrow and write it to parquet
doc1 = client.testdb.data.find_one({'_id': 1})
string1 = bson.json_util.dumps(doc1, indent = 2)
file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
papatable1 = pajson.read_json(file1)
print(str(papatable1))
papq.write_table(papatable1, 'pyarrow1.parquet')
# read document with pymongoarrow and write it to parquet
pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
print(str(pmapatable1))
papq.write_table(pmapatable1, 'pymongoarrow1.parquet')
doc2 = client.testdb.data.find_one({'_id': 2})
string2 = bson.json_util.dumps(doc2, indent = 2)
file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
papatable2 = pajson.read_json(file2)
print(str(papatable2))
papq.write_table(papatable2, 'pyarrow2.parquet')
pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
papq.write_table(pmapatable2, 'pymongoarrow2.parquet')
produces
$ python repro.py
pyarrow.Table
_id: int64
foo: struct<bar: list<item: string>>
child 0, bar: list<item: string>
child 0, item: string
----
_id: [[1]]
foo: [
-- is_valid: all not null
-- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int32
foo: struct<bar: list<item: string>>
child 0, bar: list<item: string>
child 0, item: string
----
_id: [[1]]
foo: [
-- is_valid: all not null
-- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int64
foo: struct<bar: list<item: null>>
child 0, bar: list<item: null>
child 0, item: null
----
_id: [[2]]
foo: [
-- is_valid: all not null
-- child 0 type: list<item: null>
[0 nulls]]
Traceback (most recent call last):
File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
process_bson_stream(batch, context)
File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
TypeError: List requires DataType or Field
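For context, the TypeError at the bottom of the traceback is the one pyarrow.list_() raises when it is handed something that is neither a DataType nor a Field, which is presumably what pymongoarrow's type inference ends up doing once it hits the empty embedded array and has no element type to work with. A minimal sketch of that, plus a possible workaround I have not verified (it assumes this pymongoarrow version accepts an explicit Schema with nested pyarrow types):

import pyarrow as pa
from pymongo import MongoClient
from pymongoarrow.api import Schema, find_arrow_all

# The failing frame bottoms out in pyarrow.list_(), which raises this exact
# TypeError when given neither a DataType nor a Field:
try:
    pa.list_(None)
except TypeError as exc:
    print(exc)  # List requires DataType or Field

# Possible workaround (untested assumption): spell out the element type of
# foo.bar in an explicit schema so nothing has to be inferred from the
# empty array.
client = MongoClient()
schema = Schema({
    '_id': pa.int64(),
    'foo': pa.struct([('bar', pa.list_(pa.string()))]),
})
table = find_arrow_all(client.testdb.data, {'_id': {'$eq': 2}}, schema=schema)
print(table)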
FWIW, for the three parquet files that are produced, duckdb shows the following...
Versions: