JSON Decode Errors #20517

Closed
2 tasks done
FrocketGaming opened this issue Dec 31, 2024 · 4 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

FrocketGaming commented Dec 31, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.read_parquet("test.parquet")

filter_df = df

decode_df = filter_df.with_columns(
    pl.col("json_data").str.json_decode().alias("parsed_json")
)
df_exploded = decode_df.with_columns(
    pl.col("parsed_json").struct.field("data").alias("data")
).explode("data")

Sadly I can't share the data, so I don't know how helpful this will be, but has something changed that I'm missing?

I'm trying to decode the JSON stored in a column of the dataframe so I can explode it and turn all the keys of one specific section into columns. On some versions of Polars this works without issue, but on others it fails every time.

Log output

python : Traceback (most recent call last):
At line:1 char:1
+ python test.py 2> stderr_output.txt
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
  File "test.py", line 7, in <module>
    decode_df = filter_df.with_columns(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\w\.virtualenvs\test-e9koBuxn\Lib\site-packages\polars\dataframe\frame.py", line 9495, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\w\.virtualenvs\test-e9koBuxn\Lib\site-packages\polars\lazyframe\frame.py", line 2043, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: error deserializing JSON: extra key in struct data: OW4N

Issue description

I'm getting a JSON decode error from this call:

decode_df = filter_df.with_columns(
    pl.col("json_data").str.json_decode().alias("parsed_json")
)

If I force-install Polars 1.5.0 this works without issue, but with 1.18.0 I get this error. It also worked without any issues until I did a fresh Polars install about a month ago.

Expected behavior

The JSON data should decode without error.

Installed versions

Non-Working Version

Name: polars
Version: 1.18.0
Summary: Blazingly fast DataFrame library
Home-page: https://www.pola.rs/
Author:
Author-email: Ritchie Vink <[email protected]>
License:
Requires:
Required-by:

Working Version

Name: polars
Version: 1.5.0
Summary: Blazingly fast DataFrame library
Home-page: https://www.pola.rs/
Author:
Author-email: Ritchie Vink <[email protected]>
License:
Requires:
Required-by:
@cmdlineluser
Contributor

It looks like there was a change: #19347

It is now an error if a later row contains a key that was not seen while the schema was being inferred:

import polars as pl

# Only the first row is used for inference, so the key "b" is unknown.
(pl.Series(['{"a": 1}', '{"a": 2, "b": 2}'])
   .str.json_decode(infer_schema_length=1)
)
# ComputeError: error deserializing JSON: extra key in struct data: b

It seems you'd need to increase infer_schema_length (the default is 100) so that inference covers the rows that introduce the extra keys.
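
To illustrate why a larger inference window helps, here is a rough pure-Python model of the behavior described above. It is not the Polars implementation: the function name decode_with_inferred_keys is hypothetical, it only looks at top-level keys, and it ignores types entirely. It collects keys from the first infer_length rows, then rejects any later row that carries a key outside that set.

```python
import json


def decode_with_inferred_keys(json_strings, infer_length=100):
    """Toy model (an assumption, not Polars internals) of key inference.

    Keys are collected from the first `infer_length` rows only;
    `infer_length=None` means every row is scanned.
    """
    window = json_strings if infer_length is None else json_strings[:infer_length]

    # Build the "schema": every top-level key seen in the inference window.
    known_keys = []
    for text in window:
        for key in json.loads(text):
            if key not in known_keys:
                known_keys.append(key)

    # Decode every row against that fixed key set.
    decoded = []
    for text in json_strings:
        record = json.loads(text)
        extra = [key for key in record if key not in known_keys]
        if extra:
            raise ValueError(f"extra key in struct data: {extra[0]}")
        # Keys missing from a row are filled with None.
        decoded.append({key: record.get(key) for key in known_keys})
    return decoded


rows = ['{"a": 1}', '{"a": 2, "b": 2}']

try:
    decode_with_inferred_keys(rows, infer_length=1)
except ValueError as err:
    print(err)  # extra key in struct data: b

print(decode_with_inferred_keys(rows, infer_length=None))
# [{'a': 1, 'b': None}, {'a': 2, 'b': 2}]
```

In this model the failure depends only on where in the data the first unseen key appears, which matches the reported symptom of the same code working on some files and failing on others.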

@FrocketGaming
Author

Thanks! I found that if I increase infer_schema_length to 1000 it works with this data, but the required value might change depending on the data I'm working with, so I wonder how I could keep this dynamic?

@cmdlineluser
Contributor

I don't think you can, other than setting infer_schema_length=None
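
If scanning everything is too slow, a diagnostic along these lines might help in choosing a value for a given file. This is only a sketch using the standard library: the helper names key_paths and rows_needed_for_full_schema are hypothetical, it tracks key names (including nested ones) but not value types, and it assumes the inference window is simply a count of leading rows, which may not match exactly how Polars counts them.

```python
import json


def key_paths(value, prefix=()):
    """Yield the path of every object key found in a decoded JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = prefix + (key,)
            yield path
            yield from key_paths(child, path)
    elif isinstance(value, list):
        # List elements share the path of the list itself.
        for child in value:
            yield from key_paths(child, prefix)


def rows_needed_for_full_schema(json_strings):
    """Return how many leading rows must be scanned to see every key path."""
    seen = set()
    rows_needed = 0
    for index, text in enumerate(json_strings, start=1):
        if text is None:
            continue  # skip missing values
        new_paths = set(key_paths(json.loads(text))) - seen
        if new_paths:
            seen |= new_paths
            rows_needed = index  # this row introduced at least one new key
    return rows_needed


sample = ['{"a": 1}', '{"a": 2, "b": 2}', '{"a": 3}']
print(rows_needed_for_full_schema(sample))  # 2
```

The column values could be passed in as a plain list of strings, for example from df["json_data"].to_list(). Since this already reads every row once, infer_schema_length=None is probably the simpler choice unless the same data is decoded repeatedly.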

@FrocketGaming
Author

I'll give that a shot but at least I know a version that works for me and that I'm not a complete idiot; thanks! 👍


2 participants