Left/right join with single row dataframe with explicit types replaces nulls with default values. #19804

Closed
2 tasks done
DavideCanton opened this issue Nov 15, 2024 · 4 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

DavideCanton commented Nov 15, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

>>> import polars as pl
>>> df = pl.DataFrame({'k': [1,2]})

>>> df
shape: (2, 1)
┌─────┐
│ k   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

>>> df2 = pl.DataFrame({'k': [1], 'int': [None], 'float':[None], 'str': [None]}, schema_overrides={'int': pl.Int64, 'float': pl.Float64, 'str': pl.String})

>>> df2
shape: (1, 4)
┌─────┬──────┬───────┬──────┐
│ k   ┆ int  ┆ float ┆ str  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ f64   ┆ str  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ null ┆ null  ┆ null │
└─────┴──────┴───────┴──────┘

>>> df.join(df2, on='k', how='left')
shape: (2, 4)
┌─────┬──────┬───────┬──────┐
│ k   ┆ int  ┆ float ┆ str  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ f64   ┆ str  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ 0    ┆ 0.0   ┆      │
│ 2   ┆ null ┆ null  ┆ null │
└─────┴──────┴───────┴──────┘

>>> df2_casted = pl.DataFrame({'k': [1], 'int': [None], 'float':[None], 'str': [None]}).with_columns(int=pl.col.int.cast(pl.Int64), float=pl.col.float.cast(pl.Float64), str=pl.col.str.cast(pl.String))

>>> df2_casted
shape: (1, 4)
┌─────┬──────┬───────┬──────┐
│ k   ┆ int  ┆ float ┆ str  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ f64   ┆ str  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ null ┆ null  ┆ null │
└─────┴──────┴───────┴──────┘

>>> df.join(df2_casted, on='k', how='left')
shape: (2, 4)
┌─────┬──────┬───────┬──────┐
│ k   ┆ int  ┆ float ┆ str  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ f64   ┆ str  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ 0    ┆ 0.0   ┆      │
│ 2   ┆ null ┆ null  ┆ null │
└─────┴──────┴───────┴──────┘

Log output

join parallel: true
LEFT join dataframes finished

Issue description

It seems that left joining a dataframe with another one yields incorrect results when the right dataframe contains nulls in rows that have a match and its column types are set explicitly, either via schema_overrides or by casting: the nulls in the matched rows are replaced with default values (0, 0.0 and an empty string). Not casting the columns yields the correct result.

It's not shown in the example, but reversing the join by doing df2_casted.join(df, on='k', how='right') behaves the same.

Expected behavior

I would expect the two joins to be equivalent, since by definition a left join should keep the matched rows exactly as they are in the original dataframe.

Installed versions

--------Version info---------
Polars:              1.13.1
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.12.7 (tags/v3.12.7:0b05ead, Oct  1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.1.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@DavideCanton
Author

It seems related to df2 having a single row. If it has many rows, the issue does not show up:

>>> df = pl.DataFrame({'k': [1,2]})

>>> df
shape: (2, 1)
┌─────┐
│ k   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘

>>> df2 = pl.DataFrame({'k': [1,4], 'int': [None,None], 'float':[None,None], 'str': [None,None]}, schema_overrides={'int': pl.Int64, 'float': pl.Float64, 'str': pl.String})

>>> df2
shape: (2, 4)
┌─────┬──────┬───────┬──────┐
│ k   ┆ int  ┆ float ┆ str  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ f64   ┆ str  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ null ┆ null  ┆ null │
│ 4   ┆ null ┆ null  ┆ null │
└─────┴──────┴───────┴──────┘

>>> df.join(df2, on='k', how="left")
shape: (2, 4)
┌─────┬──────┬───────┬──────┐
│ k   ┆ int  ┆ float ┆ str  │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ f64   ┆ str  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ null ┆ null  ┆ null │
│ 2   ┆ null ┆ null  ┆ null │
└─────┴──────┴───────┴──────┘

@DavideCanton changed the title from "Left/right join with casted dataframe replaces nulls with default values." to "Left/right join with single row dataframe with explicit types replaces nulls with default values." on Nov 15, 2024
@DavideCanton
Author

It seems that the bug was introduced in version 1.13.0; version 1.12.0 works fine.

@ritchie46
Member

Fixed by #19823. Will issue a release tomorrow.

@DavideCanton
Author

Thanks!
