join_asof breaks with certain parquet files (I think due to memory layout or something?) #16819

Closed
2 tasks done
kszlim opened this issue Jun 7, 2024 · 2 comments · Fixed by #16837
Labels
bug (Something isn't working), P-high (Priority: high), python (Related to Python Polars), regression (Issue introduced by a new release)

Comments

kszlim commented Jun 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

parquets.zip

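Extract the two parquet files (left.parquet and right.parquet) from the archive, then run:

import polars as pl

# Lazily scan both files extracted from parquets.zip.
l = pl.scan_parquet('left.parquet')
r = pl.scan_parquet('right.parquet')
# As-of join on the timestamp within each run_id; this raises the ShapeError shown under "Log output".
l.join_asof(r, on='timestamp_monotonic_ns', by='run_id').collect()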

Log output

ShapeError: unable to hstack Series of length 409467 and DataFrame of height 448221

Issue description

This throws with:
ShapeError: unable to hstack Series of length 409467 and DataFrame of height 448221

This is a regression between 0.20.3 and 0.20.4: the same join works on 0.20.3 and fails from 0.20.4 onward.

I only noticed it after joining some new tables.

Expected behavior

This should work and output something like:

┌────────┬────────────────────────┬────────┐
│ run_id ┆ timestamp_monotonic_ns ┆ status │
│ ---    ┆ ---                    ┆ ---    │
│ i64    ┆ i64                    ┆ bool   │
╞════════╪════════════════════════╪════════╡
│ 1      ┆ 4000000                ┆ null   │
│ 1      ┆ 6000000                ┆ null   │
│ 1      ┆ 8000000                ┆ null   │
│ 1      ┆ 10000000               ┆ null   │
│ …      ┆ …                      ┆ …      │
│ 1      ┆ 896438000000           ┆ false  │
│ 1      ┆ 896440000000           ┆ false  │
│ 1      ┆ 896442000000           ┆ false  │
│ 1      ┆ 896444000000           ┆ false  │
└────────┴────────────────────────┴────────┘

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.9.6 (default, Feb  3 2024, 15:58:27)
[Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  0.5.1
cloudpickle:          <not installed>
connectorx:           0.3.1
deltalake:            0.10.0
fastexcel:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.7.2
nest_asyncio:         1.5.6
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.0
pydantic:             2.0.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.18
torch:                <not installed>
xlsx2csv:             0.8.1
xlsxwriter:           3.1.2

kszlim added the bug, needs triage, and python labels on Jun 7, 2024

kszlim commented Jun 8, 2024

Thanks to @cmdlineluser, there's a workaround: pass rechunk=True to pl.scan_parquet.

deanm0000 added the regression and P-high labels and removed the needs triage label on Jun 8, 2024
github-project-automation (bot) moved this to Ready in Backlog on Jun 8, 2024
ritchie46 self-assigned this on Jun 9, 2024
github-project-automation (bot) moved this from Ready to Done in Backlog on Jun 9, 2024

kszlim commented Jun 10, 2024

Awesome, thanks!
