scan_ndjson not handling projection pushdown correctly #17553

Closed
2 tasks done
lithomas1 opened this issue Jul 10, 2024 · 1 comment · Fixed by #17631
Labels: A-io-json (Area: reading/writing JSON files), accepted (Ready for implementation), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars), regression (Issue introduced by a new release)

Comments

lithomas1 commented Jul 10, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from polars.testing import assert_frame_equal
df = pl.DataFrame(
    {
        "a": [1, 2, 3, None],
        "b": ["ẅ", "x", "y", "z"],
        "c": [None, None, 4, 5],
    }
)
df.write_ndjson("file.jsonl")
q = pl.scan_ndjson(
    "file.jsonl",
    row_index_name="row-index",
    row_index_offset=0
).select(["row-index", "a"])

df1 = q.collect(projection_pushdown=True)
df2 = q.collect(projection_pushdown=False)

assert_frame_equal(df1, df2)

Log output

No response

Issue description

When scanning an NDJSON file and then projecting columns with select (here, the row index plus one column from the scanned data), Polars drops the row-index column when projection pushdown is enabled.

I believe the correct solution is just to swap the row index and projection conditions here
235ebee#diff-f91032885fa26496b1aa443e731f1dbb9d346e0567d4bafc96999839209e01b1R319-R326

(When the row_index is not in the selected columns, I believe the optimizer passes it as None in the IR, so I think swapping the conditions should be correct for both row_index being selected and not being selected. I can try to submit a PR for this if the patch is correct.)
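
To make the ordering argument concrete, here is a toy model in plain Python. It is not the Polars source: the helper names are hypothetical, and it assumes that the projection handed to the reader lists only columns that exist in the file while the row index is passed separately. Under those assumptions, adding the index before projecting loses it, and swapping the two steps keeps it.

```python
from typing import Optional

Rows = list[dict]


def with_row_index(rows: Rows, name: str, offset: int = 0) -> Rows:
    """Prepend a running index column to every row (hypothetical helper)."""
    return [{name: offset + i, **row} for i, row in enumerate(rows)]


def project(rows: Rows, columns: list[str]) -> Rows:
    """Keep only the requested columns (hypothetical helper)."""
    return [{c: row[c] for c in columns} for row in rows]


def read_index_then_project(
    rows: Rows, projection: Optional[list[str]], row_index: Optional[str]
) -> Rows:
    # Assumed buggy order: the index is added first, then projected away,
    # because the projection only names file columns.
    if row_index is not None:
        rows = with_row_index(rows, row_index)
    if projection is not None:
        rows = project(rows, projection)
    return rows


def read_project_then_index(
    rows: Rows, projection: Optional[list[str]], row_index: Optional[str]
) -> Rows:
    # Swapped order: project the file columns first, then add the index.
    if projection is not None:
        rows = project(rows, projection)
    if row_index is not None:
        rows = with_row_index(rows, row_index)
    return rows


file_rows = [
    {"a": 1, "b": "ẅ", "c": None},
    {"a": 2, "b": "x", "c": None},
    {"a": 3, "b": "y", "c": 4},
    {"a": None, "b": "z", "c": 5},
]

lost = read_index_then_project(file_rows, ["a"], "row-index")
kept = read_project_then_index(file_rows, ["a"], "row-index")

print(list(lost[0]))  # ['a']
print(list(kept[0]))  # ['row-index', 'a']
```

When the row index is not selected and arrives as None, both orders return the same rows in this model, which matches the reasoning that the swap is safe in either case.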

Expected behavior

The row index should be kept, as it is for CSV and as it is when the projection_pushdown option of collect is False.

Installed versions

--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  1.0.0
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.18.2
fastexcel:            0.10.4
fsspec:               2024.6.1
gevent:               24.2.1
great_tables:         0.9.0
hvplot:               0.10.0
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.5
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.8.0
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                2.3.1.post300
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
@lithomas1 lithomas1 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 10, 2024
@stinodego stinodego added the A-io-json Area: reading/writing JSON files label Jul 11, 2024
@ritchie46
Member

@nameexhaustion can you take a look here?

@nameexhaustion nameexhaustion added regression Issue introduced by a new release accepted Ready for implementation P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Jul 15, 2024
@nameexhaustion nameexhaustion self-assigned this Jul 15, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jul 15, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jul 15, 2024