Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LazyFrame properties now require significant compute - update API to reflect this #16328

Closed
2 tasks done
dmeekpmg opened this issue May 20, 2024 · 12 comments · Fixed by #16964
Closed
2 tasks done

LazyFrame properties now require significant compute - update API to reflect this #16328

dmeekpmg opened this issue May 20, 2024 · 12 comments · Fixed by #16964
Assignees
Labels
A-api Area: changes to the public API accepted Ready for implementation python Related to Python Polars
Milestone

Comments

@dmeekpmg
Copy link

dmeekpmg commented May 20, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

#%%
import polars as pl
df = pl.LazyFrame({'a': [1, 2, 3]})
#%%
%%timeit
df.columns
  • v0.20.21: 115 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
  • v0.20.23: 1 µs ± 8.75 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • v0:20:26: 1.04 µs ± 39.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Log output

No response

Issue description

LazyFrame.columns performs much slower from version 0.20.23. In my simple example above, 0.20.23 is nearly 10x slower than 0.20.21.
In my more realistic data pipeline where I use this function to determine the common columns between two dataframes, this calculation accounts for 92% of my runtime in v0.20.23 and only 0.3% in v0.20.21

Performance of DataFrame.columns appears unchanged.

Expected behavior

Performance of .columns is unchanged

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.12.2 (tags/v3.12.2:6abddd9, Feb  6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.3
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.30
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
@dmeekpmg dmeekpmg added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 20, 2024
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label May 20, 2024
@ritchie46
Copy link
Member

Yes. It is.

columns needs to resolve the logical plan and is actually quite expensive. I think we should deprecate a columns and schema as property tags and add them as functions to make more clear that they do non-trivial compute.

@ritchie46 ritchie46 removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels May 20, 2024
@stinodego stinodego added this to the 1.0.0 milestone May 20, 2024
@stinodego stinodego added A-api Area: changes to the public API accepted Ready for implementation and removed performance Performance issues or improvements labels May 20, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 23, 2024
@stinodego
Copy link
Contributor

stinodego commented May 23, 2024

All LazyFrame properties now require significant compute. They will be replaced by methods to reflect this:

Before After
LazyFrame.columns ...
LazyFrame.dtypes ...
LazyFrame.schema LazyFrame.collect_schema()
LazyFrame.width ...

The LazyFrame properties will be deprecated.

To facilitate writing code that is generic for DataFrames and LazyFrames, DataFrames will have the same methods added.
The DataFrame properties will not be deprecated. These do not require significant compute.

We will include documentation in the deprecation message / docstrings to explain the reason for this change.

@stinodego stinodego changed the title Degradation in performance of LazyFrame.columns LazyFrame properties now require significant compute - update API to reflect this May 23, 2024
@stinodego stinodego self-assigned this May 24, 2024
@stinodego stinodego moved this from Ready to Next in Backlog May 26, 2024
@stinodego stinodego added needs decision Awaiting decision by a maintainer accepted Ready for implementation and removed accepted Ready for implementation needs decision Awaiting decision by a maintainer labels May 27, 2024
@stinodego
Copy link
Contributor

stinodego commented May 27, 2024

We have decided to add a .collect_schema() method to both DataFrame and LazyFrame.

The LazyFrame.schema property will remain, but it will throw a PerformanceWarning and tell you to use .collect_schema.
The DataFrame.schema property will remain and throw no warnings.

With regards to the other properties: these will be handled in a second step (most likely post-1.0.0).

collect_schema will be updated to return a proper Schema object with accessors for dtypes, column_names, and width. The properties will throw a PerformanceWarning and refer to use the properties on the Schema object.

@deanm0000
Copy link
Collaborator

Just stumbled on this, could you add in .collect_schema_async too please?

@eitsupi
Copy link
Contributor

eitsupi commented Jun 12, 2024

Just to confirm, this is not as expensive as collect (or fetch), right?

@alexander-beedie
Copy link
Collaborator

Just to confirm, this is not as expensive as collect (or fetch), right?

It is not; this is the part of the compute required to establish the query plan and track schema evolution through it - however, it will not execute that plan 👍

@stinodego
Copy link
Contributor

Just stumbled on this, could you add in .collect_schema_async too please?

Please make a separate issue for that request. I will not add it as part of the initial feature.

@etiennebacher
Copy link
Contributor

Maybe it's a bit late to raise this concern but the name sounds weird for DataFrame. For LazyFrame we can clearly see the collect() / collect_schema() / collect_all() similarity but DataFrame doesn't have collect() so having a collect_schema() is strange.

Isn't get_schema() better?

@stinodego
Copy link
Contributor

stinodego commented Jun 13, 2024

It sounds a bit strange indeed, but it has to be added to DataFrame to facilitate writing generic code that works for both DataFrame/LazyFrame.

There is a difference, because for DataFrame the schema just exists, you only have to get it. For LazyFrame there is compute involved. Whichever name you come up with is going to be inappropriate for either of the two.

For DataFrame it sounds a bit strange, but you don't generally have to use it. Just use .schema.

@mcrumiller
Copy link
Contributor

writing generic code that works for both DataFrame/LazyFrame

Does this mean a DataFrame.collect() no-op should exist?

@stinodego
Copy link
Contributor

Does this mean a DataFrame.collect() no-op should exist?

So far we have decided against that:
#7882 (comment)

Not sure what that would mean for a DataFrame.collect_schema.

@JacobSantry
Copy link

This performance regression has significantly impacted my use case.

I’ve observed the regression between versions 0.20.21 and 0.20.22 and through to the latest 1.8.2 with the .collect_schema().names() syntax.

Could you help me understand why the performance cost of resolving the logical plan for columns only became an issue after this update? Previously, it seemed that the evaluation was necessary but still performed as expected.

While I agree with updating the API to reflect this cost, I’m curious about the reasons behind the increased cost. Is it possible to cache the schema, provided the logical plan has not changed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API accepted Ready for implementation python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

9 participants