Feature request: Add collect() to DataFrame #19548

Open
Chuck321123 opened this issue Oct 31, 2024 · 4 comments
Labels
enhancement: New feature or an improvement of an existing feature
reference: Reference issue for recurring topics

Comments

@Chuck321123

Description

collect() is only available on LazyFrames. However, I sometimes run my code in eager mode and sometimes in lazy mode, and the number of if-else and try-except blocks I have to write makes it exhausting to switch between the two. I would have preferred not to get an AttributeError when calling collect() on an eager frame. I can't be the only one wanting this, I believe, or have I missed something?

Example:

import polars as pl

df = pl.DataFrame({"column1": [1, 2, 3]})

# Currently raises AttributeError, because DataFrame has no collect() method
df = df.collect()
@Chuck321123 Chuck321123 added the enhancement New feature or an improvement of an existing feature label Oct 31, 2024
@MarcoGorelli
Collaborator

hey, thanks for the request

this was discussed previously and rejected, could you search the issue tracker please?

@cmdlineluser
Contributor

@Chuck321123
Author

@MarcoGorelli Hmm, I see. I would still be open to getting a warning message instead, or to having to pass an explicit keyword argument to make it work.

@alexander-beedie
Collaborator

alexander-beedie commented Nov 1, 2024

The extreme asymmetry in the pros and cons is what makes this undesirable (imho):

  • Pro

    Slightly more convenient in a few cases when experimenting between eager/lazy.

  • Con

    You thought you were operating in lazy mode and taking advantage of full query plan optimisation because you can see the final collect(), but actually you forgot to switch read_parquet back to scan_parquet. Now your production pipeline is operating in eager mode, is an order of magnitude slower, and your AWS bill went up 10x for the week until the slowdown was root-caused ;)

@nameexhaustion nameexhaustion added the reference Reference issue for recurring topics label Nov 26, 2024
@nameexhaustion nameexhaustion changed the title Feature request: Don't raise error when collecting an eager frame Feature request: Add collect() to DataFrame Nov 26, 2024
5 participants