Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pl.DataFrame.collect similarly to pl.LazyFrame.lazy #6846

Closed
StefanBRas opened this issue Feb 13, 2023 · 6 comments
Closed

Implement pl.DataFrame.collect similarly to pl.LazyFrame.lazy #6846

StefanBRas opened this issue Feb 13, 2023 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@StefanBRas
Copy link
Contributor

Problem description

In the same way pl.LazyFrame has .lazy() it would be nice if pl.DataFrame had .collect(), so it's easier to write code that works for both. (as the docstring for pl.LazyFrame.lazy notes).

My specific motivation is written some generic test assertions where I now have to do:

def assert_no_new_rows(new: pl.DataFrame | pl.LazyFrame, old: pl.DataFrame | pl.LazyFrame):
    if isinstance(new, pl.LazyFrame):
        new = new.collect()
    if isinstance(old, pl.LazyFrame):
        old = old.collect()
    assert new.height == old.height

If pl.DataFrame.collect was Implemented this could be written as:

def assert_no_new_rows(new: pl.DataFrame | pl.LazyFrame, old: pl.DataFrame | pl.LazyFrame):
   assert new.collect().height == old.collect().height

If you think it makes sense, I can make a PR.

@StefanBRas StefanBRas added the enhancement New feature or an improvement of an existing feature label Feb 13, 2023
@s-banach
Copy link
Contributor

Maybe your use of DataFrame | LazyFrame makes more sense outside of this toy example.
But this example is nonsensical, because your function is calling .collect() and then discarding the result, meaning you'll have to call .collect() again outside the function, and run the whole computation twice.

@StefanBRas
Copy link
Contributor Author

I wouldn't call it nonsensical, it's the exact functionality I want. It's for testing correctness and it's fast enough that I'd prefer to (maybe) have to do a recalculation as opposed to make conditional paths for DataFrame versus LazyFrame.

I feel like the argument is the exact same as the argument for LazyFrame.lazy - "Useful for writing code that expects either a DataFrame or LazyFrame."

Another example could be that I wanted to plot some intermediate subresults:

def plot(df: pl.LazyFrame | pl.DataFrame):
    data = df.select('value').filter(pl.col('value') > 100).collect()
    my_plot(data)

def calculate_and_plot_thing(df: pl.LazyFrame | pl.DataFrame):
    first_df = df.with_columns(...)
    plot(first_df)
    second_df = first_df.with_columns(...)
    plot(second_df)
    return second_df.with_columns(...)

If I rewrote plot to only take DataFrame I would have to collect outside plot meaning that I would either have to collect without the optimization from select and filter or I would have to put the select and filter outside of the plot function, making it less obvious and modular.

@s-banach
Copy link
Contributor

Sorry for being rude. I didn't even know about LazyFrame.lazy(), which shows there is some precedent for this kind of thing.

@ghuls
Copy link
Collaborator

ghuls commented Feb 13, 2023

df.lazy().collect() should work for both DataFrame and LazyFrame.

@StefanBRas
Copy link
Contributor Author

df.lazy().collect() should work for both DataFrame and LazyFrame.

Ah yes, that works. Thanks!

Given the existence of a lazy: (DataFrame | LazyFrame) -> LazyFrame I think it makes sense to have the inverse collect: (LazyFrame | DataFrame) -> DataFrame but I guess it's matter of preference. There are the caveats of potential extra work if people end up collecting the same LazyFrame multiple times.

@stinodego
Copy link
Contributor

Closing in favor of #7882

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jul 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

4 participants