Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Don't fallback to pandas after simpe hasattr check #17678

Open
MarcoGorelli opened this issue Jan 3, 2025 · 3 comments
Open

[FEA] Don't fallback to pandas after simpe hasattr check #17678

MarcoGorelli opened this issue Jan 3, 2025 · 3 comments
Labels
cudf.pandas Issues specific to cudf.pandas feature request New feature or request

Comments

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Jan 3, 2025

Discovered in Narwhals

If I have a function like:

def process_df(df):
    if hasattr(df, 'with_columns'):
        # Do something. cudf/pandas wouldn't go here
        return df.with_columns(...)
    return df.groupby('a')['b'].sum()

and run it with cudf.pandas, then this is enough to cause a fallback to pandas - even though cudf would have been perfectly capable of executing it!

This isn't farfetched - you can find examples of libraries doing hasattr checks all over the place, e.g. this one in pymc

https://github.com/pymc-devs/pymc/blob/ae4a6292aa48d5722dba6ab422e1a3d895ca7bf7/pymc/data.py#L232-L248

Scikit-learn:

https://github.com/scikit-learn/scikit-learn/blob/c9aeb15f8f1c7c54ed4ef27c871f7167e2ce3077/sklearn/feature_selection/_base.py#L129-L131

Plotly:

https://github.com/plotly/plotly.py/blob/231edaa61f8590f715134d42ea0bc2f858dd713e/packages/python/plotly/plotly/express/_imshow.py#L303-L312


I asked about this on Slack, and got the response

I think we still need to fall back because we need to raise the same AttributeError that pandas would raise if it didn’t exist.

As a user, I consider falling back to pandas unnecessarily to be significantly worse than raising a slightly inconsistent AttributeError message - especially when using hasattr which is when the AttributeError would be caught anyway.

If exception raising needs to be perfectly consistent, perhaps there's a way to still raise it without needing to fallback to pandas? e.g.

def __getattr(self, attr):
    if hasattr(pandas.DataFrame, attr):
        # do the fallback, raise if necessary
    else:
        msg = f"{attr} is present in neither cuDF nor pandas"
        raise AttributeError(msg)
@mroeschke
Copy link
Contributor

mroeschke commented Jan 3, 2025

Just wanted to comment that this is an interesting class of optimization that I think would be nice for cudf.pandas - Can cudf.pandas not fall back if it's an exception that would raise in pandas too? Like this hasattr check, other cases might be:

  • Calling methods that don't exist in either library.
  • Invalid arguments
  • Known invalid operations between two types (e.g. binops between strings and lists)

@mroeschke mroeschke added the cudf.pandas Issues specific to cudf.pandas label Jan 3, 2025
@vyasr
Copy link
Contributor

vyasr commented Jan 15, 2025

I would be fine with not falling back just to do a check that we know will also fail on the other end, but we will have to be careful to handle some special cases and I fear that it'll be hard to have a general solution here without accepting some degradation in correctness (i.e. we'll have some cases that would have succeeded with fallback that no longer will). In particular, in addition to checking whether the attribute exists on the pandas type, we will also have to check whether the pandas type overrides __getattr__ because the pandas implementation might behave differently from cudf's if we have a bug (or have missed implementing it altogether).

For example, you can access DataFrame columns using its __getattr__ overload. I'm reasonably confident that our implementation of that particular case is correct, but there are more cases beyond that (GroupBy columns and others) where we might have bugs or missing coverage, so for the moment let's imagine that our DataFrame.__getattr__ implementation did not exist (or had a bug) since it's the easiest to provide an example for. My concern would be a case like this

%load_ext cudf.pandas
df = pd.DataFrame({'a': [1, 2, 3]})
print(df.a.sum())

If our implementation is missing, we will fail the attribute check in cudf, but we will also then proceed to fail to see it in pandas because pd.DataFrame does not have an attribute a, but pd.DataFrame.__getattr__(df, "a") will return a Series due to the override.

As Marco says, there may be enough usage of hasattr checks in the wild to justify some sort of short-circuiting, but we should be aware of the potential pitfalls and consider if we need to maybe make this kind of behavior configurable in the instantiation of a particular proxy type so that we can filter which classes we are confident enough in their behavior to allow this.

@MarcoGorelli
Copy link
Contributor Author

thanks @vyasr

If it's not possible to do this everywhere, would you consider not doing the fallback for the following methods:

  • '__narwhals_dataframe__'
  • '__narwhals_lazyframe__'
  • '__narwhals_series__'
    ?

These are just methods that we check whether incoming object have, and for cuDF, I'd expect hasattr to just return False and let us continue to the next branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.pandas Issues specific to cudf.pandas feature request New feature or request
Projects
Status: Todo
Development

No branches or pull requests

3 participants