[FEA] Don't fallback to pandas after simpe `hasattr` check #17678

MarcoGorelli · 2025-01-03T09:31:05Z

Discovered in Narwhals

If I have a function like:

def process_df(df):
    if hasattr(df, 'with_columns'):
        # Do something. cudf/pandas wouldn't go here
        return df.with_columns(...)
    return df.groupby('a')['b'].sum()

and run it with cudf.pandas, then this is enough to cause a fallback to pandas - even though cudf would have been perfectly capable of executing it!

This isn't farfetched - you can find examples of libraries doing hasattr checks all over the place, e.g. this one in pymc

https://github.com/pymc-devs/pymc/blob/ae4a6292aa48d5722dba6ab422e1a3d895ca7bf7/pymc/data.py#L232-L248

Scikit-learn:

https://github.com/scikit-learn/scikit-learn/blob/c9aeb15f8f1c7c54ed4ef27c871f7167e2ce3077/sklearn/feature_selection/_base.py#L129-L131

Plotly:

https://github.com/plotly/plotly.py/blob/231edaa61f8590f715134d42ea0bc2f858dd713e/packages/python/plotly/plotly/express/_imshow.py#L303-L312

I asked about this on Slack, and got the response

I think we still need to fall back because we need to raise the same AttributeError that pandas would raise if it didn’t exist.

As a user, I consider falling back to pandas unnecessarily to be significantly worse than raising a slightly inconsistent AttributeError message - especially when using hasattr which is when the AttributeError would be caught anyway.

If exception raising needs to be perfectly consistent, perhaps there's a way to still raise it without needing to fallback to pandas? e.g.

def __getattr(self, attr):
    if hasattr(pandas.DataFrame, attr):
        # do the fallback, raise if necessary
    else:
        msg = f"{attr} is present in neither cuDF nor pandas"
        raise AttributeError(msg)

The text was updated successfully, but these errors were encountered:

mroeschke · 2025-01-03T19:59:23Z

Just wanted to comment that this is an interesting class of optimization that I think would be nice for cudf.pandas - Can cudf.pandas not fall back if it's an exception that would raise in pandas too? Like this hasattr check, other cases might be:

Calling methods that don't exist in either library.
Invalid arguments
Known invalid operations between two types (e.g. binops between strings and lists)

vyasr · 2025-01-15T23:57:34Z

I would be fine with not falling back just to do a check that we know will also fail on the other end, but we will have to be careful to handle some special cases and I fear that it'll be hard to have a general solution here without accepting some degradation in correctness (i.e. we'll have some cases that would have succeeded with fallback that no longer will). In particular, in addition to checking whether the attribute exists on the pandas type, we will also have to check whether the pandas type overrides __getattr__ because the pandas implementation might behave differently from cudf's if we have a bug (or have missed implementing it altogether).

For example, you can access DataFrame columns using its __getattr__ overload. I'm reasonably confident that our implementation of that particular case is correct, but there are more cases beyond that (GroupBy columns and others) where we might have bugs or missing coverage, so for the moment let's imagine that our DataFrame.__getattr__ implementation did not exist (or had a bug) since it's the easiest to provide an example for. My concern would be a case like this

%load_ext cudf.pandas
df = pd.DataFrame({'a': [1, 2, 3]})
print(df.a.sum())

If our implementation is missing, we will fail the attribute check in cudf, but we will also then proceed to fail to see it in pandas because pd.DataFrame does not have an attribute a, but pd.DataFrame.__getattr__(df, "a") will return a Series due to the override.

As Marco says, there may be enough usage of hasattr checks in the wild to justify some sort of short-circuiting, but we should be aware of the potential pitfalls and consider if we need to maybe make this kind of behavior configurable in the instantiation of a particular proxy type so that we can filter which classes we are confident enough in their behavior to allow this.

MarcoGorelli · 2025-01-16T19:12:45Z

thanks @vyasr

If it's not possible to do this everywhere, would you consider not doing the fallback for the following methods:

'__narwhals_dataframe__'
'__narwhals_lazyframe__'
'__narwhals_series__'
?

These are just methods that we check whether incoming object have, and for cuDF, I'd expect hasattr to just return False and let us continue to the next branch

MarcoGorelli added the feature request New feature or request label Jan 3, 2025

MarcoGorelli mentioned this issue Jan 3, 2025

[BUG] using python -m cudf.pandas and calling hasattr converts NA to NaN #17666

Closed

mroeschke added the cudf.pandas Issues specific to cudf.pandas label Jan 3, 2025

github-project-automation bot added this to cuDF Python Jan 3, 2025

github-project-automation bot moved this to Todo in cuDF Python Jan 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Don't fallback to pandas after simpe `hasattr` check #17678

[FEA] Don't fallback to pandas after simpe `hasattr` check #17678

MarcoGorelli commented Jan 3, 2025 •

edited

Loading

mroeschke commented Jan 3, 2025 •

edited

Loading

vyasr commented Jan 15, 2025

MarcoGorelli commented Jan 16, 2025

[FEA] Don't fallback to pandas after simpe hasattr check #17678

[FEA] Don't fallback to pandas after simpe hasattr check #17678

Comments

MarcoGorelli commented Jan 3, 2025 • edited Loading

mroeschke commented Jan 3, 2025 • edited Loading

vyasr commented Jan 15, 2025

MarcoGorelli commented Jan 16, 2025

[FEA] Don't fallback to pandas after simpe `hasattr` check #17678

[FEA] Don't fallback to pandas after simpe `hasattr` check #17678

MarcoGorelli commented Jan 3, 2025 •

edited

Loading

mroeschke commented Jan 3, 2025 •

edited

Loading