Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Ergonomic casts from Categorical to Enum #20645

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Jan 9, 2025

ref: #19868 (comment).

Allow for stable ergonomic casts from Categorical to Enum.

If no categories are defined, we read the Categorical rev_map and normalise to a lexically sorted Enum (which ensures that if the same categories appear in a different order they will return the same Enum).

Example

import polars as pl

df = pl.DataFrame({
  "id": [0, 1, 2, 3],
  "foo": pl.Series(["aaa", "bbb", "aaa", "bbb"], dtype=pl.Categorical),
  "bar": pl.Series(["zzz", "xxx", "xxx", "yyy"], dtype=pl.Categorical),
  "baz": pl.Series(["xxx", "yyy", "zzz", "xxx"], dtype=pl.Categorical),
})

Before:
(note that we have to sort the categories to ensure a stable Enum)

df.cast({
  "foo": pl.Enum(sorted(df["foo"].cat.get_categories())),
  "bar": pl.Enum(sorted(df["bar"].cat.get_categories())),
  "baz": pl.Enum(sorted(df["baz"].cat.get_categories())),
})

After:
(no boilerplate, and the resulting enums will be stable/sorted)

df.cast({pl.Categorical: pl.Enum})

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 9, 2025
@ritchie46
Copy link
Member

I am not sure about this. The fields in an enum belong to the data type. In a categorical they are inferred at runtime.

Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (6cd9988) to head (90521e2).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...-core/src/chunked_array/logical/categorical/mod.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20645   +/-   ##
=======================================
  Coverage   79.01%   79.01%           
=======================================
  Files        1557     1557           
  Lines      220552   220560    +8     
  Branches     2514     2514           
=======================================
+ Hits       174274   174284   +10     
+ Misses      45705    45703    -2     
  Partials      573      573           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2025

I am not sure about this. The fields in an enum belong to the data type. In a categorical they are inferred at runtime.

Yup; it's an ergonomic improvement to something that may not actually be a good idea, heh. I'll park in Draft so you can have a think about it; not a big PR - feel free to close if you prefer not to merge ;)

@alexander-beedie alexander-beedie marked this pull request as draft January 9, 2025 15:32
@coastalwhite
Copy link
Collaborator

I don't think "lexically sorted enum"s exist. That will panic in many places.

@mcrumiller
Copy link
Contributor

FYI there's a little relevant discussion in #14952.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2025

I don't think "lexically sorted enum"s exist. That will panic in many places.

As in, the Enum gets created with the values having been sorted (as the same Categorical values coming from several sources could have a very different physical ordering, but -if we were to merge this- we'd want the Enum equivalent to resolve consistently). I could have worded that better ;)

@s-banach
Copy link
Contributor

s-banach commented Jan 9, 2025

I just wanted to freeze an existing categorical encoding, e.g. the output of cut(). Definitely didn’t imagine it would change the ordering.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 10, 2025

I just wanted to freeze an existing categorical encoding, e.g. the output of cut(). Definitely didn’t imagine it would change the ordering.

They're not quite the same thing; casting an arbitrary Categorical vs freezing a specific Categorical.

@ritchie46: What do you think about a dedicated .cat.to_enum() for this use-case instead? Should avoid most of the issues raised here as it would be very explicit, eg: "I know these categories should become an enum" vs the "I hope all the categories I need are here" that comes from integrating as top-level cast support 🤔

A dedicated method could also have a parameter to indicate whether the categories should be sorted or taken "as-is" too. Something like this:

pl.col("countries").cat.to_enum(order="lexical")
pl.col("countries").cat.to_enum(sort=True)  # or just a bool?

@ritchie46
Copy link
Member

There is no thing as freezing the categorical to enum. In a categorical, order of categories doesn't matter (currently it does, but we will fix that, in global it doesn't). In an enum the categories are ordered.

I want people to be deliberate about that. If they are casting to enum currently because they can circumvent some issues with categorical, that's not a good reason to implement this. We need to improve categorical.

@alexander-beedie
Copy link
Collaborator Author

I want people to be deliberate about that.

Indeed; hence the suggestion to move it out into an explicit/dedicated method that allows them to do that, rather than have it live in cast. Not to worry, will close this out and it can be revisited some other time 👌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants