Could we defer to flox for GroupBy.first? #9647

Open
max-sixty opened this issue Oct 18, 2024 · 4 comments · Fixed by #9986

Comments

@max-sixty
Collaborator

Is your feature request related to a problem?

I was wondering why a groupby("foo").first() call was running so slowly. I think we run a Python loop for this rather than calling into flox:

xarray/xarray/core/groupby.py

Lines 1218 to 1231 in b9780e7

    def _first_or_last(self, op, skipna, keep_attrs):
        if all(
            isinstance(maybe_slice, slice)
            and (maybe_slice.stop == maybe_slice.start + 1)
            for maybe_slice in self.encoded.group_indices
        ):
            # NB. this is currently only used for reductions along an existing
            # dimension
            return self._obj
        if keep_attrs is None:
            keep_attrs = _get_keep_attrs(default=True)
        return self.reduce(
            op, dim=[self._group_dim], skipna=skipna, keep_attrs=keep_attrs
        )
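
The shortcut at the top of this method can be read in isolation. The following is a minimal standalone sketch, not xarray's code: the helper name `every_group_is_one_element` is hypothetical, and `group_indices` is assumed to be a list whose entries are either `slice` objects or lists of integer positions, as the quoted snippet suggests.

```python
def every_group_is_one_element(group_indices):
    """Return True when each group is a slice covering exactly one position."""
    return all(
        isinstance(idx, slice) and idx.stop == idx.start + 1
        for idx in group_indices
    )


# Each label occurs once, so every group is a one-element slice.
unique_groups = [slice(0, 1), slice(1, 2), slice(2, 3)]

# The first group spans two positions, so a real reduction is needed.
repeated_groups = [slice(0, 2), slice(2, 3)]

# Groups stored as integer lists never qualify, even when they have length 1.
list_groups = [[0], [1]]

print(every_group_is_one_element(unique_groups))    # True
print(every_group_is_one_element(repeated_groups))  # False
print(every_group_is_one_element(list_groups))      # False
```

When the check passes there is nothing to reduce, which is why the quoted method returns the object unchanged; in every other case it falls through to the generic reduce call.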

Describe the solution you'd like

Could we call into flox? Numbagg has the routines...

Describe alternatives you've considered

No response

Additional context

No response

@dcherian
Contributor

dcherian commented Oct 20, 2024

Yes, the minor complication is that we should dispatch nanfirst and nanlast to flox, but not first and last. The latter are simply indexing with an indexer we already know, so the reduction approach is overkill.
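
To illustrate the distinction, here is a plain-Python sketch under stated assumptions: `codes` is a hypothetical list of integer group labels in `0..ngroups-1`, and the helper names are made up for this example; none of this is xarray's or flox's API. `first` only needs the position where each group first appears, which depends on the labels alone, whereas `nanfirst` has to inspect the data to skip missing values.

```python
import math


def first_positions(codes, ngroups):
    """Position of the first occurrence of each group label (-1 if absent)."""
    positions = [-1] * ngroups
    for i, code in enumerate(codes):
        if positions[code] == -1:
            positions[code] = i
    return positions


def first(values, codes, ngroups):
    """Plain indexing: the indexer depends only on the labels, not the data."""
    return [
        values[p] if p >= 0 else math.nan
        for p in first_positions(codes, ngroups)
    ]


def nanfirst(values, codes, ngroups):
    """Reduction: the data itself must be scanned so NaNs can be skipped."""
    result = [math.nan] * ngroups
    for value, code in zip(values, codes):
        if math.isnan(result[code]) and not math.isnan(value):
            result[code] = value
    return result


values = [math.nan, 1.0, 2.0, 3.0, 4.0]
codes = [0, 1, 0, 1, 0]

print(first(values, codes, 2))     # [nan, 1.0]
print(nanfirst(values, codes, 2))  # [2.0, 1.0]
```

The positions returned by `first_positions` can be computed once and reused for every variable that shares the grouping, which is not possible for `nanfirst` because the result depends on where the NaNs fall in each array.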

Closing #8025 in favor of this one.

Out of curiosity how many groups does your problem have?

@max-sixty
Collaborator Author

Sorry I missed #8025. I thought I had searched; I guess searching for "first" hit lots of unrelated issues and I missed it.

> Out of curiosity how many groups does your problem have?

About 15K...

@dcherian
Contributor

> About 15K...

Do you end up using dask for this, or just numbagg? Are these groups randomly distributed along the dimension, or are there patterns to how they are distributed (e.g. are they sequential)?

Just curious...

@max-sixty
Collaborator Author

> Do you end up using dask for this, or just numbagg?

I ended up just leaving it running for hours!

> Are these groups randomly distributed along the dimension, or are there patterns to how they are distributed (e.g. are they sequential)?

Yes they're largely sequential!
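
When labels are largely sequential, each group occupies a contiguous run, so its members can be described by a slice and its first and last elements are just the run boundaries. A minimal sketch, assuming a hypothetical list `codes` of group labels in which equal labels are adjacent (the function name is illustrative, not part of xarray or flox):

```python
from itertools import groupby


def run_slices(codes):
    """Map each label to the slice covering its contiguous run.

    Assumes equal labels are adjacent; a label that reappears later would
    overwrite its earlier run, so this is not valid for scattered groups.
    """
    slices = {}
    start = 0
    for label, run in groupby(codes):
        length = sum(1 for _ in run)
        slices[label] = slice(start, start + length)
        start += length
    return slices


codes = ["a", "a", "a", "b", "b", "c"]
values = [10, 11, 12, 20, 21, 30]

slices = run_slices(codes)
first = {label: values[s.start] for label, s in slices.items()}
last = {label: values[s.stop - 1] for label, s in slices.items()}

print(first)  # {'a': 10, 'b': 20, 'c': 30}
print(last)   # {'a': 12, 'b': 21, 'c': 30}
```

With this layout, first and last need one lookup per group rather than a pass over every element, which matters when there are on the order of 15K groups.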

dcherian added a commit to dcherian/xarray that referenced this issue Jan 25, 2025
1. Use flox where possible.
2. Use simple indexing where possible.

Closes pydata#9647
dcherian added a commit to dcherian/xarray that referenced this issue Jan 29, 2025
dcherian reopened this Jan 29, 2025