Using oindex and vindex instead of ExplicitIndexer objects. #9273

Draft
wants to merge 48 commits into main
Conversation

@dcherian (Contributor) commented Jul 24, 2024

Continues the refactoring of backends and lazy indexing classes

andersy005 and others added 12 commits May 9, 2024 21:45
* add .oindex and .vindex to BackendArray

* Add support for .oindex and .vindex in H5NetCDFArrayWrapper

* Add support for .oindex and .vindex in NetCDF4ArrayWrapper, PydapArrayWrapper, NioArrayWrapper, and ZarrArrayWrapper

* add deprecation warning

* Fix deprecation warning message formatting

* add tests

* Update xarray/core/indexing.py

Co-authored-by: Deepak Cherian <[email protected]>

* Update ZarrArrayWrapper class in xarray/backends/zarr.py

Co-authored-by: Deepak Cherian <[email protected]>

---------

Co-authored-by: Deepak Cherian <[email protected]>
…dexing adapters and explicitly indexed arrays (#8870)

* pass key tuple to indexing adapters and explicitly indexed arrays

* update indexing in StackedBytesArray

* Update indexing in StackedBytesArray

* Add _IndexerKey type to _typing.py

* Update indexing in StackedBytesArray

* use tuple indexing in test_backend_array_deprecation_warning

* Add support for CompatIndexedTuple in explicit indexing adapter

This commit updates the `explicit_indexing_adapter` function to accept both
`ExplicitIndexer` and the new `CompatIndexedTuple`. The `CompatIndexedTuple` is
designed to facilitate the transition towards using raw tuples by carrying
additional metadata about the indexing type (basic, vectorized, or outer).

* remove unused code

* type hint fixes

* fix docstrings

* fix tests

* fix docstrings

* Apply suggestions from code review

Co-authored-by: Deepak Cherian <[email protected]>

* update docstrings and pass tuples directly

* Some test cleanup

* update docstring

* use `BasicIndexer` instead of `CompatIndexedTuple`

* support explicit indexing with tuples

* fix mypy errors

* remove unused IndexerMaker

* Update LazilyIndexedArray._updated_key to support explicit indexing with tuples

---------

Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
* Trigger CI only if code files are modified.

Fixes #8705

* Apply suggestions from code review

Co-authored-by: Maximilian Roos <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>
* main:
  [pre-commit.ci] pre-commit autoupdate (#9005)
* main:
  Add whatsnew entry for #8974 (#9022)
  Zarr: Optimize appending (#8998)
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Anderson Banihirwe <[email protected]>
Co-authored-by: Anderson Banihirwe <[email protected]>
* main:
  New whatsnew section
  Release summary for v2024.05.0 (#9021)
* main:
  Bump codecov/codecov-action from 4.3.1 to 4.4.0 in the actions group (#9036)
  Use ME in test_plot instead of M (#9035)
  Fix numbagg or bottlekneck skip (#9034)
  [skip-ci] min_deps_check: show age of required pkg on error (#9025)
  attempt to get colour output in CI (#9031)
* main:
  array api-related upstream-dev failures (#8854)
  (fix): equality check against singleton `PandasExtensionArray` (#9032)
* main: (48 commits)
  Add test for #9155 (#9161)
  Remove mypy exclusions for a couple more libraries (#9160)
  Include numbagg in type checks (#9159)
  Improve zarr chunks docs (#9140)
  groupby: remove some internal use of IndexVariable (#9123)
  Improve `to_zarr` docs (#9139)
  Split out distributed writes in zarr docs (#9132)
  Update zendoo badge link (#9133)
  Support duplicate dimensions in `.chunk` (#9099)
  Bump the actions group with 2 updates (#9130)
  adjust repr tests to account for different platforms (#9127) (#9128)
  Grouper refactor (#9122)
  Update docstring in api.py for open_mfdataset(), clarifying "chunks" argument (#9121)
  Add test for rechunking to a size string (#9117)
  Move Sphinx directives out of `See also` (#8466)
  new whats-new section (#9115)
  release v2024.06.0 (#9113)
  release notes for 2024.06.0 (#9092)
  [skip-ci] Try fixing hypothesis CI trigger (#9112)
  Undo custom padding-top. (#9107)
  ...
* main: (54 commits)
  Adding `open_datatree` backend-specific keyword arguments (#9199)
  [pre-commit.ci] pre-commit autoupdate (#9202)
  Restore ability to specify _FillValue as Python native integers (#9258)
  add backend intro and how-to diagram (#9175)
  Fix copybutton for multi line examples in double digit ipython cells (#9264)
  Update signature for _arrayfunction.__array__ (#9237)
  Add encode_cf_datetime benchmark (#9262)
  groupby, resample: Deprecate some positional args (#9236)
  Delete ``base`` and ``loffset`` parameters to resample (#9233)
  Update dropna docstring (#9257)
  Grouper, Resampler as public api (#8840)
  Fix mypy on main (#9252)
  fix fallback isdtype method (#9250)
  Enable pandas type checking (#9213)
  Per-variable specification of boolean parameters in open_dataset (#9218)
  test push
  Added a space to the documentation (#9247)
  Fix typing for test_plot.py (#9234)
  Allow mypy to run in vscode (#9239)
  Revert "Test main push"
  ...
@dcherian (Contributor, Author)

Could use some input here.

#8909 snuck in because it only manifests with engine="scipy" when netcdf is not installed.

I haven't tracked down why, but it shows that we should be testing the backends individually; they should not depend on another backend being installed (at least not for the bundled engines). How can we do that nicely?

@keewis (Collaborator) commented Jul 31, 2024

> How can we do that nicely?

If I understand the backends correctly, we could monkeypatch list_engines to only return a single BackendEntrypoint. Something like this, maybe?

```python
def test_backend_scipy(..., monkeypatch):
    monkeypatch.setattr(
        xr.backends.plugins,
        "list_engines",
        partial(custom_list_engines, ScipyEntrypoint()),
    )
```

(I didn't verify this actually works, and whether there's a better way depends on the test, I guess)

Monkeypatching is generally something to avoid, but in this case I don't think we can, unless we want to refactor the plugin machinery.
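A runnable sketch of the pattern keewis describes, using a stand-in "plugins" module rather than xarray's real xr.backends.plugins (the entrypoint values here are placeholder strings, not real BackendEntrypoint classes):

```python
# Stand-in for xr.backends.plugins; values are placeholders, not real classes.
import types

plugins = types.SimpleNamespace(
    list_engines=lambda: {"scipy": "ScipyEntrypoint", "netcdf4": "NetCDF4Entrypoint"}
)


def restrict_to(engine_name):
    """Build a replacement list_engines() that exposes a single engine."""
    full = plugins.list_engines()
    return lambda: {engine_name: full[engine_name]}


# Inside a test, monkeypatch.setattr would perform this swap (and undo it
# automatically at teardown):
original = plugins.list_engines
plugins.list_engines = restrict_to("scipy")
assert list(plugins.list_engines()) == ["scipy"]
plugins.list_engines = original
```

The point of the swap is that engine discovery inside the code under test sees only the one backend, so it cannot silently fall back to another installed engine.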

@headtr1ck (Collaborator) left a comment:

Can we get some return types on these methods? Thanks!

@dcherian (Contributor, Author)

@andersy005 are you able to add some typing here?

@headtr1ck (Collaborator)

> @andersy005 are you able to add some typing here?

Shouldn't be a blocker. It's already better than before! But nice to have anyway :)

@andersy005 (Member)

> @andersy005 are you able to add some typing here?

Absolutely! My time is quite limited today, but I intend to prioritize working on this tomorrow.

@dcherian (Contributor, Author)

Could use some help with the typing here

cc @headtr1ck @andersy005 @Illviljan

@@ -824,14 +822,14 @@ def test_create_mask_outer_indexer() -> None:
def test_create_mask_vectorized_indexer() -> None:
indexer = indexing.VectorizedIndexer((np.array([0, -1, 2]), np.array([0, 1, -1])))
expected = np.array([False, True, True])
actual = indexing.create_mask(indexer, (5,))
actual = indexing.create_mask(indexer, (5, 5))
@dcherian (Contributor, Author):

I think this is right. The indexer is a 2-element tuple, so the array's shape should be 2D.

np.testing.assert_array_equal(expected, actual)

indexer = indexing.VectorizedIndexer(
(np.array([0, -1, 2]), slice(None), np.array([0, 1, -1]))
)
expected = np.array([[False, True, True]] * 2).T
actual = indexing.create_mask(indexer, (5, 2))
actual = indexing.create_mask(indexer, (5, 2, 5))
@dcherian (Contributor, Author):

Similarly, here the indexer is a 3-element tuple, so the shape should be 3D.
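A plain-NumPy sketch of the invariant these tests assert: a VectorizedIndexer with N index arrays addresses an N-dimensional array, and the mask flags positions where any indexer uses the -1 "missing" sentinel (this is a stand-in, not xarray's actual create_mask):

```python
import numpy as np

# Same indexers as the first test case above.
indexers = (np.array([0, -1, 2]), np.array([0, 1, -1]))

# The mask has the broadcast shape of the index arrays, and is True
# wherever any indexer holds the -1 sentinel.
mask = np.zeros(np.broadcast(*indexers).shape, dtype=bool)
for idx in indexers:
    mask |= idx == -1
# mask is [False, True, True], matching `expected` in the test above
```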

@dcherian dcherian requested review from TomNicholas and andersy005 and removed request for TomNicholas November 19, 2024 04:57
@@ -695,141 +837,6 @@ def test_roundtrip_boolean_dtype(self) -> None:
assert_identical(original, actual2)
assert actual2["x"].dtype == "bool"

def test_orthogonal_indexing(self) -> None:
@dcherian (Contributor, Author):

Simply moved to a new mixin class.

pass


# @indexing_tests(indexing_support=IndexingSupport.OUTER)
@dcherian (Contributor, Author):

Will need a different _getitem implementation for these.

return self[key] # type: ignore [index]


class NewBackendArray(NdimSizeLenMixin, indexing.ExplicitlyIndexed):
@dcherian (Contributor, Author):

I don't think it's possible to do this without creating a new BackendArray class for backends.
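A hypothetical sketch of what the new-style class embodies, following the .oindex/.vindex property pattern shown in the diff below; IndexCallable here is a tiny stand-in for xarray's helper, and the getter bodies are illustrative only:

```python
import numpy as np


class IndexCallable:
    """Stand-in: wraps a getter so that obj.oindex[key] calls getter(key)."""

    def __init__(self, getter):
        self.getter = getter

    def __getitem__(self, key):
        return self.getter(key)


class SketchBackendArray:
    """Illustrative backend array with explicit indexing modes."""

    def __init__(self, data):
        self.data = np.asarray(data)

    @property
    def oindex(self):
        return IndexCallable(self._oindex_get)

    @property
    def vindex(self):
        return IndexCallable(self._vindex_get)

    def _oindex_get(self, key):
        # Outer (orthogonal) indexing: cross product of 1-D integer arrays.
        return self.data[np.ix_(*key)]

    def _vindex_get(self, key):
        # Vectorized indexing: index arrays are paired elementwise.
        return self.data[key]
```

The caller picks the indexing mode by choosing the property, instead of wrapping the key in an ExplicitIndexer subclass.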

@@ -351,7 +353,147 @@ def test_dtype_coercion_error(self) -> None:
ds.to_netcdf(path, format=format)


class DatasetIOBase:
class BackendIndexingTestsMixin:
@dcherian (Contributor, Author):

Simply moved out from DatasetIOBase. I can make a separate PR for this.

@@ -6489,3 +6496,83 @@ def test_zarr_safe_chunk_region(tmp_path):
chunk = ds.isel(region)
chunk = chunk.chunk()
chunk.chunk().to_zarr(store, region=region)


class LegacyBackendArrayWrapper(LegacyBackendArray):
@dcherian (Contributor, Author):

This is what cfgrib's backend array looks like.

Comment on lines +275 to +281
@property
def oindex(self) -> indexing.IndexCallable:
return indexing.IndexCallable(self._oindex_get)

@property
def vindex(self) -> indexing.IndexCallable:
return indexing.IndexCallable(self._vindex_get)
@dcherian (Contributor, Author):

As a reminder, the goal is to use .oindex and .vindex to be explicit about which indexing mode to use, instead of relying on IndexingAdapter subclasses.

I think this gets us nearly all the way to removing the special case for ExplicitlyIndexed in as_compatible_data.
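The semantic difference that .oindex and .vindex make explicit, shown with plain NumPy (xarray's adapters provide these modes for arbitrary backend arrays):

```python
import numpy as np

arr = np.arange(25).reshape(5, 5)
rows, cols = np.array([0, 2]), np.array([1, 3])

# Vectorized ("fancy") indexing pairs the indexers elementwise -> shape (2,)
vectorized = arr[rows, cols]

# Outer (orthogonal) indexing takes the cross product -> shape (2, 2)
outer = arr[np.ix_(rows, cols)]
```

Spelling the mode out at the call site removes the need to infer it from the type of an ExplicitIndexer wrapper.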

return indexing.explicit_indexing_adapter(
key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR, self._getitem
def _oindex_get(self, key: _OuterIndexerKey) -> Any:
return indexing.outer_indexing_adapter(
@dcherian (Contributor, Author) commented Dec 4, 2024:

Now we have three different adapters. This kind of dispatching was happening inside explicit_indexing_adapter using the IndexingAdapter subclasses; now it is more explicit.

@@ -1014,35 +1138,60 @@ def explicit_indexing_adapter(
-------
Indexing result, in the form of a duck numpy-array.
"""

# If the array is not an ExplicitlyIndexedNDArrayMixin,
@dcherian (Contributor, Author):

Let's uncomment this in a future PR once we are sure nothing is broken.

Comment on lines +98 to 103
_BasicIndexerKey = tuple[int | np.integer | slice, ...]
_OuterIndexerKey = tuple[
int | np.integer | slice | np.ndarray[Any, np.dtype[np.integer]], ...
]
_VectorizedIndexerKey = tuple[slice | np.ndarray[Any, np.dtype[np.integer]], ...]

@dcherian (Contributor, Author) commented Dec 4, 2024:

Not sure these types are fully accurate.
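For concreteness, example keys that each alias from the diff above is meant to cover (a sketch; whether the aliases are fully accurate is exactly the open question here):

```python
import numpy as np

# _BasicIndexerKey: ints and slices only
basic_key = (0, slice(None))

# _OuterIndexerKey: additionally allows 1-D integer arrays
outer_key = (np.array([0, 2]), slice(1, 4), 3)

# _VectorizedIndexerKey: slices and integer arrays only (no bare ints)
vectorized_key = (np.array([0, 1]), np.array([2, 3]))
```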

* main: (79 commits)
  fix mean for datetime-like using the respective time resolution unit (#9977)
  Add `time_unit` argument to `CFTimeIndex.to_datetimeindex` (#9965)
  remove gate and add a test (#9958)
  Remove repetitive that (replace it with the) (#9994)
  add shxarray to the xarray ecosystem list (#9995)
  Add `shards` to `valid_encodings` to enable sharded Zarr writing (#9948)
  Use flox for grouped first, last (#9986)
  Bump the actions group with 2 updates (#9989)
  Fix some typing (#9988)
  Remove unnecessary a article (#9980)
  Fix test_doc_example on big-endian systems (#9949)
  fix weighted polyfit for arrays with more than 2 dimensions (#9974)
  Use zarr-fixture to prevent thread leakage errors (#9967)
  remove dask-expr from CI runs, fix related tests (#9971)
  Update time coding tests to assert exact equality (#9961)
  cast type to PDDatetimeUnitOptions (#9963)
  Suggest the correct name when no key matches in the dataset (#9943)
  fix upstream dev issues (#9953)
  Relax nanosecond datetime restriction in CF time decoding (#9618)
  Remove outdated quantile test. (#9945)
  ...
@dcherian dcherian changed the title Pass around tuples instead of ExplicitIndexer objects. Using oindex and vindex instead of ExplicitIndexer objects. Jan 30, 2025
@dcherian dcherian requested a review from shoyer January 30, 2025 05:18
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

'ScipyArrayWrapper' object has no attribute 'oindex'
6 participants