Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up isel and __getitem__ #3375

Merged
merged 12 commits into from
Oct 9, 2019
Merged

Speed up isel and __getitem__ #3375

merged 12 commits into from
Oct 9, 2019

Conversation

crusaderky
Copy link
Contributor

@crusaderky crusaderky commented Oct 6, 2019

First iterative improvement for #2799.

Speed up Dataset.isel up to 33% and DataArray.isel up to 25% (when there are no indices and the numpy array is small). 15% speedup when there are indices.

Benchmarks can be found in #2799.

@crusaderky crusaderky self-assigned this Oct 6, 2019
@crusaderky crusaderky requested a review from shoyer October 6, 2019 21:27
@jhamman
Copy link
Member

jhamman commented Oct 7, 2019

Thanks @crusaderky. Do you think the indexing benchmarks we have in https://github.com/pydata/xarray/blob/master/asv_bench/benchmarks/indexing.py are sufficient? Anything you think would be worth adding to cover performance regressions here?

@crusaderky
Copy link
Contributor Author

crusaderky commented Oct 7, 2019

@jhamman hm. I'm looking at it now for the first time.
On first sight, it's a good start, but it's missing some important use cases:

  • DataArray slicing
  • slicing when there are no IndexVariables
  • no benchmarks for .sel()

@crusaderky
Copy link
Contributor Author

crusaderky commented Oct 7, 2019

I see that all tests in benchmarks/indexing.py use arrays with 2~6 million points.
While this is important to spot any case where the numpy underlying functions start being unnecessarily called more than once, it also means any performance improvement or degradation in any of the pure-Python code will be completely drowned out.

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks a lot @crusaderky .
I don't know this code 100% so I may have missed bits, but reading through the code I think I can follow it all, and looks more efficient + better code now.
And I'm learning more mypy!
I added one comment re a missing keyword. Seems odd that tests pass.

doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/core/variable.py Show resolved Hide resolved
@jhamman
Copy link
Member

jhamman commented Oct 7, 2019

On first sight, it's a good start, but it's missing some important use cases:

  • DataArray slicing
  • slicing when there are no IndexVariables
  • no benchmarks for .sel()

Would you be willing to add a few of these cases to the benchmarks?

@jhamman jhamman mentioned this pull request Oct 7, 2019
9 tasks
Co-Authored-By: Maximilian Roos <[email protected]>
@crusaderky
Copy link
Contributor Author

Would you be willing to add a few of these cases to the benchmarks?

Yes, in due time. It's out of scope for this PR though.

xarray/core/dataset.py Show resolved Hide resolved
xarray/core/variable.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
@jhamman jhamman merged commit 3f0049f into pydata:master Oct 9, 2019
@crusaderky crusaderky deleted the isel branch October 10, 2019 09:21
dcherian added a commit that referenced this pull request Oct 22, 2019
commit b0c336f
Author: Maximilian Roos <[email protected]>
Date:   Mon Oct 21 04:52:36 2019 -0400

    Whatsnew for #3419 (#3422)

    * pyupgrade --py36-plus

    * Update xarray/core/nputils.py

    Co-Authored-By: crusaderky <[email protected]>

    * Update xarray/core/parallel.py

    Co-Authored-By: crusaderky <[email protected]>

    * Update xarray/tests/test_cftime_offsets.py

    Co-Authored-By: crusaderky <[email protected]>

    * Update xarray/tests/test_cftime_offsets.py

    Co-Authored-By: crusaderky <[email protected]>

    * whatsnew

commit 2984415
Author: Rhys Doyle <[email protected]>
Date:   Mon Oct 21 01:17:47 2019 +0100

    Revert changes made in #3358 (#3411)

    * Revert #3358

    * Revision

    * Code review

    * Merge from master

    * Obsolescence note

commit 3c462b9
Author: Maximilian Roos <[email protected]>
Date:   Sun Oct 20 20:16:58 2019 -0400

    Python3.6 idioms (#3419)

    * pyupgrade --py36-plus

    * Update xarray/core/nputils.py

    Co-Authored-By: crusaderky <[email protected]>

    * Update xarray/core/parallel.py

    Co-Authored-By: crusaderky <[email protected]>

    * Update xarray/tests/test_cftime_offsets.py

    Co-Authored-By: crusaderky <[email protected]>

    * Update xarray/tests/test_cftime_offsets.py

    Co-Authored-By: crusaderky <[email protected]>

commit 9886e3c
Author: crusaderky <[email protected]>
Date:   Sun Oct 20 23:42:36 2019 +0100

    Temporarily mark pseudonetcdf-3.1 as incompatible (#3420)

commit 0f7ab0e
Author: Dan Nowacki <[email protected]>
Date:   Thu Oct 17 14:13:44 2019 -0700

    Fix and add test for groupby_bins() isnan TypeError. (#3405)

    * Fix and add test for groupby_bins() isnan TypeError.

    * Better testing

    * black

commit 6cd50cc
Author: pmallas <[email protected]>
Date:   Thu Oct 17 10:15:47 2019 -0400

    Update where docstring to make return value type more clear (#3408)

    * Update where docstring to make return value type more clear

    The where docstring was a little confusing to me.  I misunderstood "Same type as caller' to mean the values in the xarray not the xarray itself.  I think this small change will clean it up for most users.  Thanks.

    * Update xarray/core/common.py

    Co-Authored-By: Maximilian Roos <[email protected]>

commit 55b1ac0
Author: keewis <[email protected]>
Date:   Thu Oct 17 05:13:39 2019 +0200

    tests for arrays with units (#3238)

    * create the empty test file

    * add tests for data array aggregation functions

    * include pint in the ci

    * ignore missing type annotations for pint

    * really skip the tests if pint is not available

    * remove the reason from the importorskip call

    * test that the dataarray constructor does not strip the unit

    * convert every unit stripped warning to an error

    * work around pint not implementing np.allclose yet

    * remove the now unnecessary filterwarnings decorator

    * xfail all tests that depend on pint having a __array_function__

    * treat nans as equal

    * implement tests for simple arithmetic operations

    * use param's id argument to assign readable names

    * add tests for sel() and isel()

    * add more readable names for the unary arithmetics

    * xfail every test that is not yet xfailing

    These don't pass because the constructor raises a unit stripped
    warning - fixed in pint#764.

    * only xfail if pint is not the current development version

    This is test is not really reliable, but sufficient for now.

    * always use lists instead of tuples for indexing

    * add tests for loc and squeeze

    * black

    * add names and xfail marks to the parameters

    * add tests for interp and interp_like

    * implement tests for reindex

    * remove the xfail marks where it is not clear yet that pint is to blame

    * add tests for reindex_like

    * don't pass the new DataArray to a kwarg

    * xfail if not on pint dev

    * refactor the tests

    * add tests for univariate and bivariate ufuncs

    * black

    * xfail aggregation only if pint does not implement __array_function__ yet

    * remove the global filterwarnings mark

    apparently, this caused the tests to change behavior, resulting in
    different errors, or causing tests to pass that should actually fail.

    * add a test case for the repr

    * create a pytest mark that explicitly requires pint's __array_function__

    * also check the string representation in addition to the repr

    * add helpers for creating method tests

    * check that simple aggregation methods work

    * use format() instead of format strings

    * make sure the repr of method calls is different from functions

    * merge the two aggregation tests

    * explicitly check whether pint supports __array_function__

    relying on versions is somewhat fragile.

    * provide a fallback for the new base quantity

    * check that no warning is raised for both with and without coords

    * also check that the repr works both with and without coords

    * wrap all aggregation function calls

    * xfail every call that fails because of something outside xarray

    * xfail tests related to dimension coordinates and indexes

    * use the dimensions from the original array

    * allow passing arguments to the method on call

    * add tests for comparisons

    * add tests for detecting, filling and dropping missing values

    * mark the missing value tests as requiring pint to support duck arrays

    * add tests for isin, where and interpolate_na

    * reformat unit ids and add a test parameter for compatible units

    * remove an unnecessary xfail

    * add tests for the top-level replication functions (*_like)

    * check for whatever pint does with *_like functions

    * add tests for combine_first

    * xfail the bivariate ufunc tests

    * xfail the call to np.median

    * move the top-level function tests out of the DataArray namespace class

    * add cumsum and cumprod to the list of aggregation functions

    * add tests for the numpy methods

    * check for equal units directly after checking the magnitude

    * add tests for content manipulation methods

    * add tests for comparing DataArrays (equals, indentical)

    * add a test for broadcast_equals

    * refactor the comparison operation tests

    * rewrite the strip, attach and assert_equal functions and add extract

    * preserve multiindex in strip and attach

    * attach the unit from element "data" as fallback

    * fix some small typos

    * compare QuantityScalar and QuantitySequence based on their values

    * make the isel test more robust

    * add tests for reshaping and reordering

    * unify the structure of the tests

    * mark the remaining tests as requiring a recent pint version, too

    * explicitly handle quantities as parameters

    * change the repr of the function / method wrappers

    * check whether __init__ and repr / str handle units in data and coords

    * generalize array_attach_units

    * move the redefinition of DimensionalityError

    * identify quantities using isinstance

    * typo

    * skip tests with a pint version without __array_function__

    * compare DataArrays where possible

    * mark only the compatible unit as xfailing

    * preserve the name of data arrays

    * also attach units to x_mm

    * Test in more CI environments; documentation

    * What's New

    * remove a stale function

    * use Quantity directly for instance tests

    * explicitly set roll_coords to silence a deprecation warning

    * skip the whole module if pint does not implement __array_function__

    the advantage is that now forgetting to decorate a test case is not possible.

    * allow to attach units using the mapping from extract_units

    * add tests for computation methods

    resampling fails until I figure out how to use it with non-datetime coords.

    * add tests for grouped operations

    * add a test for rolling_exp

    * add a todo note for the module level skip on __array_function__

    * add a test for dot

    * use attach_units instead of manually attaching

    * modify the resample test to actually work

    * add a test for to_unstacked_dataset

    * update whats-new.rst and installing.rst

    * reformat the whats-new.rst entry

    * What's New

commit 1f81338
Author: keewis <[email protected]>
Date:   Wed Oct 16 20:54:27 2019 +0200

    Fixes to the resample docs (#3400)

    * add a missing newline to make sphinx detect the code block

    * update the link to the pandas documentation

    * explicitly state that this only works with datetime dimensions

    * also put the datetime dim requirement into the function description

    * add Series.resample and DataFrame.resample as reference

    * add the changes to whats-new.rst

    * move references to the bottom of the docstring

commit 3f9069b
Author: Joseph Hamman <[email protected]>
Date:   Mon Oct 14 14:22:08 2019 -0700

    Revert to dev version

commit 62943e2
Author: Joseph Hamman <[email protected]>
Date:   Mon Oct 14 12:51:46 2019 -0700

    Release v0.14.0

commit 30472ec
Author: Joseph Hamman <[email protected]>
Date:   Mon Oct 14 12:24:05 2019 -0700

    updates for 0.14 release [black only]

commit 4519843
Author: Joseph Hamman <[email protected]>
Date:   Mon Oct 14 12:18:54 2019 -0700

    updates for 0.14 release

commit 4f5ca73
Author: Deepak Cherian <[email protected]>
Date:   Mon Oct 14 18:06:53 2019 +0000

    Make concat more forgiving with variables that are being merged. (#3364)

    * Make concat more forgiving with variables that are being merged.

    * rename test.

    * simplify test.

    * make diff smaller.

commit ae1d8c7
Author: Crypto Jerônimo <[email protected]>
Date:   Sun Oct 13 15:38:36 2019 +0100

    Fix documentation typos (#3396)

commit 863e490
Author: Joe Hamman <[email protected]>
Date:   Sat Oct 12 17:33:33 2019 -0400

    OrderedDict --> dict, some python3.5 cleanup too (#3389)

    * OrderedDict --> dict, some python3.5 cleanup too

    * respond to part of @shoyer's review

    * fix set attr syntax on netcdf4 vars

    * fix typing errors

    * update whats new and todo comments

    * Typing annotations

    * Typing annotations

    * Fix regression

    * More substantial changes

    * More polish

    * Typing annotations

    * Rerun notebooks

commit 6851e3e
Author: crusaderky <[email protected]>
Date:   Sat Oct 12 21:05:32 2019 +0100

    Annotate LRUCache (#3395)

commit 4c05d38
Author: Stephan Hoyer <[email protected]>
Date:   Fri Oct 11 08:47:57 2019 -0700

    BUG: overrides to a dimension coordinate do not get aligned (#3393)

    Fixes GH3377

commit 3f29551
Author: Deepak Cherian <[email protected]>
Date:   Thu Oct 10 23:44:19 2019 +0000

    map_blocks (#3276)

    * map_block attempt 2

    * Address reviews: errors, args + kwargs support.

    * Works with datasets!

    * remove wrong comment.

    * Support chunks.

    * infer template.

    * cleanup

    * cleanup2

    * api.rst

    * simple shape change error check.

    * Make test more complicated.

    * Fix for when user function doesn't set DataArray.name

    * Now _to_temp_dataset works.

    * Add whats-new

    * chunks kwarg makes no sense right now.

    * review feedback:

    1. skip index graph nodes.
    2. var → name
    3. quicker dataarray creation.
    4. Add restrictions to docstring.
    5. rename chunk construction task.
    6. error when non-xarray object is returned.
    7. restore non-coord dims.

    review

    * Support nondim coords in make_meta.

    * Add Dataset.unify_chunks

    * doc updates.

    * minor.

    * update comment.

    * More complicated test dataset. Tests fail :X

    * Don't know why compute is needed.

    * work with DataArray nondim coords.

    * fastpath unify_chunks

    * comment.

    * much improved tests.

    * Change args, kwargs syntax.

    * Add dataset, dataarray methods.

    * api.rst

    * docstrings.

    * Fix unify_chunks.

    * Move assert_chunks_equal to xarray.testing.

    * minor changes.

    * Better error handling when inferring returned object

    * wip

    * wip

    * better to_array

    * Docstrings + nicer error message.

    * remove unify_chunks in map_blocks + better tests.

    * typing for unify_chunks

    * address more review comments.

    * more unify_chunks tests.

    * Just use dask.core.utils.meta_from_array

    * get tests working. assert_equal needs a lot of fixing.

    * more unify_chunks test.

    * assert_chunks_equal fixes.

    * copy over meta_from_array.

    * minor fixes.

    * raise chunks error earlier and test for map_blocks raising chunk error

    * fix.

    * Type annotations

    * py35 compat

    * make sure unify_chunks does not compute.

    * Make tests functional by call compute before assert_equal

    * Update whats-new

    * Work with attributes.

    * Support attrs and name changes.

    * more assert_equal

    * test changing coord attribute

    * fix whats new

    * rework tests to use fixtures (kind of)

    * more review changes.

    * cleanup

    * more review feedback.

    * fix unify_chunks.

    * read dask_array_compat :)

    * Dask 1.2.0 compat.

    * documentation polish

    * make_meta reflow

    * cosmetic

    * polish

    * Fix tests

    * isort

    * isort

    * Add func call to docstrings.

commit 291cb80
Author: Deepak Cherian <[email protected]>
Date:   Thu Oct 10 18:23:20 2019 +0000

    Add groupby.dims & Fix groupby reduce for DataArray (#3338)

    * Fix groupby reduce for DataArray

    * bugfix.

    * another bugfix.

    * bugfix unique_and_monotonic for object indexes (uniqueness is enough)

    * Add groupby.dims property.

    * update reduce docstring to point to xarray.ALL_DIMS

    * test for object index dims.

    * test reduce dimensions error.

    * Add whats-new

    * fix docs build

    * sq whats-new

    * one more test.

    * fix test.

    * undo monotonic change.

    * Add dimensions to repr.

    * Raise error if no bins.

    * Raise nice error if no groups were formed.

    * Some more error raising and testing.

    * Add dataset tests.

    * update whats-new.

    * fix tests.

    * make dims a cached lazy property.

    * fix whats-new.

    * whitespace

    * fix whats-new

commit 3f0049f
Author: crusaderky <[email protected]>
Date:   Wed Oct 9 19:01:29 2019 +0100

    Speed up isel and __getitem__ (#3375)

    * Variable.isel cleanup/speedup

    * Dataset.isel code cleanup

    * Speed up isel

    * What's New

    * Better error checks

    * Speedup

    * type annotations

    * Update doc/whats-new.rst

    Co-Authored-By: Maximilian Roos <[email protected]>

    * What's New

    * What's New

    * Always shallow-copy variables

commit 132733a
Author: Deepak Cherian <[email protected]>
Date:   Tue Oct 8 22:13:47 2019 +0000

    Fix concat bug when concatenating unlabeled dimensions. (#3362)

    * Fix concat bug when concatenating unlabeled dimensions.

    * Add whats-new

    * Add back older test.

    * fix test

    * Revert "fix test"

    This reverts commit c33ca34.

    * better fix

commit 6fb272c
Author: crusaderky <[email protected]>
Date:   Tue Oct 8 22:23:46 2019 +0100

    Rolling minimum dependency versions policy (#3358)

    * - Downgrade numpy to 1.14, pandas to 0.20, scipy to 0.19 (24 months old)
    - Downgrade dask to 1.1 (6 months old)
    - Don't pin patch versions

    * Apply rolling policy (see #3222)

    * Automated tool to verify the minimum versions

    * Drop Python 3.5

    * lint

    * Trivial cosmetic

    * Cosmetic

    * (temp) debug CI failure

    * Parallelize versions check script

    * Remove hacks for legacy dask

    * Documentation

    * Assorted cleanup

    * Assorted cleanup

    * Fix regression

    * Cleanup

    * type annotations upgraded to Python 3.6

    * count_not_none backport

    * pd.Index.equals on legacy pandas returned False when comparing vs. a ndarray

    * Documentation

    * pathlib cleanup

    * Slide deprecations from 0.14 to 0.15

    * More cleanups

    * More cleanups

    * Fix min_deps_check

    * Fix min_deps_check

    * Set policy of 12 months for pandas and scipy

    * Cleanup

    * Cleanup

    * Sphinx fix

    * Overhaul readthedocs environment

    * Fix test crash

    * Fix test crash

    * Prune readthedocs environment

    * Cleanup

    * Hack around versioneer bug on readthedocs CI

    * Code review

    * Prevent random timeouts in the readthedocs CI

    * What's New polish

    * Merge from Master

    * Trivial cosmetic

    * Reimplement pandas.core.common.count_not_none

commit 3e2a754
Author: Alan D. Snow <[email protected]>
Date:   Tue Oct 8 09:36:52 2019 -0500

    added geocube and rioxarray to related projects (#3383)

commit 4254b4a
Author: crusaderky <[email protected]>
Date:   Fri Oct 4 23:17:56 2019 +0100

    Lint (#3373)

    * raise exception instance, not class

    * isort

    * isort

    * Bump mypy version

commit 283b4fe
Author: Deepak Cherian <[email protected]>
Date:   Fri Oct 4 17:04:36 2019 +0000

    Docs/more fixes (#2934)

    * Move netcdf to beginning of io.rst

    * Better indexing example.

    * Start de-emphasizing pandas

    * misc.

    * compute, load, persist docstrings + text.

    * split-apply-combine.

    * np.newaxis.

    * misc.

    * some dask stuff.

    * Little more dask.

    * undo index.rst changes.

    * link to dask docs on chunks

    * Fix io.rst.

    * small changes.

    * rollingupdate.

    * joe's review

commit dfdeef7
Author: Stephan Hoyer <[email protected]>
Date:   Thu Oct 3 21:42:50 2019 -0700

    Explicitly keep track of indexes with merging (#3234)

    * Explicitly keep track of indexes in merge.py

    * Typing fixes

    * More tying fixes

    * more typing fixes

    * fixup

commit 86fb71d
Author: Deepak Cherian <[email protected]>
Date:   Thu Oct 3 15:41:50 2019 +0000

    groupby repr (#3344)

    * groupby repr

    * add test.

    * test datetime and nondim coord

    * rename test_da → repr_da

    * Add whats-new

    * Update doc/whats-new.rst

commit dd2b803
Author: Ryan May <[email protected]>
Date:   Wed Oct 2 15:43:44 2019 -0600

    Remove setting of universal wheels (#3367)

    Universal wheels indicate that one wheel supports Python 2 and 3. This
    is no longer the case for xarray. This causes builds to generate files
    with names like xarray-0.13.0-py2.py3-none-any.whl, which can cause
    pip to incorrectly install the wheel when installing from a list of
    wheel files.

commit 21705e6
Author: crusaderky <[email protected]>
Date:   Tue Oct 1 19:13:55 2019 +0100

    Revisit # noqa annotations (#3359)

commit fb575eb
Author: crusaderky <[email protected]>
Date:   Tue Oct 1 15:11:21 2019 +0100

    Fix codecov.io upload on Windows (#3360)

commit 1ab2279
Author: Deepak Cherian <[email protected]>
Date:   Mon Sep 30 21:12:22 2019 +0000

    Add how do I ... section (#3357)

    * Add how do I ... section

    * Bbugfix.

    * Update doc/howdoi.rst

    Co-Authored-By: Maximilian Roos <[email protected]>

    * Update doc/howdoi.rst

    Co-Authored-By: Maximilian Roos <[email protected]>

    * small updates.

    * Add more.

commit bd1069b
Author: Gregory Gundersen <[email protected]>
Date:   Sun Sep 29 19:39:53 2019 -0400

    Add glossary to documentation (#3352)

    * First draft at terminology glossary.

    * Made name matching rules more explicit and hopefully clearer.

    * Amended what's new.

    * Changes based on feedback.

    * More changed based on feedback.

commit b51683f
Author: Anderson Banihirwe <[email protected]>
Date:   Sun Sep 29 07:50:21 2019 -0600

    Documentation improvements (#3328)

    * Add examples for full_like, zeros_like, ones_like

    * Add examples for xr.align

    * Add examples for xr.merge

    * Update xr.where docstring

    * Update xr.dot docstring

    * Update xarray/core/common.py

    Co-Authored-By: Deepak Cherian <[email protected]>

    * Update xarray/core/common.py

    Co-Authored-By: Deepak Cherian <[email protected]>

    * Update xr.combine_by_coords docstring

    * Apply black formatting only

    * More black formatting

    * Remove unnecessary pandas bits

    * Fix indentation issues

    * Update assign and pipe

    * Update `Dataset.reindex` with examples

    * Update `Dataset.fillna` with examples

    * Address styling issues

    * Update docstring

    Co-Authored-By: Deepak Cherian <[email protected]>

commit f3c7da6
Author: Gregory Gundersen <[email protected]>
Date:   Sat Sep 28 15:57:36 2019 -0400

    Remove `complex.nc` from built docs (#3353)

    * Rolling back to prevent a different issue from leaking into this one.

    * Amended what's new.

commit 6ece6a1
Author: Tony Tung <[email protected]>
Date:   Thu Sep 26 22:45:26 2019 -0700

    Fix DataArray.to_netcdf type annotation (#3325)

    It calls DataSet.to_netcdf, which returns Union[bytes, "Delayed", None].  So this should as well.

commit 16fdac9
Author: crusaderky <[email protected]>
Date:   Thu Sep 26 10:38:46 2019 +0100

    CI test suites with pinned minimum dependencies (#3346)

    * CI test suites with pinned minimum dependencies

    * code review

    * Clarity re lxml

commit ea101f5
Author: Tom Nicholas <[email protected]>
Date:   Thu Sep 26 10:51:58 2019 +0200

    Bugfix/plot accept coord dim (#3345)

    Bug in plot.line fixed by ensuring 1D coords are cast down to their associated dims. There was previously two particular cases where this would not happen.

commit 85c9a40
Author: crusaderky <[email protected]>
Date:   Wed Sep 25 02:40:54 2019 +0100

    CI environments overhaul (#3340)

    * Rationalize and align CI environments. Add many optional dependencies to individual CI suites.

    * pynio and cdms2 are not available on Windows

    * cfgrib causes Python interpreter crash on Windows

    * dtype of np.arange defaults to int64 on Linux and int32 on Windows

    * Suppress failure to delete file on Windows

    * Mark hypotesis tests as @slow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants