
Implement .blocks accessor #3689

Merged
mrocklin merged 6 commits into dask:master from array-block-accessor on Jul 1, 2018

Conversation

mrocklin
Member

@mrocklin mrocklin commented Jun 29, 2018

>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.blocks[[-1, 0]].compute()
array([8, 9, 0, 1])

Fixes #3684
Fixes #3274

  • Tests added / passed
  • Passes flake8 dask

cc @stuartarchibald and @jakirkham and @shoyer

@jakirkham
Member

Thanks Matt. This seems like a good idea. Will try to review later.

Would it be possible to have iterator support? For example, I'm imagining a usage pattern like the following.

import dask.array as da
x = da.arange(10, chunks=2)
r = [e.sum() for e in x.blocks]

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@martindurant
Member

@jakirkham , how would you iterate in the case that there is more than one dimension?

from .slicing import normalize_index
if not isinstance(key, tuple):
    key = (key,)
key = tuple([k] if isinstance(k, Number) else k for k in key)
Member

We should consider using slice(k, k+1) here instead of [k]. I'm not entirely sure which is better, yet, but these can differ due to weird edge cases of fancy indexing.

Member

OK, let's definitely switch to slice(k, k+1) here.

Consider the case of indexing a 3D array like array.blocks[0, :, 0] with chunks=1. Now compare what the result would look like:

>>> import numpy as np
>>> x = np.zeros((5, 6, 7))
>>> x[:1, :, :1].shape  # seems reasonable
(1, 6, 1)
>>> x[[0], :, [0]].shape  # where did the last dimension go?
(1, 6)

I could go on to more examples of strange behavior, but basically we want to stay as close to "basic indexing" as possible, rather than inadvertently triggering "advanced indexing".

It would be good to note this explicitly in the docs, though (that integers get converted into slices), so users know how to get vectorized indexing if/when desired by replacing integers with lists/arrays.
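For illustration, a minimal hedged example of the dimension-preserving behavior described above (it relies on the integer-to-slice conversion in this PR; the array and chunk sizes are arbitrary):

```python
import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))   # a 2 x 2 grid of blocks
print(x.blocks[0].shape)             # (2, 6): an integer block index keeps the dimension
print(x.blocks[0, 1].shape)          # (2, 3): a single block, still 2-D
```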

@jakirkham
Member

What would this return? A list of Dask Arrays?

import dask.array as da
x = da.arange(10, chunks=2)
x.blocks[[0, 1]]

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@jakirkham
Member

What sort of performance do you see for the following?

import dask.array as da

x = da.arange(10, chunks=2)
[x.blocks[i] for i in range(x.numblocks[0])]

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@jakirkham
Member

Honestly, it's more advanced than what I initially had in mind, though I expect this to be incredibly useful.

@jakirkham
Member

how would you iterate in the case that there is more than one dimension?

Fair point, @martindurant. Certainly tricky.

Would probably turn to NumPy for ideas. Admittedly our use cases are not exactly the same, so some options would be dropped (writing, buffering details, etc.), but the rough outline seems good initially. My current thinking is that this is probably best as a totally separate thing from blocks (maybe iter_blocks?).
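For what it's worth, a minimal sketch of one way to walk every block in the multi-dimensional case, using only the .blocks accessor from this PR (iter_blocks itself is hypothetical and not implemented here; the array is arbitrary):

```python
import itertools
import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))   # a 2 x 2 grid of blocks
# iterate over every block index and reduce each block separately
sums = [x.blocks[idx].sum() for idx in itertools.product(*map(range, x.numblocks))]
```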

@mrocklin
Member Author

Any further comments here? If not then I plan to merge tomorrow.

index = normalize_index(index, self.numblocks)
index = tuple(slice(k, k + 1) if isinstance(k, Number) else k
              for k in index)
name = 'getitem-' + tokenize(self, index)
Member

Would this clash with other __getitem__ calls? Should we consider a different name here?

Member Author

I don't think it matters much in this case, but I've renamed this to block

Numpy-style slicing but now rather than slice elements of the array you
slice along blocks so, for example, ``x.blocks[0, ::2]`` produces a new
dask array with every other block in the first row of blocks.

Member

Maybe add:

You can index blocks in any way that could index a numpy array of shape tuple(map(len, array.chunks)). Integer indices k are converted internally into a slice object k:k+1, to ensure that the array does not lose dimensions.

Member

Would simplify tuple(map(len, array.chunks)) to array.numblocks.

Member Author

Added something similar. I added a reference to array.numblocks, and spoke a bit more about how we don't change the dimensionality of the array, rather than about how we achieve that by converting integers to slices.
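As a quick, hedged illustration of the documented behavior discussed above (using the docstring's own example; the array shape is arbitrary):

```python
import dask.array as da

x = da.ones((4, 8), chunks=(2, 2))                # a 2 x 4 grid of blocks
y = x.blocks[0, ::2]                              # first row of blocks, every other block
print(y.shape)                                    # (2, 4)
print(x.numblocks == tuple(map(len, x.chunks)))   # True
```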

@shoyer
Member

shoyer commented Jun 29, 2018

OK, what happens if you actually try to do general vectorized indexing that changes dimensionality? e.g.,

x = da.ones((2, 2), chunks=1)
x.blocks[[0, 1], [0, 1]]

Does this work like numpy, which would return a diagonal array of blocks? Depending on the chunks, this might not be possible to represent as a single dask array.

If not, perhaps we should explicitly exclude "vectorized indexing" use cases.

@mrocklin
Member Author

We can probably make the blocks work this way. I'm finding that it's tricky to get the chunks right. Help would be welcome here on how to properly get new chunks from old chunks and the index. I suspect that there is some information I can place into a numpy array, apply the index, and then apply that information against the old chunks to get the new chunks.
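A minimal sketch of that idea for the simple per-dimension case (a hypothetical helper, not code from this PR), assuming the index has already been normalized to one entry per dimension:

```python
import numpy as np

def new_chunks(old_chunks, index):
    """Apply a per-dimension block index to the old chunk sizes."""
    # old_chunks: tuple of tuples of chunk sizes, e.g. ((2, 2, 2, 2, 2),)
    # index: one slice or list per dimension, e.g. (slice(0, 3),)
    return tuple(tuple(np.asarray(sizes)[ix].tolist())
                 for sizes, ix in zip(old_chunks, index))

print(new_chunks(((2, 2, 2, 2, 2),), (slice(0, 3),)))  # ((2, 2, 2),)
```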

@mrocklin
Member Author

Ah, realized that indexing like that won't work in general. It would assume that the chunk shapes in the leading dimensions are the same along that axis, which isn't true in general. We now error if there is more than one list in the input.
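A hedged illustration of that restriction (the exact error type and message depend on the final implementation):

```python
import dask.array as da

x = da.ones((2, 2), chunks=1)      # a 2 x 2 grid of single-element blocks
x.blocks[[0, 1], 0]                # a single list in the index is fine
# x.blocks[[0, 1], [0, 1]]         # more than one list is expected to raise an error
```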

@mrocklin mrocklin force-pushed the array-block-accessor branch from d3b76e7 to 408c5fc on June 30, 2018 11:44
assert_eq(x.blocks[0], x[:2])
assert_eq(x.blocks[-1], x[-2:])
assert_eq(x.blocks[:3], x[:6])
assert_eq(x.blocks[[0, 1, 2]], x[:6])
Member

Can we have a test where these are not sequential or otherwise nearby?

Member Author

done
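A hedged sketch of the kind of non-sequential test requested above (the test actually added in the PR may differ):

```python
import dask.array as da
from dask.array.utils import assert_eq

x = da.arange(10, chunks=2)
# blocks 3 and 0, deliberately out of order and not adjacent
assert_eq(x.blocks[[3, 0]], da.concatenate([x[6:8], x[:2]]))
```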

@jakirkham
Member

LGTM other than the small suggestion above.

@shoyer
Member

shoyer commented Jul 1, 2018

Looks good to me!

@mrocklin mrocklin merged commit 6feffaa into dask:master Jul 1, 2018
@mrocklin mrocklin deleted the array-block-accessor branch July 1, 2018 11:19
@mrocklin
Member Author

mrocklin commented Jul 1, 2018

Thanks for the review, all.

convexset added a commit to convexset/dask that referenced this pull request Jul 1, 2018
….com/convexset/dask into fix-tsqr-case-chunk-with-zero-height

* 'fix-tsqr-case-chunk-with-zero-height' of https://github.com/convexset/dask:
  fixed typo in documentation and improved clarity
  Implement .blocks accessor (dask#3689)
  Fix wrong names (dask#3695)
  Adds endpoint and retstep support for linspace (dask#3675)
  Add the @ operator to the delayed objects (dask#3691)
  Align auto chunks to provided chunks, rather than shape (dask#3679)
  Adds quotes to source pip install (dask#3678)
  Prefer end-tasks with low numbers of dependencies when ordering (dask#3588)
  Reimplement argtopk to release the GIL (dask#3610)
  Note `da.pad` can be used with `map_overlap` (dask#3672)
  Allow tasks back onto ordering stack if they have one dependency (dask#3652)
  Fix extra progressbar (dask#3669)
  Break apart uneven array-of-int slicing to separate chunks (dask#3648)
  fix for `dask.array.linalg.tsqr` fails tests (intermittently) with arrays of uncertain dimensions (dask#3662)