Implement .blocks accessor #3689
Conversation
```python
>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.blocks[[-1, 0]].compute()
array([8, 9, 0, 1])
```

Fixes dask#3684
Fixes dask#3274
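The semantics above can be sketched in plain numpy, without dask: indexing `.blocks` behaves like indexing the grid of chunks and concatenating the selected chunks. The `blocks` helper below is hypothetical, written only to illustrate the behavior under the assumption of uniform chunks of size 2.

```python
import numpy as np

x = np.arange(10)
chunk = 2
nblocks = x.size // chunk  # 5 blocks of 2 elements each

def blocks(index):
    """Hypothetical helper mimicking x.blocks[index]: apply the index
    to the block grid, then concatenate the selected chunks."""
    ids = np.atleast_1d(np.arange(nblocks)[index])  # selected block ids
    return np.concatenate([x[i * chunk:(i + 1) * chunk] for i in ids])

assert blocks(0).tolist() == [0, 1]
assert blocks(slice(None, 3)).tolist() == [0, 1, 2, 3, 4, 5]
assert blocks(slice(None, None, 2)).tolist() == [0, 1, 4, 5, 8, 9]
assert blocks([-1, 0]).tolist() == [8, 9, 0, 1]
```

Each example matches the corresponding `.compute()` result shown in the PR description.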
Thanks Matt. This seems like a good idea. Will try to review later. Would it be possible to have iterator support? For example, I am imagining a usage pattern like the following:

```python
import dask.array as da

x = da.arange(10, chunks=2)
r = [e.sum() for e in x.blocks]
```
That's probably possible to implement. I would prefer not to implement it in this PR though if possible.
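For reference, a minimal sketch of what 1-D iterator support might look like. The `Blocks` class and its attributes here are hypothetical stand-ins, not the PR's implementation:

```python
import numpy as np

class Blocks:
    """Hypothetical 1-D block accessor supporting iteration."""

    def __init__(self, data, chunk):
        self.data = data
        self.chunk = chunk
        self.numblocks = -(-len(data) // chunk)  # ceiling division

    def __getitem__(self, i):
        # Return the i-th chunk of the underlying data
        return self.data[i * self.chunk:(i + 1) * self.chunk]

    def __iter__(self):
        # Yield each block in order, enabling the usage pattern above
        return (self[i] for i in range(self.numblocks))

b = Blocks(np.arange(10), chunk=2)
sums = [e.sum() for e in b]
assert sums == [1, 5, 9, 13, 17]
```

Generalizing this to multiple dimensions raises the question asked next: which axis (or flattening order) should iteration follow?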
@jakirkham, how would you iterate in the case that there is more than one dimension?
dask/array/core.py (Outdated)

```python
from .slicing import normalize_index
if not isinstance(key, tuple):
    key = (key,)
key = tuple([k] if isinstance(k, Number) else k for k in key)
```
We should consider using `slice(k, k+1)` here instead of `[k]`. I'm not entirely sure which is better, yet, but these can differ due to weird edge cases of fancy indexing.
OK, let's definitely switch to `slice(k, k+1)` here.

Consider the case of indexing a 3D array like `array.blocks[0, :, 0]` with `chunks=1`. Now compare what the result would look like:

```python
>>> x = np.zeros((5, 6, 7))
>>> x[:1, :, :1].shape  # seems reasonable
(1, 6, 1)
>>> x[[0], :, [0]].shape  # where did the last dimension go?
(1, 6)
```
I could go on to more examples of strange behavior, but basically we want to stay as close to "basic indexing" as possible, rather than inadvertently triggering "advanced indexing".
It would be good to note this explicitly in the docs, though (that integers get converted into slices), so users know how to get vectorized indexing if/when desired by replacing integers with lists/arrays.
What would this return? A list of Dask Arrays?

```python
import dask.array as da

x = da.arange(10, chunks=2)
x.blocks[[0, 1]]
```
```python
In [1]: import dask.array as da
   ...: x = da.arange(10, chunks=2)
   ...: x.blocks[[0, 1]]
Out[1]: dask.array<getitem, shape=(4,), dtype=int64, chunksize=(2,)>
```
Done
Thanks for the test case by the way. The previous solution definitely failed.
What sort of performance do you see for the following?

```python
import dask.array as da

x = da.arange(10, chunks=2)
[x.blocks[i] for i in range(x.numblocks[0])]
```
```python
In [1]: import dask.array as da
   ...: x = da.arange(10, chunks=2)

In [2]: %timeit [x.blocks[i] for i in range(x.numblocks[0])]
264 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
The implementation here is decently simple.
Honestly it's more advanced than what I initially had in mind, though I expect this to be incredibly useful.
Fair point, @martindurant. Certainly tricky. Would probably turn to NumPy for ideas. Admittedly our use cases are not exactly the same, so some options would be dropped (writing, buffering details, etc.). The rough outline seems good initially though. My current thinking is this is probably best as a totally separate thing from
Any further comments here? If not then I plan to merge tomorrow.
dask/array/core.py (Outdated)

```python
index = normalize_index(index, self.numblocks)
index = tuple(slice(k, k + 1) if isinstance(k, Number) else k
              for k in index)
name = 'getitem-' + tokenize(self, index)
```
Would this clash with other `__getitem__` calls? Should we consider a different name here?
I don't think it matters much in this case, but I've renamed this to `block`.
```
Numpy-style slicing but now rather than slice elements of the array you
slice along blocks so, for example, ``x.blocks[0, ::2]`` produces a new
dask array with every other block in the first row of blocks.
```
Maybe add: "You can index blocks in any way that could index a numpy array of shape `tuple(map(len, array.chunks))`. Integer indices `k` are converted internally into a slice object `k:k+1`, to ensure that the array does not lose dimensions."
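The dimension-preservation point can be checked directly in numpy (a sketch of the conversion rule, not dask code):

```python
import numpy as np

x = np.zeros((5, 6, 7))

# An integer index drops its axis; the equivalent length-1 slice keeps it
assert x[0].shape == (6, 7)
assert x[slice(0, 1)].shape == (1, 6, 7)

# Converting each integer k to slice(k, k + 1) therefore keeps ndim
# stable, which is what block indexing needs
key = (0, slice(None), 0)
key = tuple(slice(k, k + 1) if isinstance(k, int) else k for k in key)
assert x[key].shape == (1, 6, 1)
```

Users who do want vectorized indexing can still pass lists or arrays instead of integers.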
Would simplify `tuple(map(len, array.chunks))` to `array.numblocks`.
Added something similar. I added a reference to `array.numblocks`. I spoke a bit more about how we don't change the dimension of the array, rather than how we achieve that with integers to slices.
OK, what happens if you actually try to do general vectorized indexing that changes dimensionality? Does this work like numpy, which would return a diagonal array of blocks? Depending on the chunks, this might not be possible to represent as a single dask array. If not, perhaps we should explicitly exclude "vectorized indexing" use cases.
We can probably make the blocks work this way. I'm finding that it's tricky to get the chunks right. Help would be welcome here on how to properly get new chunks from old chunks and the index. I suspect that there is some information I can place into a numpy array, apply the index, and then apply that information against the old chunks to get the new chunks.
Ah, realized that indexing like that won't work in general. It would assume that the chunk shapes in the leading dimensions are the same along that axis, which isn't true in general. We now error if there is more than one list in the input.
Force-pushed from d3b76e7 to 408c5fc.
```python
assert_eq(x.blocks[0], x[:2])
assert_eq(x.blocks[-1], x[-2:])
assert_eq(x.blocks[:3], x[:6])
assert_eq(x.blocks[[0, 1, 2]], x[:6])
```
Can we have a test where these are not sequential or otherwise nearby?
done
LGTM other than the small suggestion above.

Looks good to me!

Thanks for the review, all.
Fixes #3684
Fixes #3274
flake8 dask
cc @stuartarchibald and @jakirkham and @shoyer