
Implement .blocks accessor #3689

Merged
mrocklin merged 6 commits into dask:master from array-block-accessor on Jul 1, 2018

Conversation

mrocklin
Member

@mrocklin mrocklin commented Jun 29, 2018

>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.blocks[[-1, 0]].compute()
array([8, 9, 0, 1])

Fixes #3684
Fixes #3274

  • Tests added / passed
  • Passes flake8 dask

cc @stuartarchibald and @jakirkham and @shoyer

@jakirkham
Member

Thanks Matt. This seems like a good idea. Will try to review later.

Would it be possible to have iterator support? For example, I'm imagining a usage pattern like the following.

import dask.array as da
x = da.arange(10, chunks=2)
r = [e.sum() for e in x.blocks]

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@martindurant
Member

@jakirkham , how would you iterate in the case that there is more than one dimension?

from .slicing import normalize_index
if not isinstance(key, tuple):
    key = (key,)
key = tuple([k] if isinstance(k, Number) else k for k in key)
Member

We should consider using slice(k, k+1) here instead of [k]. I'm not entirely sure which is better, yet, but these can differ due to weird edge cases of fancy indexing.

Member

OK, let's definitely switch to slice(k, k+1) here.

Consider the case of indexing a 3D array like array.blocks[0, :, 0] with chunks=1. Now compare what the result would look like:

>>> import numpy as np
>>> x = np.zeros((5, 6, 7))
>>> x[:1, :, :1].shape  # seems reasonable
(1, 6, 1)
>>> x[[0], :, [0]].shape  # where did the last dimension go?
(1, 6)

I could go on to more examples of strange behavior, but basically we want to stay as close to "basic indexing" as possible, rather than inadvertently triggering "advanced indexing".

It would be good to note this explicitly in the docs, though (that integers get converted into slices), so users know how to get vectorized indexing if/when desired by replacing integers with lists/arrays.
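For illustration, a minimal hedged example of the dimension-preserving behavior described above (it relies on the integer-to-slice conversion in this PR; the array and chunk sizes are arbitrary):

```python
import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))   # a 2 x 2 grid of blocks
print(x.blocks[0].shape)             # (2, 6): an integer block index keeps the dimension
print(x.blocks[0, 1].shape)          # (2, 3): a single block, still 2-D
```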

@jakirkham
Member

What would this return? A list of Dask Arrays?

import dask.array as da
x = da.arange(10, chunks=2)
x.blocks[[0, 1]]

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@jakirkham
Member

What sort of performance do you see for the following?

import dask.array as da

x = da.arange(10, chunks=2)
[x.blocks[i] for i in range(x.numblocks[0])]

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@mrocklin
Member Author

mrocklin commented Jun 29, 2018 via email

@jakirkham
Member

Honestly, it's more advanced than what I initially had in mind, though I expect this to be incredibly useful.

@jakirkham
Member

how would you iterate in the case that there is more than one dimension?

Fair point, @martindurant. Certainly tricky.

Would probably turn to NumPy for ideas. Admittedly our use cases are not exactly the same, so some options would be dropped (writing, buffering details, etc.), but the rough outline seems good initially. My current thinking is that this is probably best as a totally separate thing from blocks (maybe iter_blocks?).
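For what it's worth, a minimal sketch of one way to walk every block in the multi-dimensional case, using only the .blocks accessor from this PR (iter_blocks itself is hypothetical and not implemented here; the array is arbitrary):

```python
import itertools
import dask.array as da

x = da.ones((4, 6), chunks=(2, 3))   # a 2 x 2 grid of blocks
# iterate over every block index and reduce each block separately
sums = [x.blocks[idx].sum() for idx in itertools.product(*map(range, x.numblocks))]
```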

@mrocklin
Member Author

Any further comments here? If not then I plan to merge tomorrow.

index = normalize_index(index, self.numblocks)
index = tuple(slice(k, k + 1) if isinstance(k, Number) else k
              for k in index)
name = 'getitem-' + tokenize(self, index)
Member

Would this clash with other __getitem__ calls? Should we consider a different name here?

Member Author

I don't think it matters much in this case, but I've renamed this to block

Numpy-style slicing but now rather than slice elements of the array you
slice along blocks so, for example, ``x.blocks[0, ::2]`` produces a new
dask array with every other block in the first row of blocks.

Member

Maybe add:

You can index blocks in any way that could index a numpy array of shape tuple(map(len, array.chunks)). Integer indices k are converted internally into a slice object k:k+1, to ensure that the array does not lose dimensions.

Member

Would simplify tuple(map(len, array.chunks)) to array.numblocks.

Member Author

Added something similar. I added a reference to array.numblocks, and spoke a bit more about how we don't change the dimensionality of the array, rather than about how we achieve that by converting integers to slices.
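As a quick, hedged illustration of the documented behavior discussed above (using the docstring's own example; the array shape is arbitrary):

```python
import dask.array as da

x = da.ones((4, 8), chunks=(2, 2))                # a 2 x 4 grid of blocks
y = x.blocks[0, ::2]                              # first row of blocks, every other block
print(y.shape)                                    # (2, 4)
print(x.numblocks == tuple(map(len, x.chunks)))   # True
```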

@shoyer
Member

shoyer commented Jun 29, 2018

OK, what happens if you actually try to do general vectorized indexing that changes dimensionality? e.g.,

x = da.ones((2, 2), chunks=1)
x.blocks[[0, 1], [0, 1]]

Does this work like numpy, which would return a diagonal array of blocks? Depending on the chunks, this might not be possible to represent as a single dask array.

If not, perhaps we should explicitly exclude "vectorized indexing" use cases.

@mrocklin
Member Author

We can probably make the blocks work this way. I'm finding that it's tricky to get the chunks right. Help would be welcome here on how to properly get new chunks from old chunks and the index. I suspect that there is some information I can place into a numpy array, apply the index, and then apply that information against the old chunks to get the new chunks.
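A minimal sketch of that idea for the simple per-dimension case (a hypothetical helper, not code from this PR), assuming the index has already been normalized to one entry per dimension:

```python
import numpy as np

def new_chunks(old_chunks, index):
    """Apply a per-dimension block index to the old chunk sizes."""
    # old_chunks: tuple of tuples of chunk sizes, e.g. ((2, 2, 2, 2, 2),)
    # index: one slice or list per dimension, e.g. (slice(0, 3),)
    return tuple(tuple(np.asarray(sizes)[ix].tolist())
                 for sizes, ix in zip(old_chunks, index))

print(new_chunks(((2, 2, 2, 2, 2),), (slice(0, 3),)))  # ((2, 2, 2),)
```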

@mrocklin
Member Author

Ah, realized that indexing like that won't work in general. It would assume that the chunk shapes in the leading dimensions are the same along that axis, which isn't true in general. We now error if there is more than one list in the input.
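A hedged illustration of that restriction (the exact error type and message depend on the final implementation):

```python
import dask.array as da

x = da.ones((2, 2), chunks=1)      # a 2 x 2 grid of single-element blocks
x.blocks[[0, 1], 0]                # a single list in the index is fine
# x.blocks[[0, 1], [0, 1]]         # more than one list is expected to raise an error
```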

@mrocklin mrocklin force-pushed the array-block-accessor branch from d3b76e7 to 408c5fc on June 30, 2018 11:44
assert_eq(x.blocks[0], x[:2])
assert_eq(x.blocks[-1], x[-2:])
assert_eq(x.blocks[:3], x[:6])
assert_eq(x.blocks[[0, 1, 2]], x[:6])
Member

Can we have a test where these are not sequential or otherwise nearby?

Member Author

done
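A hedged sketch of the kind of non-sequential test requested above (the test actually added in the PR may differ):

```python
import dask.array as da
from dask.array.utils import assert_eq

x = da.arange(10, chunks=2)
# blocks 3 and 0, deliberately out of order and not adjacent
assert_eq(x.blocks[[3, 0]], da.concatenate([x[6:8], x[:2]]))
```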

@jakirkham
Member

LGTM other than the small suggestion above.

@shoyer
Member

shoyer commented Jul 1, 2018

Looks good to me!

@mrocklin mrocklin merged commit 6feffaa into dask:master Jul 1, 2018
@mrocklin mrocklin deleted the array-block-accessor branch July 1, 2018 11:19
@mrocklin
Member Author

mrocklin commented Jul 1, 2018

Thanks for the review, all.

convexset added a commit to convexset/dask that referenced this pull request Jul 1, 2018
….com/convexset/dask into fix-tsqr-case-chunk-with-zero-height

* 'fix-tsqr-case-chunk-with-zero-height' of https://github.com/convexset/dask:
  fixed typo in documentation and improved clarity
  Implement .blocks accessor (dask#3689)
  Fix wrong names (dask#3695)
  Adds endpoint and retstep support for linspace (dask#3675)
  Add the @ operator to the delayed objects (dask#3691)
  Align auto chunks to provided chunks, rather than shape (dask#3679)
  Adds quotes to source pip install (dask#3678)
  Prefer end-tasks with low numbers of dependencies when ordering (dask#3588)
  Reimplement argtopk to release the GIL (dask#3610)
  Note `da.pad` can be used with `map_overlap` (dask#3672)
  Allow tasks back onto ordering stack if they have one dependency (dask#3652)
  Fix extra progressbar (dask#3669)
  Break apart uneven array-of-int slicing to separate chunks (dask#3648)
  fix for `dask.array.linalg.tsqr` fails tests (intermittently) with arrays of uncertain dimensions (dask#3662)