Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use None for unknown lengths (1 of 2) #2168

Merged
merged 21 commits into from
Jan 30, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jan 27, 2023

TLDR

  • This PR makes None the token for unknown lengths. Only TypeTracer actually produces a shape with Nones.
  • Remove UnknownLength = UnknownScalar(np.int64) alias
  • Breaking change: using length=None for unknowns means that RecordArray(..., length=None) needs to change behavior.1
  • Extend nplike to understand how to manipulate known and unknown lengths
  • NumpyLike.known_shape now has concrete meaning: "does this NumpyLike support unknown dimension sizes?"

DR

This adds some verbosity to our length handling code in the Content methods, but this comes at a benefit in terms of explicitness. I chose to make shape-item-handling an nplike detail, rather than a a cross-nplike standard. This means that nplike X should not work with shapes from nplike Y. Instead, the nplike conversion routines should handle this.

The previous approach set UnknownLength = UnknownScalar(np.int64). Benefits include

  • no need to coerce lengths to valid/invalid values (unknown scalars behave like integers)
  • single family of types (external unknown values are the same types as those returned by internal kernel operations and Content.length

The downsides of this are:

  • form serialisation needs conversion to/from unknown scalars
  • differs from Array API choice of None for unknown dimensions
  • couples unknown shape-ness to unknown data-ness2

This PR uses UnknownLength = None, making unknown sizes more obvious, and further increasing the "safety" aspect by using a non-numeric type. None is easily serialised, and can be included in forms without serialisation.

The use of None means that external data needs to be treated carefully if it is used to compute lengths.

Footnotes

  1. This PR requires that length is unset in order to ignore it. We could use if length is None and backend.nplike.known_shape to warn users of this existing usage (as you can't have None lengths for known-shape nplikes)

  2. I can't actually think of a need to create a new nplike that has one but not the other, but I digress.

@agoose77 agoose77 marked this pull request as draft January 27, 2023 17:27
@agoose77 agoose77 temporarily deployed to docs-preview January 27, 2023 17:43 — with GitHub Actions Inactive
@agoose77 agoose77 marked this pull request as ready for review January 27, 2023 22:01
Comment on lines +692 to +699
length = unset
for x in inputs:
if isinstance(x, Content):
if length is None:
if length is unset:
length = x.length
elif backend.nplike.known_shape:
assert length == x.length
assert length is not None
assert length is not unset
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uneasy about how we handle unknown lengths here. There's a lot to consider in broadcasting, and I've not spent a lot of time in my mind on it. I'm hopeful that this PR mostly just renames UnknownLength, and therefore we can review this logic down the road.

@agoose77 agoose77 requested a review from jpivarski January 27, 2023 22:21
@agoose77 agoose77 temporarily deployed to docs-preview January 27, 2023 22:31 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad that this is relative to the state in which UnknownLength still exists, so that it can be a renaming of that token, rather than UnknownLengthUnknownScalarNone.

So far it looks good, and we discussed it on Slack. I see there are a few test errors still, but you have my approval.

It's good to see more typing, too.

@agoose77
Copy link
Collaborator Author

@jpivarski note that this PR doesn't yet address usage of size=None of RegularArray; it just sets the ground work for it.

@codecov
Copy link

codecov bot commented Jan 27, 2023

Codecov Report

Merging #2168 (f470e61) into main (3bab57d) will decrease coverage by 0.02%.
The diff coverage is 86.94%.

❗ Current head f470e61 differs from pull request most recent head 0682a21. Consider uploading reports for the commit 0682a21 to get more accurate results

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_slicing.py 85.20% <25.00%> (-0.07%) ⬇️
src/awkward/contents/emptyarray.py 72.82% <50.00%> (-0.26%) ⬇️
src/awkward/contents/indexedarray.py 77.69% <50.00%> (-0.14%) ⬇️
src/awkward/contents/unmaskedarray.py 74.27% <50.00%> (-0.21%) ⬇️
src/awkward/_nplikes/typetracer.py 71.04% <74.60%> (-0.08%) ⬇️
src/awkward/_nplikes/numpylike.py 74.81% <75.00%> (-0.38%) ⬇️
src/awkward/contents/recordarray.py 84.50% <82.22%> (+0.10%) ⬆️
src/awkward/contents/bitmaskedarray.py 68.22% <83.33%> (-0.13%) ⬇️
src/awkward/_nplikes/array_module.py 89.83% <84.21%> (-2.17%) ⬇️
src/awkward/contents/numpyarray.py 91.31% <85.71%> (-0.15%) ⬇️
... and 18 more

@agoose77 agoose77 force-pushed the agoose77/feat-shape-item-none branch from fbe7108 to 7fa2d99 Compare January 27, 2023 22:43
@agoose77 agoose77 temporarily deployed to docs-preview January 27, 2023 22:50 — with GitHub Actions Inactive
@agoose77 agoose77 closed this Jan 27, 2023
@agoose77 agoose77 reopened this Jan 27, 2023
Comment on lines 355 to 373
_numnull = ak.index.Index64.empty(1, nplike=self._backend.index_nplike)

assert (
numnull.nplike is self._backend.index_nplike
_numnull.nplike is self._backend.index_nplike
and self._mask.nplike is self._backend.index_nplike
)
self._handle_error(
self._backend[
"awkward_ByteMaskedArray_numnull",
numnull.dtype.type,
_numnull.dtype.type,
self._mask.dtype.type,
](
numnull.data,
_numnull.data,
self._mask.data,
self._mask.length,
self._valid_when,
)
)
numnull = self._backend.index_nplike.as_shape_item(_numnull[0])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new "pattern". Any kernels that are used to determine a scalar (invariably a length) are prefixed with _, before conversion to a shape item with as_shape_item. This ensures we have only integers or None.

@agoose77 agoose77 temporarily deployed to docs-preview January 27, 2023 23:10 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview January 27, 2023 23:44 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview January 28, 2023 00:52 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

agoose77 commented Jan 28, 2023

With this PR, the distinction between indices (data) and lengths (metadata) is much stronger. Aligning closer with @jpivarski's original separation of length metadata from "data".

They are now explicit groups, (ShapeItem and ArrayLike-scalars) with one-to-many mapping between the shape information and the offsets (shape items are non-negative, or None, with no fixed dtype)

Merging this PR will raise questions about _getitem. At present we implicitly assume _getitem only handles Python scalars or NumPy generic values. We need to broaden this both for all of the other nplikes, that use 0-D arrays as their scalar type.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jan 28, 2023

@jpivarski what do you think about the future-breaking change to RecordArray? It only breaks existing users who pass in length=None under the typetracer backend; for nplike.known_shape=True we currently convert length=None to length=unset, because the length=None has no meaning for known_shape=True.

  • Update deprecation schedule.

@agoose77 agoose77 temporarily deployed to docs-preview January 28, 2023 12:19 — with GitHub Actions Inactive
@agoose77 agoose77 changed the title feat: use None for unknown lengths feat: use None for unknown lengths (1 of 2) Jan 28, 2023
@jpivarski
Copy link
Member

We have a policy of not making backward-breaking changes without a deprecation period because it disrupts users and makes a tangle of which versions of interrelated libraries can work together. It is possible to be 100% strict about it, since we have defined what "public" means in a mechanically verifiable way (the underscores), but I'd rather be pragmatic.

This change would be hard to put through a deprecation cycle (waiting two minor versions before this PR can be merged, and all of the other work that depends on it), and the case that it affects is obscure: developers using typetracers (that's probably only Coffea and dask-awkward; it's a new interface) who explicitly pass length=None to the RecordArray constructor, rather than letting it be the default. If somebody's actually interested in a RecordArray's length, it's because it might have zero fields (an unusual situation, but it's common to want to be defensive about such a possibility), and then one would want to pass an integer length to not let it be derived from the possibly-nonexistent contents.

Even if we don't actually see a path to breakage through an interface change, we should still be careful because it could break in a way that we don't foresee. But this case is so obscure that it's a risk I think we can take.

Here's a way that an unforseen issue could come in from something like this: if dask-awkward were mirroring the RecordArray interface with a length=None default, and they don't update to a length=unset default along with Awkward. Then any code that uses their RecordArray without setting length would be saying that the length is unknown, rather than deriving it from contents. But dask-awkward doesn't wrap layouts, and it's hard to imagine anything that does in this way (and is also sensitive to typetracer).

We should call this kind of situation "reasonable risk" and not let it slide to such an extent that we're doing it all the time. In other words, we can only cheat if we think we can get away with it. :)

@agoose77 agoose77 force-pushed the agoose77/feat-shape-item-none branch 2 times, most recently from 7bb2c00 to f470e61 Compare January 30, 2023 19:08
@agoose77 agoose77 temporarily deployed to docs-preview January 30, 2023 19:17 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the agoose77/feat-shape-item-none branch from f470e61 to 02f62d3 Compare January 30, 2023 19:23
@pre-commit-ci pre-commit-ci bot temporarily deployed to docs-preview January 30, 2023 19:32 Inactive
@agoose77 agoose77 enabled auto-merge (squash) January 30, 2023 20:22
@agoose77 agoose77 temporarily deployed to docs-preview January 30, 2023 20:45 — with GitHub Actions Inactive
@agoose77 agoose77 disabled auto-merge January 30, 2023 20:50
@agoose77 agoose77 merged commit 71875df into main Jan 30, 2023
@agoose77 agoose77 deleted the agoose77/feat-shape-item-none branch January 30, 2023 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants