Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extend to_layout options #2763

Merged
merged 17 commits into from
Oct 26, 2023
Merged

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 20, 2023

At present, to_layout eagerly converts anything it can recognise into an array. Yet, there are several places where we want to be stricter about what is supported and to be able to identify scalar inputs.

This PR extends the configurability of to_layout, to permit finer control of scalar input handling. It adds new parameters to ak.to_layout:

  • allow_none — whether None values are permitted as inputs
  • use_from_iter — whether iterable values are permitted as inputs
  • primitive_policy — the policy governing handling of scalars, e.g. promotion, pass-through, error
  • string_policy — the policy governing handling of strings, e.g. promotion, pass-through, as-characters, error
    I also removed the __cast__ default that casts str to length-1 arrays of strings.

@agoose77 agoose77 temporarily deployed to docs October 20, 2023 17:11 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the agoose77/feat-extend-to-layout-options branch from b17a58d to bba3276 Compare October 22, 2023 17:58
@agoose77 agoose77 temporarily deployed to docs October 22, 2023 18:08 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the agoose77/feat-extend-to-layout-options branch from 2b2b5ae to 3ce35d4 Compare October 22, 2023 18:15
@codecov
Copy link

codecov bot commented Oct 22, 2023

Codecov Report

Merging #2763 (175c52f) into main (b333fe4) will decrease coverage by 0.04%.
Report is 2 commits behind head on main.
The diff coverage is 91.66%.

Additional details and impacted files
Files Coverage Δ
src/awkward/_connect/numba/arrayview.py 93.37% <100.00%> (ø)
src/awkward/_connect/numexpr.py 90.66% <ø> (ø)
src/awkward/_nplikes/typetracer.py 76.58% <100.00%> (+0.12%) ⬆️
src/awkward/_slicing.py 86.86% <100.00%> (+0.26%) ⬆️
src/awkward/contents/content.py 76.25% <100.00%> (+0.09%) ⬆️
src/awkward/highlevel.py 76.53% <ø> (ø)
src/awkward/operations/ak_all.py 96.00% <100.00%> (ø)
src/awkward/operations/ak_any.py 96.00% <100.00%> (ø)
src/awkward/operations/ak_argcartesian.py 88.57% <ø> (ø)
src/awkward/operations/ak_argcombinations.py 86.95% <ø> (ø)
... and 67 more

@agoose77 agoose77 temporarily deployed to docs October 22, 2023 18:26 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs October 22, 2023 19:20 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

Annoyingly, I've run into the problem of specifying scalar-promotion rules again. We kicked the bucket down the road somewhat, but I think it's time to robustly solve this. At this stage, I think we can just replicate NumPy's rules for Python scalars, because Python scalars are expected to be non-kernel outputs (i.e. static).

However, if we have places where this doesn't hold, we might need to think about implementing NEP-50 (https://numpy.org/neps/nep-0050-scalar-promotion.html#motivation-and-scope) which proposes to remove the value-dependence that we're worried about in Awkward

@agoose77 agoose77 force-pushed the agoose77/feat-extend-to-layout-options branch from 9fec71e to 2cf737b Compare October 23, 2023 15:03
@agoose77 agoose77 temporarily deployed to docs October 23, 2023 15:20 — with GitHub Actions Inactive
@agoose77 agoose77 requested a review from jpivarski October 23, 2023 17:29
@agoose77 agoose77 marked this pull request as draft October 23, 2023 17:33
@agoose77 agoose77 temporarily deployed to docs October 23, 2023 17:35 — with GitHub Actions Inactive
@jpivarski
Copy link
Member

I also removed the __cast__ default that casts str to length-1 arrays of strings.

This might be needed for some commonsense broadcasting cases. For instance,

some_array["new_field"] = "a single string"

may need to be presented as ak.Array(["a single string"]) so that broadcasting sees this as one atom to duplicate len(some_array) times.

@agoose77 agoose77 force-pushed the agoose77/feat-extend-to-layout-options branch from 6bdd5eb to 75c70dc Compare October 24, 2023 00:26
@agoose77 agoose77 changed the base branch from main to agoose77/refactor-nplike-ufunc-type-promotion October 24, 2023 00:26
@agoose77 agoose77 marked this pull request as ready for review October 24, 2023 00:26
@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 24, 2023

@jpivarski the failing test in 3.12 on windows is due to the change in scalar promotion; Awkward promotes scalars to the ArrayBuilder default type (int64) rather than the NumPy default for the platform. I don't think that is necessarily a problem, though I wonder whether ArrayBuilder should default to the NumPy default dtype for the current platform rather than hard-coding int64. Any strong feelings here? The real question, though, is are you OK with changing this behavior such that untyped scalars go via ArrayBuilder rather than using asarray?

@agoose77 agoose77 temporarily deployed to docs October 24, 2023 00:46 — with GitHub Actions Inactive
@jpivarski
Copy link
Member

We should be interpreting Python int as int64. That's a widespread convention; only NumPy on Windows doesn't do it and I think that's a mistake.

If the scalars are really untyped (i.e. Python int or float), they can go via ArrayBuilder, but NumPy-typed scalars (i.e. np.int32 and np.float64) would lose their types if they went through ArrayBuilder. ArrayBuilder takes any integer as int64 and any floating point number as float64; it just doesn't have other input types compiled into it (which would dramatically increase its surface area, since its number of methods has to be equal to the number of classes, so handling $n$ input types means $n^2$ method-class pairs).

@agoose77
Copy link
Collaborator Author

We should be interpreting Python int as int64. That's a widespread convention; only NumPy on Windows doesn't do it and I think that's a mistake.

Fine! Although, now that's also a list that includes Emscripten whose default is 32 bit.

If the scalars are really untyped (i.e. Python int or float), they can go via ArrayBuilder, but NumPy-typed scalars (i.e. np.int32 and np.float64) would lose their types if they went through ArrayBuilder.

Yes, the np.generic scalars are handled via np.asarray :)

@jpivarski
Copy link
Member

Emscripten whose default is 32 bit

I ran into JavaScript being limited to 32-bit in the Kaitai web tool. That was quite a surprise.

Okay, there are two exceptions: NumPy in Windows and JavaScript/Emscripten.

Base automatically changed from agoose77/refactor-nplike-ufunc-type-promotion to main October 24, 2023 17:00
@agoose77 agoose77 force-pushed the agoose77/feat-extend-to-layout-options branch from 75c70dc to 5b8ed8c Compare October 24, 2023 17:16
@agoose77
Copy link
Collaborator Author

I just changed strings_as_characters to a new literal string_policy, and renamed scalar_policy to primitive_policy.

This allows finer control of strings, and brings some symmetry to the API

@agoose77 agoose77 temporarily deployed to docs October 24, 2023 17:30 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the agoose77/feat-extend-to-layout-options branch from f74cb28 to 271e679 Compare October 25, 2023 13:05
@agoose77 agoose77 temporarily deployed to docs October 25, 2023 13:19 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs October 25, 2023 14:26 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More granularity in the options is good.

Although a Python string is now by default turned into an array of characters, which is not good for broadcasting, the functions that use it set the argument to turn the Python string into a length-1 array of strings, so that nothing breaks.

We talked about how Python numbers are not always promoted to int64 and float64, but are promoted to whatever they need to be to not change the array they're combined with, which is what NumPy does, and since NumPy does it, we'll do it, too.

This allow_other has changed its meaning, so it should also change its name: it will become allow_unknown (Angus and I are on Zoom now). Then, if anybody is using allow_other, they'll be alerted to the change by a TypeError. When we scanned for usage in the Awkward-dependent repos, none of them are using ak.to_layout, so we don't need a deprecation cycle.

So one change is needed before merging: rename allow_otherallow_unknown. Other than that, it's done.

@agoose77 agoose77 enabled auto-merge (squash) October 25, 2023 15:36
@agoose77 agoose77 disabled auto-merge October 25, 2023 17:15
@agoose77 agoose77 temporarily deployed to docs October 25, 2023 17:22 — with GitHub Actions Inactive
@agoose77 agoose77 enabled auto-merge (squash) October 26, 2023 09:40
@agoose77 agoose77 merged commit 26cfc12 into main Oct 26, 2023
36 checks passed
@agoose77 agoose77 deleted the agoose77/feat-extend-to-layout-options branch October 26, 2023 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants