Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing ufunc-broadcasting across record fields #457

Closed
jpivarski opened this issue Sep 17, 2020 · 11 comments · Fixed by #510
Closed

Removing ufunc-broadcasting across record fields #457

jpivarski opened this issue Sep 17, 2020 · 11 comments · Fixed by #510
Labels
bug The problem described is something that must be fixed

Comments

@jpivarski
Copy link
Member

Currently, all ufuncs are broadcasted across all fields of a record:

>>> ak_array = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
>>> ak_array
<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>

>>> ak_array + 1
<Array [{x: 2, y: 2.1}, ... {x: 4, y: 4.3}] type='3 * {"x": int64, "y": float64}'>

This is causing some confusion because the fields of a record have qualitatively different meanings. Some are trigger booleans, some are momenta, some are ML-derived isolation variables, some are strings...

>>> ak.Array(["HAL"]) + 1                      # should this even work?
<Array [[73, 66, 77]] type='1 * var * uint8'>

>>> [chr(x) for x in (ak.Array(["HAL"]) + 1)[0].tolist()]
['I', 'B', 'M']

Furthermore, when @henryiii is writing vector, he has to distinguish between LorentzVector + LorentzVector accidentally working because they're Cartesian (but not preserving their Lorentzness) and getting the wrong answer because they're not Cartesian. Even though the + behavior is defined, due to the fact that they are records, he has to be sure to override every case.

I think there would be fewer surprises for both users and developers if broadcasting a ufunc through a record were an error (withe a nice error message). Custom behaviors for specialized records, like LorentzVectors, would still be possible to define, as they are now, but instead of replacing wrong behavior, they'd be replacing no behavior.

Note that NumPy does not define such an operation on structured arrays:

>>> np_array = np.array([(1, 1.1), (2, 2.2), (3, 3.3)], [("x", int), ("y", float)])
>>> np_array
array([(1, 1.1), (2, 2.2), (3, 3.3)], dtype=[('x', '<i8'), ('y', '<f8')])

>>> np_array + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid type promotion

Although it does work for Pandas:

>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]})
>>> df
   x    y
0  1  1.1
1  2  2.2
2  3  3.3
>>> df + 1
   x    y
0  2  2.1
1  3  3.2
2  4  4.3

it is not our intention to generalize from Pandas, only NumPy.

This would also affect ufuncs that return booleans, like comparison operators. For these, the argument isn't as strong. Maybe we want

>>> ak_array > 1
<Array [{x: False, y: True}, ... y: True}] type='3 * {"x": bool, "y": bool}'>

to work, maybe we don't.

I'm considering removing ufuncs-through-records for all ufuncs, without affecting the custom ufunc behavior that can be assigned to any record with a name. (I'm not considering ufuncs-on-strings right now, though that's something to think about.) Does anyone have a strong argument about that?

(I suppose this needs a deprecation cycle, though it would be a little difficult getting a warning into the middle of the broadcast-and-apply. I'm tempted to remove it all at once, like a band-aid...)

@jpivarski jpivarski added the policy Choice of behavior label Sep 17, 2020
@nsmith-
Copy link
Member

nsmith- commented Sep 18, 2020

For the record I did propose also to only disable ufuncs on arrays where layout.purelist_parameter("__record__") == None but I learned that this name is used for other purposes than just defining the desired behavioral mixin (e.g. if multiple differently-named record is encountered in ArrayBuilder, then the resulting array will be a union of each record type rather than a single record type with optional subfields).

@jpivarski
Copy link
Member Author

Also, I want to keep these rules of "what applies when" as simple as possible, to minimize surprising behavior.

@jpivarski
Copy link
Member Author

I'm not going to change the policy on this: the behavior will stay the same. (Although NumPy ufuncs won't be allowed on strings, as in the above example: #504.)

@nsmith-
Copy link
Member

nsmith- commented Oct 30, 2020

@lgray FYI
Isn't the risk of user error extremely large here?

@lgray
Copy link
Contributor

lgray commented Oct 30, 2020

So yeah, my 2 cents here is that the probability of the user doing something they don't intend is incredibly large.

Nick ran into an interesting case where NanoEvents jets (which are four vectors beneath) can be element-wise multiplied by each other. This doesn't have a defined behavior from a physics standpoint and would also be very easy to accidentally do or do intentionally without knowing what it meant. In either case it can result in technically valid but wrong behavior when writing analysis - which is a much less preferred outcome than throwing an error.

@jpivarski
Copy link
Member Author

The way it works now is more prone to error for the interpretation of records as objects. "Extremely" is subjective.

Heck, in crafting any kind of response to this, I'm only convincing myself that it needs to be done. It is the case that examples of this antifeature are actually in some of my slides, so "fixing" it will make my slides wrong.

But it is an important point. I think this issue could be implemented as an inhibitor that prevents broadcasting through any records unless a customization is defined. Note that that even means this:

>>> array = ak.Array([{"only": 1}, {"only": 2}, {"only": 3}])
>>> array + 10
<Array [{only: 11}, {only: 12}, {only: 13}] type='3 * {"only": int64}'>

If/when people complain that what used to work no longer works, I'll just have to explain that. (In the above example, they'd have to do array.only + 10.)

@jpivarski jpivarski reopened this Oct 30, 2020
@jpivarski jpivarski added bug The problem described is something that must be fixed and removed policy Choice of behavior labels Oct 30, 2020
@jpivarski
Copy link
Member Author

I'm calling it a bug because I've just defined the current behavior as wrong. (Even though I've presented it in talks.)

(A motivator for being short with these things is that the list of issues is a lot longer than I thought, and I have only until December 1 to make this awkward==1.0.0.)

@lgray
Copy link
Contributor

lgray commented Oct 31, 2020

Yeah - I agree that a good policy is to not allow broadcasting ufuncs through records unless that ufunc has been defined for that record type. This has the added benefit that people have to agree on the ufuncs/interface for whatever domain-specific set of records they want to use.

@jpivarski
Copy link
Member Author

@henryiii I'm working on this now: broadcasting will be prevented from passing through records unless an overload has been implemented. For example, array_of_records * 10 won't multiply all fields of the records by 10 unless those records have explicitly defined a behavior for np.multiply. Thus, a Lorentz vector object in cylindrical won't accidentally get its phi member multiplied by 10, but the cartesian object would have to explicitly define this.

It is a change in interface, and I've even presented examples of this interface in public, so I have to Do The Right Thing and give it a deprecation cycle. It's removing a feature, which is good because changing a feature doesn't have a way to provide both old and new behaviors for an overlapping time period. What will happen is this: if any function attempts to broadcast through a record, it will raise a warning now, saying that this will become an error in version 1.0.0 on Dec 1, 2020, unless that broadcast is an overloaded ufunc.

Then I'm going to push this change into a released version of Awkward Array right away, so that about a month of releases will get the message. I'm sure there are users who update less frequently than that, and unfortunately, they won't get the message. Post-1.0.0, I'll have to start publishing regular release schedules, so that deprecation warnings can refer to both a version number and a date.

@jpivarski
Copy link
Member Author

In Awkward Array 0.4.1, which I want to release later today, the warning that you get from

one = awkward1.Array([{"x": 1}, {"x": 2}, {"x": 3}])
two = awkward1.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}])
one + two

is

DeprecationWarning: In version 1.0.0 (target date: 2020-12-01), this will be an error.
(Set ak.deprecations_as_errors = True to get a stack trace now.)

ValueError: cannot broadcast: <class 'awkward1._ext.RecordArray'>, <class 'awkward1._ext.RecordArray'>

(https://github.com/scikit-hep/awkward-1.0/blob/0.4.1/src/awkward1/_util.py#L886)

But

def overload_add(left, right):
    return awkward1.Array({"x": left.x + right.x})

behavior = {}
behavior[numpy.add, "Overload", "Overload"] = overload_add

one = awkward1.Array([{"x": 1}, {"x": 2}, {"x": 3}], with_name="Overload", behavior=behavior)
two = awkward1.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}], with_name="Overload", behavior=behavior)

assert (one + two).tolist() == [{"x": 2.1}, {"x": 4.2}, {"x": 6.3}]

raises no warnings or errors.

@lgray
Copy link
Contributor

lgray commented Nov 3, 2020

Looks like the correct behavior to me. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants