-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] [strict mode] Allow returning lists instead of arrays for numpy batches #34734
Conversation
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
awesome, this is great! are the internal numpy arrays converted back into python lists when iterating over the outputs via |
Wow, this is genius, maybe it allows us to have the best of all worlds :D |
You still get np.arrays as output, but those are also Iterable (so if you try to use it as a list, it might be 95% the same?). Not sure if we can do better here without something like type extensions. |
One drastic possibility: Always convert np arrays of objects to lists. I.e. if map_batches returns a list, convert to np array of objects under the hood (this PR), if dataset has np arrays of objects, expose them to map_batches as a python list on the reading path. It is unconventional. We have to be careful. But it might be very ergonomic. If we do this, we would need to make it very clear in the docs, and tell people how to keep using numpy arrays of objects if they want to. If we do this, it needs careful user testing :) |
Alternatively, we document well how to use |
Yeah, I'm kind of skeptical of that because of the overhead. I think converting a dense numpy array into a python list may be zero copy if you use slice views, but if the user tries to concatenate that into a numpy array again that would almost certainly induce a copy of all the data. If we have to choose one format, the numpy one is the one that allows the best performance in all cases. |
Sounds good to me. When I was playing around with the hugging face code, the return codepath was the most annoying thing to deal with, which is 100% addressed by this PR. Great job. The |
On the efficiency, one thought though: I'm pretty sure if you convert to a list, it will just do shallow copies of the objects and numpy arrays of objects are already not-so-performant (because the values are boxed, there is no vectorization and little cache locality). Just wanted to throw this out. So I would mostly focus on user experience here and less on efficiency (if we still want to experiment with this option). |
The main argument against having special handling for np arrays of objects for me would be it will be inconsistent between numpy arrays of int / float / etc and numpy arrays of objects, so might be confusing :) |
Ah I think I get your argument better now. If we do the conversion by default / standardize too much on this, all our users' data will be boxed and nobody will use numpy arrays of primitive types. |
@@ -51,17 +51,29 @@ def validate_batch(batch: Block) -> None: | |||
) | |||
|
|||
if isinstance(batch, collections.abc.Mapping): | |||
for key, value in batch.items(): | |||
if not isinstance(value, np.ndarray): | |||
for key, value in list(batch.items()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why you need the list(
here -- shouldn't .items()
always return an object that can be iterated on with the for key, value in batch.items()
pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to avoid mutating the dict we are iterating over. Though, maybe in this case it is ok, since it doesn't change the keys.
) | ||
if not isinstance(value, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(value, list)
would be a little clearer here (or is there a difference?)
Yup the issue precisely is how to "unbox" the array data without too much overheads. If you default to arrays, then:
If you default to list, then:
|
Signed-off-by: Eric Liang <[email protected]>
…y batches (ray-project#34734) Allow map_batches UDFs to return {"foo": [1, 2, 3]} in addition to {"foo": np.array([1, 2, 3])} by implicitly casting lists to arrays. Signed-off-by: Jack He <[email protected]>
…y batches (ray-project#34734) Allow map_batches UDFs to return {"foo": [1, 2, 3]} in addition to {"foo": np.array([1, 2, 3])} by implicitly casting lists to arrays.
Why are these changes needed?
Allow map_batches UDFs to return
{"foo": [1, 2, 3]}
in addition to{"foo": np.array([1, 2, 3])}
by implicitly casting lists to arrays.