Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which ak.Array constructor signatures should be allowed? e.g. [np.array([1, 2, 3]), np.array([4, 5, 6, 7])] #146

Closed
tamasgal opened this issue Mar 5, 2020 · 10 comments
Labels
good first issue Good for newcomers

Comments

@tamasgal
Copy link

tamasgal commented Mar 5, 2020

Maybe it's too early for such bug reports and I am also not using the latest master but 0.1.137 from PyPI but I started migrating and discovered that nested Numpy arrays are causing issues, as seen here:

In [30]: ak.Array([[1,2,3], [4,5,6]])
Out[30]: <Array [[1, 2, 3], [4, 5, 6]] type='2 * var * int64'>

In [31]: ak.Array([np.array([1,2,3]), np.array([4,5,6])])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-27b5fb98a962> in <module>
----> 1 ak.Array([np.array([1,2,3]), np.array([4,5,6])])

~/Dev/km3io/venv/lib/python3.7/site-packages/awkward1/highlevel.py in __init__(self, data, behavior)
     30             layout = awkward1.operations.convert.fromjson(data).layout
     31         else:
---> 32             layout = awkward1.operations.convert.fromiter(data).layout
     33         if not isinstance(layout, awkward1.layout.Content):
     34             raise TypeError("could not convert data into an awkward1.Array")

~/Dev/km3io/venv/lib/python3.7/site-packages/awkward1/operations/convert.py in fromiter(iterable, highlevel, behavior, initial, resize)
     53     out = awkward1.layout.ArrayBuilder(initial=initial, resize=resize)
     54     for x in iterable:
---> 55         out.fromiter(x)
     56     layout = out.snapshot()
     57     if highlevel:

ValueError: cannot convert 1 to an array element

In [32]: ak.Array([np.array([1,2,3]), np.array([4,5,6])][0])
Out[32]: <Array [1, 2, 3] type='3 * int64'>

In [33]: ak.Array([np.array([1,2,3]), np.array([4,5,6])][1])
Out[33]: <Array [4, 5, 6] type='3 * int64'>

In [34]: ak.__version__
Out[34]: '0.1.137'
@jpivarski
Copy link
Member

jpivarski commented Mar 5, 2020

The latest release and master are not far apart—usually there's no difference between them.

About this error, you gave it a Python list of two NumPy arrays. That pattern isn't convertible to an array, but I suppose it could be. Do you want it to mean a regular array with shape (2, 3) or a jagged array of length 2 that happens to have 3 elements in each subarray? It maybe you intended to write

ak.Array([[1, 2, 3], [4, 5, 6]])

(which gives you a jagged array) or

ak.Array(np.array([[1, 2, 3], [4, 5, 6]]))

(which gives you a regular array).

It's not too early to submit bug reports like this. This constructor goes through

awkward1.operations.convert.fromiter

Check that and see if it's what you want. You might want to add another case, it a better error message.

@tamasgal
Copy link
Author

tamasgal commented Mar 5, 2020

Thanks for the quick answer, as always!

My MWE was a bit too minimal ;) I am dealing with differing array lengths and even nested structures, so wrapping it into a numpy.array doesn't work as it yields dtype=object:

[ins] In [20]: a = ak.Array(np.array([[1,2,3], [3,4,5,6]]))                

[ins] In [21]: a                                                           
Out[21]: ---------------------------------------------------------------------------
...
...
ValueError: Numpy format "O" cannot be expressed as a PrimitiveType

But coming back to this numpy vs list interpretation, I think intuitively I'd expect that something like [np.array(...), np.array(...)] would help even more since it carries type information, so I'd expect to behave exactly like [list(), list()] with less type inference logic behind the scenes.

@jpivarski
Copy link
Member

Now that you've put different lengths into each subarray, it can't be a "normal" NumPy array and NumPy represents it with dtype "O" (an array of pointers to Python objects; useless for accelerating numerical code). So

ak.Array(np.array([[1,2,3], [3,4,5,6]]))

can't be an allowed procedure—NumPy has already broken the data before we get to it. On the other hand, maybe what we should do with NumPy "O" arrays is just treat them like Python lists and iterate over them? It's all a question of what you want the interface to do.

You're right that

ak.Array(np.array([[1, 2, 3], [4, 5, 6]])     # internally calls fromnumpy

is more lightweight than

ak.Array([[1, 2, 3], [4, 5, 6]])              # internally calls fromiter

The first casts the NumPy array (without even copying it, using fromnumpy) and the second iterates and performs type-inference on the fly (using fromiter). However, they result in different types:

>>> ak.typeof(ak.Array(np.array([[1, 2, 3], [4, 5, 6]])))
2 * 3 * int64
>>> ak.typeof(ak.Array([[1, 2, 3], [4, 5, 6]]))
2 * var * int64

The first array has size=3 burned into its type, whereas the second is size=var. It happens to have equal sizes in this case, but since it's not part of its type, it behaves differently—it behaves as though the sizes are variable (i.e. it's jagged).

When I say, define what you'd like it to do, I mean open a PR and propose a behavior for a case that currently errors-out. If you want NumPy dtype "O" to defer to fromiter, that's adding a one line check in fromnumpy. If you want lists of NumPy arrays to also iterate, then that's a one line check in fromiter. (Maybe two lines—I exaggerate.)

@tamasgal
Copy link
Author

tamasgal commented Mar 5, 2020

Yes I agree ;) I will think about it.

For numpy O-dtypes I wouldn't care too much to be honest. I am just a bit confused about the behaviour of lists (or iterators) of numpy arrays.

I'll walk through the internals of awkward1 with the debugger, I still need to learn how it works before I can make meaningful suggestions, besides high-level API design requests 😉

@jpivarski jpivarski added the good first issue Good for newcomers label Mar 6, 2020
@jpivarski jpivarski changed the title Converting nested numpy arrays Which ak.Array constructor signatures should be allowed? e.g. [np.array([1, 2, 3]), np.array([4, 5, 6, 7])] Mar 10, 2020
@jpivarski
Copy link
Member

Let me know if you're still thinking about this. It's okay to come back to it later.

@tamasgal
Copy link
Author

Sorry for the late response: yes I am still thinking about it. Meanwhile I set up the dev environment and playing around with it and have more questions and ideas already 😉

I think in first place it would be nice if the fromiter method would first look up potential valid (not "O") dtypes, so those could be inferred. This would solve that [[1,2,3], [4,5,6]] would create the same awkward array as [np.array([1,2,3]), np.array([4,5,6])].

@jpivarski
Copy link
Member

One thing we should be careful to keep in mind is that var * float64 is a different type from N * float64 for some integer N—a difference that data analyst users must be aware of—and constructor signatures involving NumPy arrays are for producing N types while non-NumPy iterables are for producing var types. If NumPy arrays are inside a variable-length container like a Python list, they should probably be treated as generic iterables (i.e. produce var types). The rules for whether var types or N types are produced should be as guessable as possible.

If you want to follow this particular rabbit hole, the Python-side fromiter function is actually pretty short:

https://github.com/scikit-hep/awkward-1.0/blob/e60faf1bf1815b44124a71f444c969194118481d/src/awkward1/operations/convert.py#L52-L60

It goes straight into C++ (where iteration over the Python objects was considerably faster, even though they have to do the same Python introspection). It's here:

https://github.com/scikit-hep/awkward-1.0/blob/e60faf1bf1815b44124a71f444c969194118481d/src/python/layout/content.cpp#L512-L564

I'm a little surprised that a NumPy array doesn't count as a py::iterable:

https://github.com/scikit-hep/awkward-1.0/blob/e60faf1bf1815b44124a71f444c969194118481d/src/python/layout/content.cpp#L553

I must have been assuming that it would when I wrote that. However, we can add an explicit test for py::array and then iterate over it as though it were a py::iterable.

Anything produced by ak::ArrayBuilder is of var type, so that would handle my concern above.

@tamasgal
Copy link
Author

The var and N type thing makes of course perfect sense, I totally forgot about the implications. I just started using pybind11 in serious projects so my knowledge is very limited. No idea why a NumPy array is not < py::iterable, the public docs are also very sparse on it 😉 https://pybind11.readthedocs.io/en/stable/reference.html#_CPPv48iterable

@jpivarski
Copy link
Member

The specific issue you were asking about has been resolved, so I'll be closing this.

>>> import awkward1 as ak
>>> import numpy as np

>>> ak.Array([[1, 2, 3], [4, 5, 6, 7]])
<Array [[1, 2, 3], [4, 5, 6, 7]] type='2 * var * int64'>

>>> ak.Array([np.array([1, 2, 3]), np.array([4, 5, 6, 7])])
<Array [[1, 2, 3], [4, 5, 6, 7]] type='2 * var * int64'>

@tamasgal
Copy link
Author

Thank you Jim!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants