-
Notifications
You must be signed in to change notification settings - Fork 39
[WIP] more JaggedArray.__getitem__ cases in awk-cpp #156
Conversation
To work with Also, I found out that I don't need to add any code to handle negative >>> test = numpy.arange(5)
>>> test[::1].ctypes.data
109515352
>>> test[::-1].ctypes.data
109515368 That means I already properly handle negative strides, because I step through arrays like this: int N = starts_info.strides[0] / starts_info.itemsize;
for (ssize_t i = 0; i < starts_info.size; i++) {
if (starts_ptr[i * N] < 0) {
// ...
}
} |
It's great to hear that the code doesn't have to change for negative indexes, but do you have an explicit test? |
- Added a utility function in util.h to slice a py::array - Added a non-static function to check the starts/stops/content validity in jagged.h whenever a JaggedArray is instantiated - Added fromoffsets and python_fromoffsets - Added fromcounts and python_fromcounts
- Added fromparents and python_fromparents - Added fromuniques and python_fromuniques - Added copy - Added deepcopy in JaggedArray and NumpyArray, pyarray_deepcopy in util.h, and python_deepcopy
- Fixed issue where one of the negative tests didn't catch an exception fast enough (because of the constructor validity check) - Moved python-esque getitem(ssize_t) negative index handling to the python-wrapping version of getitem(ssize_t)
@jpivarski I've been working on implementing >>> test = awkward.cpp.JaggedArray.fromiter([[[1, 4, 5, 4], [1, 5, 5]], [[], [3]], []])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The maximum of starts for non-empty elements must be less than the length of content The above array looks like it should be a valid jagged array, but the specifications document does say that the maximum of Is that okay as it is, or should I change something to allow that sort of jagged array to be created? |
I've got another problem: Right now, I'm overriding the I would say my options are:
I feel like 2 would be much simpler, but I don't know if it's even possible, from what I've seen. The documentation for the Here is some code showing what the issue is: >>> a = awkward.cpp.JaggedArray([0, 3, 4, 4], [3, 4, 4, 6], numpy.arange(8))
>>> a
<JaggedArray [[0 1 2] [3] [] [4 5]] at 0x71c0940>
>>> b = awkward.cpp.JaggedArray.fromcounts([3, 1, 0, 2], numpy.arange(8))
>>> b
<JaggedArray [[0 1 2] [3] [] [4 5]] at 0x71c0b68>
>>> a[0:2]
<JaggedArray [[0 1 2] [3]] at 0x71c0e38>
>>> b[0:2]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __getitem__(): incompatible function arguments. The following argument types are supported:
1. (self: awkward.cpp.array.array_impl.JaggedArray, arg0: int) -> object
2. (self: awkward.cpp.array.array_impl.JaggedArray, arg0: int, arg1: int, arg2: int) -> object
Invoked with: <JaggedArray [[0 1 2] [3] [] [4 5]] at 0x71c0b68>, slice(0, 2, None) Where |
Don't start on >>> import awkward, awkward.numba
>>> awkward.fromiter([[[1, 4, 5, 4], [1, 5, 5]], [[], [3]], []], awkwardlib=awkward.numba)
<JaggedArrayNumba [[[1 4 5 4] [1 5 5]] [[] [3]] []] at 0x7f0bd26c6048> The Most or all of the functions that aren't attached to an awkward array but produce awkward arrays have an |
Oh, but your error is probably due to the fact that the last inner array is empty. Empty arrays can be represented by any It doesn't matter that there's no data there because the length of the inner array is zero. So instead of |
From those error messages, you need to have a clearer separation between the Python objects and the C++ objects, and they need to go through a boundary layer that wraps and unwraps the Pythonness. So first off, the C++ function that does getitem and takes different signatures for different index types (e.g. number, slice, ...) should not be named The function (probably written on the C++ side, but not necessarily) that interprets Pythonic indexes (numbers, slices, arrays, tuples of all of the above) should take a generic You might not want to write constructors (like There are many ways to do this. The two you mentioned don't sound like the easiest. Here are others.
Don't make things difficult on yourself by trying to make it all happen on the C++ side. |
I figured out how to unpack slices on the C++ side, and it's so conveniently made that it gives the |
I also figured out how to create default arguments in the binding code, so I carried over the other python-side function as well. Now everything is written in C++. |
Just to make sure: In |
You'll want to have at least one getitem interface that's extremely lightweight, with no bounds checking or negative index handling (and hopefully the compiler just inlines that), and you'll need to do these checks in another interface (to make sure the user doesn't do something dumb). Where you put them is up to you. |
Do you think it's worth it to make the C++ class inherit functions directly from awkward.array.jagged instead of having a python child do it? As shown here, it's possible to have a pybind11-bound C++ class inherit methods from a python class, and that would remove any need to have an in-between now that I have everything implemented on the C++ side. |
That example you're pointing to is just invoking Python statements from C++. It's equivalent to just writing the Python, but it might save you from having two classes to carry around. |
I think I've decided not to mess with the strategy shown in that other GitHub thread. The most perfect solution for creating a class which implements awkward array in C++ would be to fully recreate every function listed for awkward array in the C++ While that would be nice to see, it would not be possible until far in the future, so until then I'll just continue translating jagged array methods from |
I've spent a long time trying to figure out (from Here's how I thought it was working: Now, I'm pretty sure that was wrong. I wrote out that process on the examples in the bool array test, and I got far different results from the listed output, so I'm more or less back to square one on this. How exactly should these functions be called? Is it recursive, or just case-by-case? The bool array one is especially stumping me: def test_numba_getitem_tuple_slice_boolarray(self):
a = numpy.arange(36).reshape(4, 3, 3)
a2 = awkward.fromiter(a)
@numba.njit
def test1(x, i):
return x[1:3, i]
assert test1(a, numpy.array([True, False, True])).tolist() == [[[9, 10, 11], [15, 16, 17]], [[18, 19, 20], [24, 25, 26]]]
assert test1(a2, numpy.array([True, False, True])).tolist() == [[[9, 10, 11], [15, 16, 17]], [[18, 19, 20], [24, 25, 26]]]
@numba.njit
def test2(x, i, j):
return x[1:3, i, j]
assert test2.py_func(a, numpy.array([True, False, True]), numpy.array([True, True, False])).tolist() == [[9, 16], [18, 25]]
assert test2(a2, numpy.array([True, False, True]), numpy.array([True, True, False])).tolist() == [[9, 16], [18, 25]] |
@jpivarski What do you think of my most recent commit? I created a new C header file called In the C++ file, I made a method called I believe the C header file is C-compiled, because of the |
For standard integer sizes, you should use the Also, there are built-in headers for byte swapping—I just don't know what they're called. In your project, it's more important to find these standards than to correctly write your own, because we'll be using standards in the long run. |
The byte-swapping headers for C are all system-dependent (to make it more efficient), so I don't know if it's necessary to find something better than my utility functions. |
What about this? https://linux.die.net/man/3/htonl |
I think that's just for Linux systems. I looked up where to get that header and the solutions was that Microsoft has its own library for that purpose. |
Okay, fine. I guess this is simple enough to not need a header. I had just thought it would be in the same place on all systems (instead, we'd probably find ourselves putting in preprocessor directives like, "#if Windows, do this..."). |
Just to make sure: I can design this to only use signed types, right? |
I think I need to draw out a better structure for myself to visualize how this is going to be set up, because what I had before got too messy and I made some mistakes with the polymorphism. Generally, I'm thinking the CPU methods should only cover the looping, computation-intensive sections of code, with everything else continuing to exist how it has been in the C++. In addition to what is in What I covered yesterday and today was good practice, but I was too ambitious and more or less broke my code, so I reverted and will start again from where I was on Monday. |
We can focus on just signed types. It doesn't exclude much available range. |
@jpivarski I'd like your opinion on the solution I found for this bug: The issue was that I was taking the pointers from the beginnings of the struct c_array py2c(py::array input) {
py::buffer_info info = input.request();
if (info.ndim > 15) {
throw std::invalid_argument("Array cannot exceed 15 dimensions");
}
char format[15];
strcpy(format, info.format.c_str());
ssize_t shape[15];
std::copy(info.shape.begin(), info.shape.end(), shape);
ssize_t strides[15];
std::copy(info.strides.begin(), info.strides.end(), strides);
struct c_array out = {
info.ptr,
info.itemsize,
info.size,
format,
info.ndim,
&shape[0],
&strides[0]
};
return out;
} I feel like the copying being done here is too much strain on processing time and memory. Can you think of a more efficient way to deal with this issue? |
Also, is this PR good to merge? The tests that are failing are all in |
To fix the Arrow tests, do a I'll have to look at your Numpy fix later—it's directly relevant for data ownership, which is what you should be focusing on. |
Okay, I've read this comment and this is not a good solution. You're taking ownership of the data by copying it. Whether this is a performance problem for a specific application depends on the application, but some applications are going to want zero-copy adoption of memory, so that's something we want to learn how to do. Python defines data ownership in terms of a reference count. In the pure C API, there are When an object is created in Python, its reference count increments to When an object's reference count is This tutorial describes Python reference counting and introduces the relevant terminology: "new reference," "borrowed reference," and "stolen reference." These are the English terms we use when the language doesn't track ownership for us, as Python does and as C++ Numpy adds another layer to this. A Numpy array is a Python object that points to raw bytes somewhere in memory. The Python object has a reference count, like any other Python object, and it also has an Here's an example of two Numpy arrays, one owns the data, another does not. >>> import numpy
>>> a = numpy.arange(10)
>>> b = a[::2]
>>> a.flags.owndata
True
>>> b.flags.owndata
False
>>> a.base, b.base # base is None or the array that an array is derived from
(None, array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
>>> b.base is a # Python's "is" means they are the same object in memory
True
>>> id(b.base), id(a) # Python's id(x) is a unique integer for each object
(138444522648560, 138444522648560)
>>> id(b.base) == id(a)
True
>>> bcopy = b.copy()
>>> bcopy.flags.owndata
True
>>> print(bcopy.base)
None Another way that a Numpy array can not own its data is if we got it from somewhere else, some arbitrary bytes that Python generically calls a "buffer." A buffer is not a Python type, but a type in the Python C API—objects that support the buffer protocol are anything that has a pointer to raw bytes somewhere, including strings, bytestrings, and Numpy arrays. Python 3 has a similar concept called a memoryview—I don't know the differences between buffers and memoryviews, but they're pretty similar. >>> x = b"got this from somewhere"
>>> y = numpy.frombuffer(x, dtype=numpy.uint8)
>>> y
array([103, 111, 116, 32, 116, 104, 105, 115, 32, 102, 114, 111, 109,
32, 115, 111, 109, 101, 119, 104, 101, 114, 101], dtype=uint8)
>>> y.flags.owndata
False We will frequently be viewing Apache Arrow buffers as arrays. This is something we want to do a lot, and we want to do it zero-copy. It's one of the selling points for awkward-array. So in your implementation, you really should not copy the data but take ownership of it. The Numpyish way to do it is to make our objects in C++ be reference-counted Python objects, hopefully by some easy connection between How exactly that works is up to you, but this is the requirement: own the data without copying it. |
No description provided.