
[WIP] more JaggedArray.__getitem__ cases in awk-cpp #156

Merged: 27 commits into master, Jul 17, 2019

Conversation

EscottC
Collaborator

@EscottC EscottC commented Jun 27, 2019

No description provided.

@jpivarski jpivarski changed the title [WIP] JaggedArrays in awk-cpp [WIP] more JaggedArray.__getitem__ cases in awk-cpp Jun 27, 2019
@EscottC
Collaborator Author

EscottC commented Jun 29, 2019

To handle None values in Python slices (and to actually unpack a slice, which I don't think py::slice supports), I receive the where input on the Python side and then send that data to C++ with proper default values filled in. After much trial and error, JaggedArray slicing now works on the Python side exactly as it would for a normal array.

Also, I found out that I don't need to add any code to handle negative strides, because slicing an array with a negative stride automatically moves its data pointer to the end of the buffer, which is really convenient:

>>> test = numpy.arange(5)
>>> test[::1].ctypes.data
109515352
>>> test[::-1].ctypes.data
109515368

That means I already handle negative strides properly, because I step through arrays like this:

int N = starts_info.strides[0] / starts_info.itemsize;  // stride in items, not bytes
for (ssize_t i = 0; i < starts_info.size; i++) {
    if (starts_ptr[i * N] < 0) {  // index through the buffer via the item stride
        // ...
    }
}

@jpivarski
Member

It's great to hear that the code doesn't have to change for negative indexes, but do you have an explicit test?

EscottC added 4 commits July 1, 2019 15:11
- Added a utility function in util.h to slice a py::array
- Added a non-static function to check the starts/stops/content validity
in jagged.h whenever a JaggedArray is instantiated
- Added fromoffsets and python_fromoffsets
- Added fromcounts and python_fromcounts
- Added fromparents and python_fromparents
- Added fromuniques and python_fromuniques
- Added copy
- Added deepcopy in JaggedArray and NumpyArray, pyarray_deepcopy in
util.h, and python_deepcopy
- Fixed issue where one of the negative tests didn't catch an exception
fast enough (because of the constructor validity check)
- Moved python-esque getitem(ssize_t) negative index handling to the
python-wrapping version of getitem(ssize_t)
duynht added a commit that referenced this pull request Jul 2, 2019
#156
- default constructors for Table and Table.Row
- added missing changes: version.py and tests
@EscottC
Collaborator Author

EscottC commented Jul 2, 2019

@jpivarski I've been working on implementing fromiter() so I can start copying over identical tests from test_numba to test_cpp, but I came across this problem:

>>> test = awkward.cpp.JaggedArray.fromiter([[[1, 4, 5, 4], [1, 5, 5]], [[], [3]], []])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The maximum of starts for non-empty elements must be less than the length of content

The above array looks like it should be a valid jagged array, but the specifications document does say that the maximum of starts must be less than the length of content.

Is that okay as it is, or should I change something to allow that sort of jagged array to be created?

@EscottC
Collaborator Author

EscottC commented Jul 3, 2019

I've got another problem:

Right now, I'm overriding the __getitem__() method on the python side to be able to unbox slices and accept None in slices. However, calling fromcounts() creates a new JaggedArray instead of the python-side JaggedArrayCpp that inherits from it. As a result, that object no longer overrides the __getitem__() function.

I would say my options are:

  1. Implement virtual functions through pybind11 so that __getitem__() calls go through the python side
  2. Figure out how to unpack a py::slice (if there even is a way) in C++, and move everything over to the C++ side

I feel like 2 would be much simpler, but I don't know if it's even possible, from what I've seen. The documentation for the py::slice class is in pybind11/pytypes.h.

Here is some code showing what the issue is:

>>> a = awkward.cpp.JaggedArray([0, 3, 4, 4], [3, 4, 4, 6], numpy.arange(8))
>>> a
<JaggedArray [[0 1 2] [3] [] [4 5]] at 0x71c0940>
>>> b = awkward.cpp.JaggedArray.fromcounts([3, 1, 0, 2], numpy.arange(8))
>>> b
<JaggedArray [[0 1 2] [3] [] [4 5]] at 0x71c0b68>
>>> a[0:2]
<JaggedArray [[0 1 2] [3]] at 0x71c0e38>
>>> b[0:2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __getitem__(): incompatible function arguments. The following argument types are supported:
    1. (self: awkward.cpp.array.array_impl.JaggedArray, arg0: int) -> object
    2. (self: awkward.cpp.array.array_impl.JaggedArray, arg0: int, arg1: int, arg2: int) -> object

Invoked with: <JaggedArray [[0 1 2] [3] [] [4 5]] at 0x71c0b68>, slice(0, 2, None)

Here, a is a JaggedArrayCpp (Python side), which intercepts __getitem__ before passing it to C++, and b is a JaggedArray (the C++ superclass).

@jpivarski
Member

@jpivarski I've been working on implementing fromiter() so I can start copying over identical tests from test_numba to test_cpp

Don't start on fromiter. It's a major project that will involve all of the awkward classes, and whether the fromiter implementation is in Python or C++ is independent of whether the awkward classes are in Python or C++. For example, using the existing fromiter, we can already populate custom classes:

>>> import awkward, awkward.numba
>>> awkward.fromiter([[[1, 4, 5, 4], [1, 5, 5]], [[], [3]], []], awkwardlib=awkward.numba)
<JaggedArrayNumba [[[1 4 5 4] [1 5 5]] [[] [3]] []] at 0x7f0bd26c6048>

The awkwardlib parameter determines which module is used as a source of JaggedArray, Table, etc. Without specifying anything, it's awkward.JaggedArray, awkward.Table, etc., but since I put awkwardlib=awkward.numba, it's awkward.numba.JaggedArray, awkward.numba.Table, etc. That's why we're putting all of the classes into awkward.cpp with generic names.

Most or all of the functions that aren't attached to an awkward array but produce awkward arrays have an awkwardlib parameter (so that arrays read from files and such can be any flavor the user needs), and so does uproot.

@jpivarski
Member

Oh, but your error is probably due to the fact that the last inner array is empty. Empty arrays can be represented by any start, stop pair in which start == stop, but the natural way (and the only way if you're coming from offsets) is for an empty inner array at the end to have stop[i - 1] == start[i] == stop[i] where i is the index of the last inner array. Thus, start[i] is equal to len(content), which you might be excluding because there's no data there.

It doesn't matter that there's no data there because the length of the inner array is zero.

So instead of assert start[i] < len(content), the rule should be assert start[i] == stop[i] or start[i] < len(content).
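
In code, the corrected rule might look like this minimal sketch (the function and argument names are illustrative, not the PR's actual ones):

#include <cstdint>
#include <stdexcept>

// Allow an empty inner array (starts[i] == stops[i]) to "start" at
// len(content); only non-empty inner arrays must start inside content.
void check_validity(const int64_t* starts, const int64_t* stops,
                    int64_t len_starts, int64_t len_content) {
    for (int64_t i = 0; i < len_starts; i++) {
        if (starts[i] != stops[i] && starts[i] >= len_content) {
            throw std::invalid_argument(
                "The maximum of starts for non-empty elements "
                "must be less than the length of content");
        }
    }
}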

@jpivarski
Member

  1. Implement virtual functions through pybind11 so that __getitem__() calls go through the python side
  2. Figure out how to unpack a py::slice (if there even is a way) in C++, and move everything over to the C++ side

I feel like 2 would be much simpler, but I don't know if it's even possible, from what I've seen. The documentation for the py::slice class is in pybind11/pytypes.h.

From those error messages, you need to have a clearer separation between the Python objects and the C++ objects, and they need to go through a boundary layer that wraps and unwraps the Pythonness.

So first off, the C++ function that does getitem and takes different signatures for different index types (e.g. number, slice, ...) should not be named __getitem__ because that's the Python name. You can make your life easier by having a clear distinction between the C++ names and the Python names, so that an error message tells you which side the problem is on.

The function (probably written on the C++ side, but not necessarily) that interprets Pythonic indexes (numbers, slices, arrays, tuples of all of the above) should take a generic py::object and unpack it manually using Python functions. (You can write using Python functions in C++, it's just a little more difficult.) Then, after you've unpacked it and you know whether it's an index or a slice or whatever, you can call the appropriate C++ method.
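
For example, a boundary function along these lines could do the unpacking. This is only a sketch: py::isinstance and cast are real pybind11 calls, but the typed getitem methods are assumed names:

#include <pybind11/pybind11.h>
#include <stdexcept>
namespace py = pybind11;

// Sketch: accept any Python index, unwrap its Pythonness, then dispatch
// to typed C++ methods (getitem_integer etc. are assumed names).
py::object python_getitem(JaggedArray& self, py::object where) {
    if (py::isinstance<py::int_>(where)) {
        return self.getitem_integer(where.cast<py::ssize_t>());
    }
    else if (py::isinstance<py::slice>(where)) {
        return self.getitem_slice(where.cast<py::slice>());
    }
    else if (py::isinstance<py::tuple>(where)) {
        return self.getitem_tuple(where.cast<py::tuple>());
    }
    throw std::invalid_argument("unrecognized index type");
}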

You might not want to write constructors (like fromcounts) on the C++ side. You've already written counts2offsets on the C++ side (the part that can benefit from C++ speed), maybe then just wrap the resulting starts and stops in a Python object on the Python side.

There are many ways to do this. The two you mentioned don't sound like the easiest. Here are others.

  1. Make the C++ side just a bucket of accelerated functions, which the Python side calls into when it needs a speedup, but most of the objects are defined on the Python side. (That's the approach I was thinking about before this project started.)
  2. Make a C++ JaggedArray class usable on the C++ side, but the Python side wraps all of its entrances and exits with a subclass. (That's the approach we've been trying to follow up to now. It would be nice to just assign to the array.__class__ to change the array's class as you would if it were a pure Python object, but I don't think you can do that to objects made in C++.)
  3. Make a C++ JaggedArray class usable on the C++ side, but the Python encloses it and wraps all of its entrances and exits with an unrelated class (neither superclass nor subclass). By "enclosing," I mean the Python JaggedArrayCpp has one member: the pybind11 jagged array. Every method and property that the JaggedArrayCpp must provide is implemented by calling the C++ object. In that case, the Python-side object should probably also do the Pythonic unpacking, like determining whether a __getitem__ index is a number, slice, etc. and call a specialized getitem (no underscores) C++ method for each. Slices, in particular, would get translated to a three argument method (start, stop, step separately) or maybe six arguments to have a place for boolean hasstart, hasstop, hasstep. Or maybe the Python side fills in a concrete start, stop, and step because it can see the length of the array.

Don't make things difficult on yourself by trying to make it all happen on the C++ side.

@EscottC
Collaborator Author

EscottC commented Jul 3, 2019

I figured out how to unpack slices on the C++ side, and it's conveniently designed: it yields the start, stop, step, and slice length based on the len() of the object it's applied to. Because of this, I got to delete a lot of unnecessary code, which really cleaned things up.
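
For reference, the mechanism is presumably pybind11's py::slice::compute. A minimal sketch of its use, assuming the enclosing object knows its length:

#include <pybind11/pybind11.h>
namespace py = pybind11;

// compute() fills in start/stop/step/slicelength for a sequence of the
// given length, applying Python's defaults for any omitted (None) values.
void unpack_slice(py::slice slice, size_t length) {
    size_t start, stop, step, slicelength;
    if (!slice.compute(length, &start, &stop, &step, &slicelength)) {
        throw py::error_already_set();
    }
    // start, stop, step, and slicelength are now concrete values
}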

@EscottC
Collaborator Author

EscottC commented Jul 3, 2019

I also figured out how to create default arguments in the binding code, so I carried the other python-side function over as well. Now everything is written in C++.

@EscottC
Collaborator Author

EscottC commented Jul 3, 2019

Just to make sure:

In getitem(int array), the code should not allow negative values in the array, right? I can handle negative indices like Python does if necessary, but from the numba tests, it looks like it's designed only for exact, non-negative indices.

@jpivarski
Member

You'll want to have at least one getitem interface that's extremely lightweight, with no bounds checking or negative index handling (and hopefully the compiler just inlines that), and you'll need to do these checks in another interface (to make sure the user doesn't do something dumb). Where you put them is up to you.
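
A sketch of that two-layer arrangement, with illustrative names:

#include <cstdint>
#include <stdexcept>

// Fast path: no checks at all, so the compiler can inline it.
inline int64_t getitem_unchecked(const int64_t* ptr, int64_t i) {
    return ptr[i];
}

// Boundary path: Pythonic negative indexing plus bounds checking.
int64_t getitem_checked(const int64_t* ptr, int64_t length, int64_t i) {
    if (i < 0) {
        i += length;  // handle negative indices like Python
    }
    if (i < 0 || i >= length) {
        throw std::out_of_range("index out of range");
    }
    return getitem_unchecked(ptr, i);
}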

@EscottC
Collaborator Author

EscottC commented Jul 3, 2019

Do you think it's worth it to make the C++ class inherit functions directly from awkward.array.jagged instead of having a python child do it? As shown here, it's possible to have a pybind11-bound C++ class inherit methods from a python class, and that would remove any need to have an in-between now that I have everything implemented on the C++ side.

@jpivarski
Member

That example you're pointing to is just invoking Python statements from C++. It's equivalent to just writing the Python, but it might save you from having two classes to carry around.

@EscottC
Collaborator Author

EscottC commented Jul 8, 2019

I've decided not to pursue the strategy shown in that other GitHub thread. The ideal solution for a class that implements awkward array in C++ would be to fully recreate every function of awkward array in the C++ JaggedArray class, and have the __init__.py file point to that directly rather than to the Python JaggedArrayCpp class, which inherits from both the C++ and Python implementations of jagged arrays.

While that would be nice to see, it won't be possible until far in the future, so until then I'll keep translating jagged array methods from awkward-array to awkward-cpp.

@EscottC
Collaborator Author

EscottC commented Jul 9, 2019

I've spent a long time trying to figure out (from test_numba) how the argument handling works for __getitem__(tuple), and I thought I had it until I got to the bool array tests, where I got lost.

Here's how I thought it worked: taking a tuple as an argument, the code identifies the type of element [0] and calls the corresponding function on the jagged array. For each element selected through that function, it calls the next index in the tuple, and this continues to the end of the tuple. The only time a returned array has a reduced number of dimensions is when the single-int getitem function is called.

Now I'm pretty sure that was wrong. I worked through that process on the examples in the bool array test and got very different results from the listed output, so I'm more or less back to square one.

How exactly should these functions be called? Is it recursive, or just case-by-case? The bool array one is especially stumping me:

def test_numba_getitem_tuple_slice_boolarray(self):
    a = numpy.arange(36).reshape(4, 3, 3)
    a2 = awkward.fromiter(a)
    @numba.njit
    def test1(x, i):
        return x[1:3, i]
    assert test1(a, numpy.array([True, False, True])).tolist() == [[[9, 10, 11], [15, 16, 17]], [[18, 19, 20], [24, 25, 26]]]
    assert test1(a2, numpy.array([True, False, True])).tolist() == [[[9, 10, 11], [15, 16, 17]], [[18, 19, 20], [24, 25, 26]]]
    @numba.njit
    def test2(x, i, j):
        return x[1:3, i, j]
    assert test2.py_func(a, numpy.array([True, False, True]), numpy.array([True, True, False])).tolist() == [[9, 16], [18, 25]]
    assert test2(a2, numpy.array([True, False, True]), numpy.array([True, True, False])).tolist() == [[9, 16], [18, 25]]

@EscottC
Collaborator Author

EscottC commented Jul 10, 2019

@jpivarski What do you think of my most recent commit?

I created a new C header file called CPU_methods.h which implements its own byte swap and native-endian handling, along with four new structs: C_array_8, C_array_16, C_array_32, and C_array_64. After that, I rewrote offsets2parents_int64 using only variables passed by reference.

In the C++ file, I made a method called numpy_to_c which converts a NumPy array to a C_array struct (for now just the 64-bit version), and then I replaced the looping content of offsets2parents with offsets2parents_int64.

I believe the C header file is compiled as C, because of the extern "C" { ... } block I put around the contents of the file. If this looks correct, I can start carrying more functions over to C tomorrow.
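
Roughly, the rewritten kernel looks like this (a simplified sketch, not the exact commit contents):

#include <cstdint>

extern "C" {
    // For each inner array i, label every content index in
    // [offsets[i], offsets[i + 1]) with parent i. Assumes offsets[0] == 0
    // and that parents has length offsets[offsetslen - 1].
    void offsets2parents_int64(const int64_t* offsets, int64_t offsetslen,
                               int64_t* parents) {
        int64_t j = 0;
        for (int64_t i = 0; i + 1 < offsetslen; i++) {
            for (; j < offsets[i + 1]; j++) {
                parents[j] = i;
            }
        }
    }
}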

duynht added a commit that referenced this pull request Jul 10, 2019
@jpivarski
Member

extern "C" doesn't mean "C compiled," it just means that the interface is C (FFI). We can take advantage of that, because it isn't necessary for the functions to be implemented in C, just that they present an interface that I can use in Numba.

For standard integer sizes, you should use the stdint.h header. It corrects C's mistake of having integer sizes be platform-dependent.

Also, there are built-in headers for byte swapping—I just don't know what they're called. In your project, it's more important to find these standards than to correctly write your own, because we'll be using standards in the long run.

@EscottC
Collaborator Author

EscottC commented Jul 10, 2019

The byte-swapping headers for C are all system-dependent (to make them more efficient), so I don't know if it's necessary to find something better than my utility functions.

@jpivarski
Member

What about this? https://linux.die.net/man/3/htonl

@EscottC
Collaborator Author

EscottC commented Jul 10, 2019

I think that's just for Linux systems. I looked up where to get that header, and the solution is that Microsoft has its own library for that purpose. <arpa/inet.h> is apparently included in an amd64 package.

@jpivarski
Member

Okay, fine. I guess this is simple enough to not need a header. I had just thought it would be in the same place on all systems (instead, we'd probably find ourselves putting in preprocessor directives like, "#if Windows, do this...").

https://stackoverflow.com/a/10346046
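
Following that answer, a portable 64-bit swap could be sketched like this (the MSVC and GCC/Clang intrinsics are real; the fallback is a plain loop):

#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_uint64
#endif

inline uint64_t bswap64(uint64_t x) {
#if defined(_MSC_VER)
    return _byteswap_uint64(x);       // MSVC intrinsic
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(x);      // GCC/Clang builtin
#else
    uint64_t out = 0;                 // portable fallback: reverse the bytes
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (x & 0xFF);
        x >>= 8;
    }
    return out;
#endif
}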

@EscottC
Collaborator Author

EscottC commented Jul 10, 2019

Just to make sure: I can design this to only use signed types, right?

@EscottC
Collaborator Author

EscottC commented Jul 11, 2019

I think I need to draw out a better structure to visualize how this is going to be set up, because what I had before got too messy and I made some mistakes with the polymorphism. Generally, I'm thinking the CPU methods should cover only the looping, computation-intensive sections of code, with everything else continuing to exist as it has in the C++.

In addition to what is in awkward-cpp right now, I would expect there to be a CPU_methods.h header of some sort that contains the C functions, and then another header, CPU_cpp.h or something, that wraps the C in C++ code, along the lines of the sketch below.
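
Here is a rough sketch of that wrapper layer, reusing the illustrative offsets2parents_int64 kernel from above (names are placeholders, not final code):

#include <pybind11/numpy.h>
#include "CPU_methods.h"  // the extern "C" kernels
namespace py = pybind11;

// Thin C++ wrapper: unpack the pybind11 array, allocate the output,
// and hand plain pointers to the C kernel.
py::array_t<int64_t> offsets2parents(py::array_t<int64_t> offsets) {
    py::buffer_info info = offsets.request();
    const int64_t* ptr = static_cast<const int64_t*>(info.ptr);
    int64_t parentslen = info.size > 0 ? ptr[info.size - 1] : 0;
    py::array_t<int64_t> parents(parentslen);
    offsets2parents_int64(ptr, static_cast<int64_t>(info.size),
                          static_cast<int64_t*>(parents.request().ptr));
    return parents;
}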

What I covered yesterday and today was good practice, but I was too ambitious and more or less broke my code, so I reverted and will start again from where I was on Monday.

@jpivarski
Member

We can focus on just signed types. It doesn't exclude much available range.

@EscottC
Collaborator Author

EscottC commented Jul 17, 2019

@jpivarski I'd like your opinion on the solution I found for this bug:

The issue was that I was taking pointers to the beginnings of the strides and shape vectors to use in the c_array struct, but nothing held a reference to those vectors, so the pointers would sometimes dangle and corrupt the data. I tried just creating vector variables and assigning the strides and shape to them, but that didn't work. My solution was to copy the values over to arrays and attach those arrays to the c_array struct instead:

struct c_array py2c(py::array input) {
    py::buffer_info info = input.request();

    if (info.ndim > 15) {
        throw std::invalid_argument("Array cannot exceed 15 dimensions");
    }

    // copy the format string and the shape/strides vectors into fixed-size
    // arrays so the struct no longer points into buffer_info's vectors
    char format[15];
    strcpy(format, info.format.c_str());
    ssize_t shape[15];
    std::copy(info.shape.begin(), info.shape.end(), shape);
    ssize_t strides[15];
    std::copy(info.strides.begin(), info.strides.end(), strides);

    struct c_array out = {
        info.ptr,
        info.itemsize,
        info.size,
        format,
        info.ndim,
        &shape[0],
        &strides[0]
    };
    return out;
}

I feel like the copying being done here is too much strain on processing time and memory. Can you think of a more efficient way to deal with this issue?

@EscottC
Collaborator Author

EscottC commented Jul 17, 2019

Also, is this PR good to merge? The tests that are failing are all in test_arrow.py. I don't know why they're failing, but my tests are all okay.

@jpivarski
Member

To fix the Arrow tests, do a git merge master. Arrow updated in a backward-incompatible way and I put corrections into master. Then you should get a clean bill of health.

I'll have to look at your Numpy fix later—it's directly relevant for data ownership, which is what you should be focusing on.

@EscottC EscottC merged commit 6067a98 into master Jul 17, 2019
@EscottC EscottC deleted the awk-cpp-jagged branch July 17, 2019 04:29
@jpivarski
Member

@EscottC

The issue was that I was taking pointers to the beginnings of the strides and shape vectors to use in the c_array struct, but nothing held a reference to those vectors, so the pointers would sometimes dangle and corrupt the data.
...
I feel like the copying being done here is too much strain on processing time and memory. Can you think of a more efficient way to deal with this issue?

Okay, I've read this comment and this is not a good solution. You're taking ownership of the data by copying it. Whether this is a performance problem for a specific application depends on the application, but some applications are going to want zero-copy adoption of memory, so that's something we want to learn how to do.

Python defines data ownership in terms of a reference count. In the pure C API, there are Py_INCREF and Py_DECREF macros to increase and decrease this integer, which extension developers have to manage manually. I've been told that one of the big advantages of Cython and pybind11 is that they do this in an automated or semi-automated way. It's certainly a mess when an extension developer doesn't handle the reference count correctly (segmentation faults in Python!)

When an object is created in Python, its reference count increments to 1. If that object appears in an expression whose result never gets assigned to a name (e.g. calling make_whatever() on its own, rather than myvar = make_whatever()), then it gets decremented to 0. If it gets assigned to several names, it gets several increments. When names get deleted with the Python del operator or they go out of scope because they were defined in a function, the count is decremented. Some of these names might be attributes of other objects, like myobj.attribute = make_whatever(). Deleting an object decrements all the objects that are attached to it.

When an object's reference count is 0, there is no way to access it. Such a value in C or C++ would be a memory leak, but Python keeps a global list of objects and occasionally checks their reference counts for 0 and deletes those (garbage collection). It's a little more complicated than that for cyclic references: A could have B as an attribute, which has A as an attribute, and so both A and B would have a reference count of 1 even if they're both inaccessible—Python has to do a more computationally intensive search for such cases, but it does them. (In fact, this is a performance tip for Python: avoid cyclic references to make memory management quicker!)

This tutorial describes Python reference counting and introduces the relevant terminology: "new reference," "borrowed reference," and "stolen reference." These are the English terms we use when the language doesn't track ownership for us, as Python does and as C++ shared_ptr does. There's probably a one-to-one relationship between Python's reference counting and C++ smart pointers that pybind11 does automatically, which is precisely what we want to learn. (Remember that in the new architecture, the pure C functions do the work, the C++ layer does object ownership, and the Python object ownership is one-to-one coupled to C++. Your focus is this C++/Python layer, so your project is all about learning how this works.)
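
For instance, in pybind11 a py::object behaves like a shared_ptr over a Python reference: copying it increments the count and destroying it decrements. Here is a small sketch using the embedded interpreter (py::handle::ref_count() is real pybind11 API; the rest is illustrative):

#include <pybind11/embed.h>
namespace py = pybind11;

int main() {
    py::scoped_interpreter guard{};        // start an embedded interpreter
    py::object a = py::eval("[1, 2, 3]");  // a "new reference"
    int before = a.ref_count();
    {
        py::object b = a;                  // copy: Py_INCREF under the hood
        // here a.ref_count() == before + 1
    }                                      // b destroyed: Py_DECREF
    // here a.ref_count() == before again
    return 0;
}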

Numpy adds another layer to this. A Numpy array is a Python object that points to raw bytes somewhere in memory. The Python object has a reference count, like any other Python object, and it also has an owndata flag to say whether it "owns" the raw bytes. When a Numpy object's reference count is 0 and it gets garbage-collected, it will delete its memory buffer if and only if it owns it.

Here's an example of two Numpy arrays: one owns its data, the other does not.

>>> import numpy
>>> a = numpy.arange(10)
>>> b = a[::2]
>>> a.flags.owndata
True
>>> b.flags.owndata
False
>>> a.base, b.base      # base is None or the array that an array is derived from
(None, array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
>>> b.base is a         # Python's "is" means they are the same object in memory
True
>>> id(b.base), id(a)   # Python's id(x) is a unique integer for each object
(138444522648560, 138444522648560)
>>> id(b.base) == id(a)
True
>>> bcopy = b.copy()
>>> bcopy.flags.owndata
True
>>> print(bcopy.base)
None

Another way that a Numpy array can fail to own its data is if we got it from somewhere else: some arbitrary bytes that Python generically calls a "buffer." A buffer is not a Python type but a type in the Python C API; objects that support the buffer protocol are anything with a pointer to raw bytes somewhere, including strings, bytestrings, and Numpy arrays. Python 3 has a similar concept called a memoryview—I don't know the differences between buffers and memoryviews, but they're pretty similar.

>>> x = b"got this from somewhere"
>>> y = numpy.frombuffer(x, dtype=numpy.uint8)
>>> y
array([103, 111, 116,  32, 116, 104, 105, 115,  32, 102, 114, 111, 109,
        32, 115, 111, 109, 101, 119, 104, 101, 114, 101], dtype=uint8)
>>> y.flags.owndata
False

We will frequently be viewing Apache Arrow buffers as arrays. This is something we want to do a lot, and we want to do it zero-copy. It's one of the selling points for awkward-array.

So in your implementation, you really should not copy the data but take ownership of it. The Numpyish way to do it is to make our objects in C++ be reference-counted Python objects, hopefully by some easy connection between shared_ptr and Python reference counts through pybind11, then to add a reference-counted attribute named base that points back to the original Numpy array. That way, even if the original Numpy array goes out of scope, we have a reference to it named base that will keep Python's garbage collection from erasing that array or the raw bytes it points to.
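
In pybind11 terms, one way to get this behavior is the following sketch. make_view is a hypothetical helper, but passing a base handle to the py::array_t constructor really does suppress the copy and set the result's .base:

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;

// Hypothetical helper: return a zero-copy view of another array's buffer.
// Because `owner` is passed as the last argument, pybind11 wraps the
// existing pointer instead of copying and sets the result's .base to
// `owner`, so the buffer stays alive as long as the view does.
py::array_t<double> make_view(py::array_t<double> owner) {
    py::buffer_info info = owner.request();
    return py::array_t<double>(
        info.shape,                      // same shape
        info.strides,                    // same strides (in bytes)
        static_cast<double*>(info.ptr),  // existing buffer, not copied
        owner);                          // base: holds a reference
}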

How exactly that works is up to you, but this is the requirement: own the data without copying it.
