VirtualArray #57

Closed
jpivarski opened this issue Jan 14, 2020 · 1 comment · Fixed by #216

@jpivarski
Member

Can be passed through C++, but it's only functional in Python.

Have to think hard about how to do this in Numba. Materialize on entry or pass around PyObject pointer?

Might have to change (and rename) VirtualArray wiki page.

@jpivarski jpivarski self-assigned this Jan 14, 2020
@jpivarski jpivarski added the feature New feature or request label Jan 14, 2020
@jpivarski jpivarski added this to the Necessary for uproot milestone Jan 14, 2020
@jpivarski jpivarski changed the title PyVirtualArray VirtualArray Feb 21, 2020
@jpivarski jpivarski mentioned this issue Feb 21, 2020
@jpivarski
Member Author

Outcome of a long conversation with @nsmith-, in which we went through all the consequences of the following:

  • VirtualArrays need to take a callable Closure C++ instance as an argument. The Closure abstract class will be instantiated for Python as a PyClosure (with appropriate data-translation and reference counting) and as a SlicedClosure (see below). In principle, pure-C++ closures would be supported, but that's not the main intent or first implementation.
  • All materialized data would be stored in a Cache C++ instance, similarly special-cased as PyCache, which supports the MutableMapping interface in Python. No data would be attached to the VirtualArray itself, as this has been a cause of memory leaks in Awkward0. ("Leaking" in the sense that the user didn't know where the memory was being held. By centralizing everything in a cache, we can cap it or explicitly flush it.)
  • All operations on a VirtualArray (except one) cause it to be materialized, which means the output of the operation is not a VirtualArray, but a node corresponding to the computed output.
  • The one exception is single-level getitem (which is why SlicedArray #55 was closed; VirtualArray will take over the proposed operation of SlicedArray). If a single-level getitem is received, the SliceItem is put in a SlicedClosure to make a new VirtualArray. Repeated getitem calls build a chain of SlicedClosures; we don't want to compose them into a single SlicedClosure because (a) repeated filtering would be rare in a physics analysis (it complicates the analysis because each array of booleans would have to have a different length) and (b) such a composition would be independently applied at each VirtualArray in a RecordArray of many fields, duplicating effort and memory usage. It's better to let each element of the SlicedClosure chain point to the same shared_ptr<int64_t> for all fields in a RecordArray of VirtualArrays.

getitem_at and getitem(SliceArray64) with advanced are not single-level slices, but SliceRange, SliceField, SliceFields, etc. are. SliceNewAxis can be applied without materializing the VirtualArray, whereas SliceEllipsis requires materialization because we don't know how deep it goes. A slice is only single-level if its tail is empty.
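
The classification above could be sketched in Python roughly as follows. This is only an illustration: is_single_level is a hypothetical helper, the mapping from Python objects to SliceItem types is assumed, and it treats an array-like index on its own (empty tail, nothing advanced carried along) as single-level, which is one reading of the paragraph above.

```python
def is_single_level(where):
    """Hypothetical check: can this getitem argument be deferred?"""
    items = where if isinstance(where, tuple) else (where,)
    if len(items) != 1:
        return False          # non-empty tail (or empty tuple): not single-level
    head = items[0]
    if head is Ellipsis:
        return False          # SliceEllipsis: unknown depth, must materialize
    if isinstance(head, int):
        return False          # getitem_at: returns an element, not an array
    if head is None:
        return True           # SliceNewAxis: no materialization needed
    if isinstance(head, (slice, str)):
        return True           # SliceRange or SliceField
    if isinstance(head, list) and len(head) > 0:
        if all(isinstance(x, str) for x in head):
            return True       # SliceFields
        if all(isinstance(x, int) for x in head):
            return True       # lone integer/boolean array index (assumption)
    return False


print(is_single_level(slice(0, 10)))       # a range
print(is_single_level((slice(None), 0)))   # a range followed by a tail
```
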

  • When a VirtualArray makes a new VirtualArray with a new SlicedClosure, it is not assigned any cache. If the sliced VirtualArray is repeatedly accessed, it will always reevaluate the slice (though the unsliced data may be fetched from its cache, depending on eviction policies).
  • Attempts to cache sliced VirtualArrays should be a separate application, which is why it's important for SlicedClosures to be inspectable externally, in Python. Each of the SliceItem types will have to be convertible back into its Python equivalent.
  • Persistent keys should be parameters, maybe named "__persistent_key__".
  • VirtualArrays in Numba can either materialize upon entry into the JIT-compiled function (bad for naive users, since that might make "for event in events" over a lazy events array super-slow) or they have to contain a full layout description (all nodes but no lengths). That can wait.
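
To make the design above concrete, here is a minimal pure-Python sketch, not the planned C++ implementation. The class names follow the discussion, but every method and attribute (materialize, item, parameters) is an assumption for illustration, a plain dict stands in for the cache, only Python slice objects are treated as single-level, and data is cached only when a "__persistent_key__" parameter is present.

```python
class Closure:
    """Abstract callable that produces array data on demand."""
    def __call__(self):
        raise NotImplementedError


class PyClosure(Closure):
    """Wraps an ordinary Python callable."""
    def __init__(self, function):
        self.function = function

    def __call__(self):
        return self.function()


class SlicedClosure(Closure):
    """Defers one single-level slice of a parent VirtualArray."""
    def __init__(self, parent, item):
        self.parent = parent
        self.item = item  # kept public so external code can inspect it

    def __call__(self):
        return self.parent.materialize()[self.item]


class VirtualArray:
    def __init__(self, closure, cache=None, parameters=None):
        self.closure = closure
        self.cache = cache  # any MutableMapping; no data lives on the array
        self.parameters = parameters or {}

    def materialize(self):
        key = self.parameters.get("__persistent_key__")
        use_cache = self.cache is not None and key is not None
        if use_cache and key in self.cache:
            return self.cache[key]
        data = self.closure()
        if use_cache:
            self.cache[key] = data
        return data

    def __getitem__(self, where):
        if isinstance(where, slice):
            # Single-level getitem: stay virtual, and assign no cache.
            return VirtualArray(SlicedClosure(self, where))
        # Everything else materializes and returns a concrete result.
        return self.materialize()[where]


calls = []

def load():
    calls.append("load")
    return list(range(10))

cache = {}
events = VirtualArray(PyClosure(load), cache, {"__persistent_key__": "events"})
sliced = events[2:5]  # still a VirtualArray; load() has not run yet
```

In this sketch, calling sliced.materialize() twice reapplies the slice each time, while the unsliced data is fetched from the cache after the first load. Because all materialized data sits in the one mapping, clearing or capping that mapping is enough to release it.
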
