Commit

Replace to_arrayset/from_arrayset with to_buffers/from_buffers and deprecate the original. (#592)

* Added ak.to_buffers with a new interface; old ak.to_arrayset uses it.

* Black and Flake8.

* Works again, without modifying to/from_arrayset.

* All the deprecation messages are in place.

* Pickle uses the new to_buffers/from_buffers, but old pickle files can still be read.

* Eliminated all warnings, references to 'arrayset' from the tests.

* Updated the documentation, too.

* Last touches: length sanity-checks at all levels.
jpivarski authored Dec 11, 2020
1 parent 518232b commit 5fe31a9
Showing 12 changed files with 1,028 additions and 559 deletions.
4 changes: 2 additions & 2 deletions docs-src/_toc.yml
@@ -20,8 +20,8 @@
title: "Arrow and Parquet"
- file: how-to-convert-pandas
title: "Pandas"
- file: how-to-convert-arrayset
title: "Generic array-sets"
- file: how-to-convert-buffers
title: "Generic buffers"

- file: how-to-create
title: "Creating new arrays"
@@ -11,10 +11,10 @@ kernelspec:
name: python3
---

Generic array-sets
==================
Generic buffers
===============

Most of the conversion functions target a particular library: NumPy, Arrow, Pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from "array-sets," sets of named arrays with a schema that can be used to reconstruct the original array. This section will demonstrate how an array-set can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn't be able to represent nested, irregular data structures.
Most of the conversion functions target a particular library: NumPy, Arrow, Pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from sets of named buffers. The buffers are not (usually) intelligible on their own; the length of the array and a JSON document are needed to reconstitute the original structure. This section will demonstrate how an array-set can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn't be able to represent nested, irregular data structures.

```{code-cell} ipython3
import awkward as ak
@@ -23,8 +23,8 @@ import h5py
import json
```

From Awkward to an array-set
----------------------------
From Awkward to buffers
-----------------------

Consider the following complex array:

@@ -37,18 +37,17 @@ ak_array = ak.Array([
ak_array
```

The [ak.to_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_arrayset.html) function decomposes it into a set of one-dimensional arrays (a zero-copy operation).
The [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html) function decomposes it into a set of one-dimensional arrays (a zero-copy operation).

```{code-cell} ipython3
form, container, num_partitions = ak.to_arrayset(ak_array)
form, length, container = ak.to_buffers(ak_array)
```

The pieces needed to reconstitute this array are:

* the [Form](https://awkward-array.readthedocs.io/en/latest/ak.forms.Form.html), which defines how structure is built from one-dimensional arrays,
* the one-dimensional arrays in the `container` (a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes)),
* the number of partitions, if any,
* the length of the original array or lengths of all partitions ([ak.partitions](https://awkward-array.readthedocs.io/en/latest/_auto/ak.partitions.html)) are needed if we wish to read it back _lazily_ (more on that below).
* the length of the original array or lengths of all of its partitions ([ak.partitions](https://awkward-array.readthedocs.io/en/latest/_auto/ak.partitions.html)),
* the one-dimensional arrays in the `container` (a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes)).

The [Form](https://awkward-array.readthedocs.io/en/latest/ak.forms.Form.html) is like an Awkward [Type](https://awkward-array.readthedocs.io/en/latest/ak.types.Type.html) in that it describes how the data are structured, but with more detail: it includes distinctions such as the difference between [ListArray](https://awkward-array.readthedocs.io/en/latest/ak.layout.ListArray.html) and [ListOffsetArray](https://awkward-array.readthedocs.io/en/latest/ak.layout.ListOffsetArray.html), as well as the integer types of structural [Indexes](https://awkward-array.readthedocs.io/en/latest/ak.layout.Index.html).

@@ -58,48 +57,42 @@ It is usually presented as JSON, and has a compact JSON format (when [Form.tojso
form
```

This `container` is a new dict, but it could have been a user-specified [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes).
In this case, the `length` is just an integer. It would be a list of integers if `ak_array` was partitioned.

```{code-cell} ipython3
container
length
```

This array has no partitions.
This `container` is a new dict, but it could have been a user-specified [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes) if passed into [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html) as an argument.

```{code-cell} ipython3
num_partitions is None
```

This is also what we find from [ak.partitions](https://awkward-array.readthedocs.io/en/latest/_auto/ak.partitions.html).

```{code-cell} ipython3
ak.partitions(ak_array) is None
container
```

From array-set to Awkward
-------------------------
From buffers to Awkward
-----------------------

The function that reverses [ak.to_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_arrayset.html) is [ak.from_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_arrayset.html). Its first three arguments are `form`, `container`, and `num_partitions`.
The function that reverses [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html) is [ak.from_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_buffers.html). Its first three arguments are `form`, `length`, and `container`.

```{code-cell} ipython3
ak.from_arrayset(form, container, num_partitions)
ak.from_buffers(form, length, container)
```
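To make the roles of `form`, `length`, and `container` concrete, here is a minimal pure-Python sketch of the same idea for a single level of nesting. It is not Awkward Array's implementation: the helper names `to_simple_buffers` and `from_simple_buffers`, the toy form dict, and the buffer keys are all assumptions chosen for illustration.

```python
# Toy illustration of "form + length + named buffers".
# NOT Awkward Array's implementation; helper names and keys are hypothetical.

def to_simple_buffers(lists):
    """Flatten a list of lists into an offsets buffer and a data buffer."""
    offsets = [0]
    data = []
    for sublist in lists:
        data.extend(sublist)
        offsets.append(len(data))  # running end position of each sublist
    form = {"class": "ListOffsetArray", "content": "NumpyArray"}  # toy schema
    container = {"node0-offsets": offsets, "node1-data": data}
    return form, len(lists), container


def from_simple_buffers(form, length, container):
    """Rebuild the list of lists from the three pieces."""
    offsets = container["node0-offsets"]
    data = container["node1-data"]
    # Length sanity check: offsets needs one more entry than the array length.
    if len(offsets) != length + 1:
        raise ValueError("offsets buffer does not match the stated length")
    return [data[offsets[i]:offsets[i + 1]] for i in range(length)]


original = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
form, length, container = to_simple_buffers(original)
roundtrip = from_simple_buffers(form, length, container)
```

Neither buffer is intelligible alone: the flat data says nothing about where sublists begin, and the offsets say nothing about the values, which is why the form and the length travel with the container.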

Saving Awkward Arrays to HDF5
-----------------------------

The [h5py](https://www.h5py.org/) library presents each group in an HDF5 file as a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes), which we can use as a container for an array-set. We must also save the `form`, `num_partitions`, and `length` as metadata for the array to be retrievable.
The [h5py](https://www.h5py.org/) library presents each group in an HDF5 file as a [MutableMapping](https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes), which we can use as a container for an array-set. We must also save the `form` and `length` as metadata for the array to be retrievable.

```{code-cell} ipython3
file = h5py.File("/tmp/example.hdf5", "w")
group = file.create_group("awkward")
group
```

We can fill this `group` as a `container` by passing it in to [ak.to_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_arrayset.html).
We can fill this `group` as a `container` by passing it in to [ak.to_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_buffers.html).

```{code-cell} ipython3
form, container, num_partitions = ak.to_arrayset(ak_array, container=group)
form, length, container = ak.to_buffers(ak_array, container=group)
```

```{code-cell} ipython3
@@ -115,7 +108,7 @@ container.keys()
Here's one.

```{code-cell} ipython3
np.asarray(container["node0-offsets"])
np.asarray(container["part0-node0-offsets"])
```

Now we need to add the other information to the group as metadata. Since HDF5 accepts string-valued metadata, we can put it all in as JSON or numbers.
@@ -126,38 +119,27 @@ group.attrs["form"]
```

```{code-cell} ipython3
group.attrs["num_partitions"] = json.dumps(num_partitions)
group.attrs["num_partitions"]
```

```{code-cell} ipython3
group.attrs["partition_lengths"] = json.dumps(ak.partitions(ak_array))
group.attrs["partition_lengths"]
```

```{code-cell} ipython3
group.attrs["length"] = len(ak_array)
group.attrs["length"] = json.dumps(length) # JSON-encode it because it might be a list
group.attrs["length"]
```
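The reason for JSON-encoding `length` can be seen without HDF5 at all. The sketch below uses only the standard library; `encode_length` and `decode_length` are hypothetical helper names, and the values are made up. It shows that one string-valued attribute can carry either an integer (unpartitioned) or a list of integers (partitioned).

```python
import json


def encode_length(length):
    """Serialize an int or a list of ints to a single string."""
    return json.dumps(length)


def decode_length(text):
    """Recover the int or list of ints from its string form."""
    return json.loads(text)


# Unpartitioned array: the length is a single integer.
single = decode_length(encode_length(3))

# Partitioned array: the length is a list with one entry per partition.
partitioned = decode_length(encode_length([2, 1]))
```

On reading, the decoded type tells the two cases apart, so no separate partition count has to be stored.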

Reading Awkward Arrays from HDF5
--------------------------------

With that, we can reconstitute the array by supplying [ak.from_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_arrayset.html) the right arguments from the group and metadata.
With that, we can reconstitute the array by supplying [ak.from_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_buffers.html) the right arguments from the group and metadata.

The group can't be used as a `container` as-is, since subscripting it returns `h5py.Dataset` objects, rather than arrays.

```{code-cell} ipython3
reconstituted = ak.from_arrayset(
reconstituted = ak.from_buffers(
ak.forms.Form.fromjson(group.attrs["form"]),
json.loads(group.attrs["length"]),
{k: np.asarray(v) for k, v in group.items()},
)
reconstituted
```

Like [ak.from_parquet](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_parquet.html), [ak.from_arrayset](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_arrayset.html) has the option to read lazily, only accessing record fields and partitions that are accessed.

To do so, we need to pass `lazy=True`, but also the total length of the array (if not partitioned) or the lengths of all the partitions (if partitioned).
Like [ak.from_parquet](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_parquet.html), [ak.from_buffers](https://awkward-array.readthedocs.io/en/latest/_auto/ak.from_buffers.html) has the option to read lazily, only accessing record fields and partitions that are accessed.

```{code-cell} ipython3
class LazyGet:
@@ -168,11 +150,11 @@ class LazyGet:
print(key)
return np.asarray(self.group[key])
lazy = ak.from_arrayset(
lazy = ak.from_buffers(
ak.forms.Form.fromjson(group.attrs["form"]),
json.loads(group.attrs["length"]),
LazyGet(group),
lazy=True,
lazy_lengths = group.attrs["length"],
)
```
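The lazy-reading pattern relies on a container that fetches a buffer only when a key is subscripted. Because the `LazyGet` class is only partly visible in this diff, the following is a stand-alone sketch of the same idea with a plain dict as the backing store; `RecordingGet` and the buffer keys are assumptions, not code from the commit.

```python
from collections.abc import Mapping


class RecordingGet(Mapping):
    """Read-only mapping that records which keys are actually fetched."""

    def __init__(self, backing):
        self.backing = backing  # stands in for an h5py group
        self.accessed = []      # keys requested so far, in order

    def __getitem__(self, key):
        self.accessed.append(key)
        return self.backing[key]

    def __iter__(self):
        return iter(self.backing)

    def __len__(self):
        return len(self.backing)


store = {"node0-offsets": [0, 3, 3, 5], "node1-data": [1.1, 2.2, 3.3, 4.4, 5.5]}
lazy_container = RecordingGet(store)

# Only the offsets buffer is requested; the data buffer is never touched.
offsets = lazy_container["node0-offsets"]
```

Inspecting `lazy_container.accessed` afterwards plays the same role as the `print(key)` call in the documentation's example: it shows which buffers a given operation needed.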

2 changes: 1 addition & 1 deletion docs-src/how-to-convert.md
@@ -20,4 +20,4 @@ Converting arrays
* **[ROOT via Uproot](how-to-convert-uproot)**
* **[Arrow and Parquet](how-to-convert-arrow)**
* **[Pandas](how-to-convert-pandas)**
* **[Generic array-sets](how-to-convert-arrayset)**
* **[Generic array-sets](how-to-convert-buffers)**
32 changes: 31 additions & 1 deletion src/awkward/_util.py
@@ -761,7 +761,9 @@ def apply(inputs, depth, user):
outcontent = apply(nextinputs, depth + 1, user)
assert isinstance(outcontent, tuple)

return tuple(ak.layout.RegularArray(x, maxsize, maxlen) for x in outcontent)
return tuple(
ak.layout.RegularArray(x, maxsize, maxlen) for x in outcontent
)

elif not all_same_offsets(nplike, inputs):
fcns = [
@@ -1695,3 +1697,31 @@ def union_to_record(unionarray, anonymous):
)

return ak.layout.RecordArray(all_fields, all_names, len(unionarray))


def adjust_old_pickle(form, container, num_partitions, behavior):
def key_format(**v):
if num_partitions is None:
if v["attribute"] == "data":
return "{form_key}".format(**v)
else:
return "{form_key}-{attribute}".format(**v)

else:
if v["attribute"] == "data":
return "{form_key}-part{partition}".format(**v)
else:
return "{form_key}-{attribute}-part{partition}".format(**v)

return ak.operations.convert.from_buffers(
form,
None,
container,
partition_start=0,
key_format=key_format,
lazy=False,
lazy_cache="new",
lazy_cache_key=None,
highlevel=False,
behavior=behavior,
)
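The `key_format` closure above reproduces the buffer names used by old pickles. As a quick way to see which keys it yields, the sketch below copies its branching into a stand-alone factory; `make_old_key_format` is a hypothetical wrapper and the `node0`/`node1` form keys are example values.

```python
def make_old_key_format(num_partitions):
    """Return a key_format with the same branching as adjust_old_pickle."""

    def key_format(**v):
        if num_partitions is None:
            # Unpartitioned: data buffers are named by the form key alone.
            if v["attribute"] == "data":
                return "{form_key}".format(**v)
            return "{form_key}-{attribute}".format(**v)
        # Partitioned: the partition number is appended as a suffix.
        if v["attribute"] == "data":
            return "{form_key}-part{partition}".format(**v)
        return "{form_key}-{attribute}-part{partition}".format(**v)

    return key_format


unpartitioned = make_old_key_format(None)
partitioned = make_old_key_format(2)

keys = [
    unpartitioned(form_key="node0", attribute="offsets", partition=0),
    unpartitioned(form_key="node1", attribute="data", partition=0),
    partitioned(form_key="node0", attribute="offsets", partition=1),
    partitioned(form_key="node1", attribute="data", partition=1),
]
# keys == ["node0-offsets", "node1", "node0-offsets-part1", "node1-part1"]
```

These old names differ from the `part0-node0-offsets` style shown in the updated documentation, which is why old pickles need a custom `key_format` when they are read through `from_buffers`.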
34 changes: 25 additions & 9 deletions src/awkward/highlevel.py
@@ -1386,16 +1386,24 @@ def numba_type(self):
return numba.typeof(self._numbaview)

def __getstate__(self):
form, container, num_partitions = ak.to_arrayset(self)
form, length, container = ak.operations.convert.to_buffers(self._layout)
if self._behavior is ak.behavior:
behavior = None
else:
behavior = self._behavior
return form, container, num_partitions, behavior
return form, length, container, behavior

def __setstate__(self, state):
form, container, num_partitions, behavior = state
layout = ak.from_arrayset(form, container, num_partitions, highlevel=False)
if isinstance(state[1], dict):
form, container, num_partitions, behavior = state
layout = ak._util.adjust_old_pickle(
form, container, num_partitions, behavior
)
else:
form, length, container, behavior = state
layout = ak.operations.convert.from_buffers(
form, length, container, highlevel=False, behavior=behavior
)
if self.__class__ is Array:
self.__class__ = ak._util.arrayclass(layout, behavior)
self.layout = layout
@@ -1975,17 +1983,25 @@ def numba_type(self):
return numba.typeof(self._numbaview)

def __getstate__(self):
form, container, num_partitions = ak.to_arrayset(self._layout.array)
form, length, container = ak.operations.convert.to_buffers(self._layout.array)
if self._behavior is ak.behavior:
behavior = None
else:
behavior = self._behavior
return form, container, num_partitions, behavior, self._layout.at
return form, length, container, behavior, self._layout.at

def __setstate__(self, state):
form, container, num_partitions, behavior, at = state
array = ak.from_arrayset(form, container, num_partitions, highlevel=False)
layout = ak.layout.Record(array, at)
if isinstance(state[1], dict):
form, container, num_partitions, behavior, at = state
layout = ak._util.adjust_old_pickle(
form, container, num_partitions, behavior
)
else:
form, length, container, behavior, at = state
layout = ak.operations.convert.from_buffers(
form, length, container, highlevel=False, behavior=behavior
)
layout = ak.layout.Record(layout, at)
if self.__class__ is Record:
self.__class__ = ak._util.recordclass(layout, behavior)
self.layout = layout
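Both `__setstate__` methods tell old pickles from new ones by checking whether the second element of the state tuple is a dict. The sketch below isolates that dispatch in plain Python; `classify_state` and the example tuples are assumptions for illustration and do not call any Awkward Array code.

```python
def classify_state(state):
    """Label a pickle state tuple as old or new layout and name its fields."""
    if isinstance(state[1], dict):
        # Old layout: (form, container, num_partitions, behavior, ...)
        form, container, num_partitions, behavior = state[:4]
        return "old", {"form": form, "container": container,
                       "num_partitions": num_partitions, "behavior": behavior}
    # New layout: (form, length, container, behavior, ...)
    form, length, container, behavior = state[:4]
    return "new", {"form": form, "length": length,
                   "container": container, "behavior": behavior}


old_state = ("form-json", {"node0": [1, 2]}, None, None)
new_state = ("form-json", 2, {"part0-node0-data": [1, 2]}, None)

old_kind, old_fields = classify_state(old_state)
new_kind, new_fields = classify_state(new_state)
```

The check works because the new second element is a length, either an integer or a list of integers, and is never a dict, so the two layouts cannot be confused.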