The purpose of Awkward Array is to perform computations on complex data with an array programming idiom; it is not intended as a storage or interchange format to rival ROOT, Parquet, or Arrow. However, since the bulk data consist entirely of basic (Numpy) arrays, any storage format that can store a set of named arrays and the Awkward metadata can store one or more Awkward Arrays.
We define a protocol to store any Awkward Array structure in any named array container so that users have a way to stash and share intermediate results and, hopefully, to communicate between Awkward Array implementations. The definition consists of a JSON format with as little reference to Python specifics as possible, so that implementations of Awkward Array in other languages would be able to share data. This sharing could be through pointer locations in a process, shared memory on a single operating system, files on disk, or a distributed network.
This persistence protocol does not define a method for storing named arrays. We assume that an existing mechanism, such as ROOT, Parquet, Arrow, ZIP, HDF5, named files, or an object store, manages that. For complete generality, the “named arrays” do not need to keep array metadata (the way HDF5 files do), but can store only named binary blobs (the way ZIP files do). We only fill this container with array data, assign names to them, and additionally provide a JSON object that can be used to reverse the process.
The Awkward Array library in Python uses this serialization mechanism to pickle array objects, save them to and load them from ZIP files, and to use HDF5 as a read-write database of Awkward Arrays. ZIP files of Awkward Array objects are identified with the file extension .awkd.
The JSON object is a nested expression to be interpreted as a command, like a LISP S-expression. This provides the most flexibility for backward and forward compatibility. For security, deserialization takes a whitelist of allowed functions: maliciously constructed data cannot execute arbitrary functions, only the ones in the whitelist. If a new feature is added that requires the execution of previously unforeseen functions, users may need to manually expand the whitelist for forward compatibility (or they may completely open the whitelist if they trust the data source).
A single JSON object represents a single Awkward Array. Multiple Awkward Arrays may be saved in the same namespace if they each have a different prefix. In general, it would be the user’s responsibility to ensure that prefixes don’t overlap, but the Numpy-only implementation of Awkward Array checks for name collisions when writing to ZIP files and separates namespaces with Groups in HDF5 files. Prefixes can also be filesystem paths up to a directory delimiter — arrays would not overlap if they are saved to different directories.
The top level of the JSON format defines the following fields. If a field has a default, it is optional. If it isn’t labeled with a default or as optional, it is required.
- awkward0 (required, but for information purposes only): the version of the Awkward Array library that wrote this object. Only the existence of this field is checked as a format signature.
- doc (optional, for information purposes only): should be a string.
- metadata (optional, for information purposes only): any JSON structure.
- prefix (default is ""): the prefix at the beginning of the name of each binary blob.
- schema: the deserialization expression, which recursively defines the Awkward Array to build.
If additional fields are filled, the object is not considered invalid, though it risks conflicts with future versions of this specification.
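As an illustration of the top-level fields, the following minimal sketch builds such an object with Python's json module and checks the two required fields. The version string, the prefix, the blob names, and the helper check_toplevel are hypothetical, chosen only for this example. The schema is deliberately simplified and is not output captured from the library; a real schema would also describe how each blob becomes an array.

```python
import json

# Hypothetical top-level object; the version string, prefix, and blob
# names ("counts", "content") are made up for illustration.
document = {
    "awkward0": "0.0.0",
    "doc": "jagged array of measurements",
    "metadata": {"run": 1},
    "prefix": "example-",
    "schema": {
        "id": 0,
        "call": ["awkward0", "JaggedArray", "fromcounts"],
        "args": [
            {"id": 1, "read": "counts"},
            {"id": 2, "read": "content"},
        ],
    },
}

def check_toplevel(obj):
    """Check the required fields and fill in the documented default."""
    if "awkward0" not in obj:
        raise ValueError("missing format signature field 'awkward0'")
    if "schema" not in obj:
        raise ValueError("missing required field 'schema'")
    out = dict(obj)
    out.setdefault("prefix", "")   # the default prefix is the empty string
    return out

text = json.dumps(document)
roundtrip = check_toplevel(json.loads(text))
```

Because only the existence of the awkward0 field is checked, the sketch does not try to interpret the version string.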
A deserialization expression is a JSON object whose type is determined by the presence of an identifying key. If additional fields beyond the ones described below are filled, the object is not considered invalid, though it risks conflicts with future versions of this specification. However, a deserialization expression must not include more than one identifying key in the same JSON object.
Every deserialization expression has an "id" field, which is a non-negative integer. These identifiers are used by "ref" to build cross-references and cyclic references into the arrays within the object. These identifiers are not guaranteed to be in order; a reference to an unrecognized identifier can be deserialized as a VirtualArray to break a cyclic dependency.
Each of the sections below is labeled with the identifying key first.
The expression is a function to be called if its function specification matches a pattern in the whitelist.
- call: function specification, described below.
- args (default is []): list of deserialization expressions to use as positional arguments.
- kwargs (default is {}): JSON object of deserialization expressions to use as keyword, argument pairs.
- cacheable (default is false): if true, pass the cache as a cache keyword argument to the function.
- whitelistable (default is false): if true, pass the whitelist as a whitelist keyword argument to the function.
Functions are specified by a list of at least two strings. The first is a fully qualified module path and the rest are objects, or objects within objects, in that module that terminate in a callable.
For example, ["awkward0", "JaggedArray", "fromcounts"] is the fromcounts constructor of class JaggedArray in module awkward0.
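A function specification can be turned into a callable by importing the module and walking the remaining names as attributes. The following is a minimal sketch of that idea, not the library's implementation; the helper name resolve is hypothetical, and standard-library functions stand in for Awkward Array constructors. It performs no whitelist check, which a real deserializer must do before calling anything.

```python
import importlib

def resolve(spec):
    """Turn ["module", "obj", ...] into the callable it names."""
    if not isinstance(spec, (list, tuple)) or len(spec) < 2:
        raise ValueError("a function specification needs at least two strings")
    obj = importlib.import_module(spec[0])   # first item: module path
    for name in spec[1:]:                    # rest: objects within objects
        obj = getattr(obj, name)
    if not callable(obj):
        raise TypeError("specification does not terminate in a callable")
    return obj

# Standard-library stand-ins for an Awkward Array constructor:
decompress = resolve(["zlib", "decompress"])
fromkeys = resolve(["collections", "OrderedDict", "fromkeys"])
```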
The expression is a binary blob to read with a given name.
- read: string; the name of the binary blob to read.
- absolute (default is false): if true, the name is used as-is; if false, the global prefix must be prepended before the name.
The expression is a list of deserialization expressions.
The expression is a tuple of deserialization expressions.
The expression is a dict of name (string), deserialization expression pairs.
The expression is a list of name (string), deserialization expression pairs. It is not a JSON object so that the order is maintained.
The expression is a Numpy dtype, expressed as JSON. Numpy distinguishes between Python tuples and lists, so the following patterns must all be converted to tuples before passing the result to the numpy.dtype constructor:
- [..., integer]
- [..., [all integers]]
- [string, ...]
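The three patterns above can be applied recursively to a parsed JSON value. The following is a minimal sketch under stated assumptions, not the library's code: the helper names are hypothetical, booleans are not counted as integers, and empty lists are left alone because the text does not say how they should be treated. The result would then be handed to numpy.dtype, which is not imported here.

```python
def is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(x, int) and not isinstance(x, bool)

def lists_to_tuples(obj):
    """Convert the three list patterns to tuples, recursively."""
    if not isinstance(obj, list):
        return obj
    items = [lists_to_tuples(x) for x in obj]
    if len(obj) > 0 and (
        is_int(obj[-1])                                   # [..., integer]
        or (isinstance(obj[-1], list) and len(obj[-1]) > 0
            and all(is_int(x) for x in obj[-1]))          # [..., [all integers]]
        or isinstance(obj[0], str)                        # [string, ...]
    ):
        return tuple(items)
    return items

# A record dtype as it might appear after JSON parsing (illustrative).
spec = [["x", "<f8"], ["y", "<i4", [2, 3]]]
converted = lists_to_tuples(spec)
# converted == [("x", "<f8"), ("y", "<i4", (2, 3))]
```

The outer list stays a list because it matches none of the patterns, while each field description and the shape [2, 3] become tuples.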
The expression is a function object to pass as an argument, not to call. The function specification is the same as in the "call" type, though.
The function whitelist is globally defined in awkward0.persist.whitelist but it can also be passed into deserialization functions manually. The format is a list of function specifiers with glob-style wildcards. Function specifiers, as described in the "call" type, are a fully qualified module name followed by a path of objects within objects leading to a callable, like ["awkward0", "JaggedArray", "fromcounts"] for the fromcounts constructor of the JaggedArray class in the awkward0 module. Whitelist specifiers allow fnmatch wildcards, like ["awkward0", "*Array", "from*"] to allow any array type’s non-primary constructor. A single string is promoted to a specifier and a single specifier is promoted to a list of specifiers, so "*" by itself is a valid whitelist for allowing any function to run (for trusted data).
A function name that satisfies any wildcard expression is allowed.
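The matching rule can be sketched with the standard fnmatch module. This is an illustration, not the library's implementation: the helper names normalize and is_allowed are hypothetical, and the treatment of a pattern shorter than the specification (it matches as a prefix) is an assumption made so that "*" alone allows everything, as the text describes.

```python
import fnmatch

def normalize(whitelist):
    """Promote a string to a specifier and a specifier to a list."""
    if isinstance(whitelist, str):
        whitelist = [whitelist]                  # string -> one specifier
    if whitelist and all(isinstance(x, str) for x in whitelist):
        whitelist = [whitelist]                  # specifier -> list of them
    return [[p] if isinstance(p, str) else list(p) for p in whitelist]

def is_allowed(spec, whitelist):
    """True if spec satisfies any pattern in the whitelist."""
    for pattern in normalize(whitelist):
        # Assumption: a pattern shorter than the specification matches it
        # as a prefix, which is what lets "*" alone allow everything.
        if 0 < len(pattern) <= len(spec) and all(
            fnmatch.fnmatchcase(s, p) for s, p in zip(spec, pattern)
        ):
            return True
    return False

spec = ["awkward0", "JaggedArray", "fromcounts"]
ok = is_allowed(spec, ["awkward0", "*Array", "from*"])
```

An empty whitelist allows nothing in this sketch, and matching is case-sensitive because fnmatchcase is used.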
awkward0.persist.topython should not be in a default whitelist because unpickling untrusted data can call arbitrary Python functions.
An Awkward Array library’s default whitelist is not defined in this specification.
When serializing, users have the option to compress basic (Numpy) arrays. (When deserializing, whatever decompression functions are found are executed, if they are in the whitelist.) Compression has more value for some kinds of arrays than others, so the decision to compress or not compress is parameterized as a policy.
The default policy is globally defined in awkward0.persist.compression as a list of rules. Each rule has the following format:
- minsize: minimum size in bytes; if the basic array is smaller than this size, it is not compressed by this rule.
- types: list of item types; if the basic array’s item type is not a subclass of one of these types, it is not compressed by this rule.
- contexts: fnmatch wildcard string or list of such strings; if the basic array’s context (what parameter it belongs to in an Awkward Array) does not match any of these patterns, it is not compressed by this rule.
- pair: 2-tuple of compression function, decompression function specifier. The compression function is a Python callable, which turns a buffer into compressed bytes. The decompression function is a tuple of strings naming the module and object where the function may be found (as in the "call" type). Whereas the compression function is needed right away, the decompression function need only be specified so that it can be called during deserialization. The compression and decompression functions should be strict inverses of one another, with no parameters needed except the buffer to compress or decompress.
A single pair is promoted to a rule and a single rule is promoted to a list of rules, so it would be sufficient to pass compression=(zlib.compress, ("zlib", "decompress")) to a serialization function. If the compression pair is in the awkward0.persist.partner dict, only the compression function is needed: compression=zlib.compress.
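How a policy might be applied to one basic array can be sketched as follows. This is a hedged illustration, not the library's logic: the function choose_pair is hypothetical, built-in Python types stand in for Numpy item types, the context string "JaggedArray.counts" is invented, and taking the first rule that applies is an assumption, since the text does not say how multiple matching rules are resolved.

```python
import fnmatch
import zlib

# Hypothetical policy with one rule; built-in Python types stand in for
# the Numpy item types that a real policy would list.
policy = [{
    "minsize": 64,
    "types": [int, float],
    "contexts": "*",
    "pair": (zlib.compress, ("zlib", "decompress")),
}]

def choose_pair(nbytes, itemtype, context, policy):
    """Return the pair of the first rule that applies, else None."""
    for rule in policy:
        contexts = rule["contexts"]
        if isinstance(contexts, str):
            contexts = [contexts]            # one pattern -> list of patterns
        if (nbytes >= rule["minsize"]
                and issubclass(itemtype, tuple(rule["types"]))
                and any(fnmatch.fnmatchcase(context, c) for c in contexts)):
            return rule["pair"]
    return None

data = bytes(1000)   # 1000 zero bytes, standing in for an array buffer
pair = choose_pair(len(data), int, "JaggedArray.counts", policy)
compressed = pair[0](data) if pair is not None else data
```

Only the decompression specifier, pair[1], would need to be recorded in the schema; the compression callable is used immediately and then discarded.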
serialize(obj, storage, name=None, delimiter="-", suffix=None, schemasuffix=None, compression=compression, **kwargs) serializes obj (an Awkward Array) and puts all binary blobs into storage using names derived from a name prefix, separated by delimiter. Binary blobs optionally may have a suffix (such as ".raw") and the schema itself (also inserted into storage as a binary blob) may have a schemasuffix (such as ".json"). The compression option is described above. No return value.
deserialize(storage, name="", whitelist=whitelist, cache=None) returns an Awkward Array from storage using name as a prefix and an exact name for finding the schema. The whitelist option is described above. If a cache is passed, that cache is passed as an argument to every VirtualArray.
save(file, array, name=None, mode="a", **options)
Save an array (an Awkward Array) into a file specified as a name, a path, or as a file-like object. If it is a name that does not end in .awkd, this suffix is appended. The mode is passed to the zipfile.ZipFile object; "a" means append to an existing file or create a file. If there are name conflicts, an error is raised before writing anything to the file. No return value.
load(file, **options)
Open file as a read-only dict-like object containing Awkward Arrays. The arrays may be found by asking for the dict-like object’s keys and extracted with get-item.
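To make the ZIP layout concrete, the sketch below stores two prefixed sets of blobs, each with its own JSON object, in one in-memory ZIP file using only the standard library. It does not call save or load and is not their implementation: the helper stash, the version string, and the naming scheme (the JSON object under the exact name, blobs under name plus delimiter plus key) are assumptions for illustration.

```python
import io
import json
import zipfile

def stash(zf, name, blobs, schema, delimiter="-"):
    """Write blobs as name + delimiter + key and the JSON object as name."""
    prefix = name + delimiter
    wanted = [name] + [prefix + key for key in blobs]
    clash = set(wanted) & set(zf.namelist())
    if clash:
        # fail before writing anything, as described for save
        raise KeyError("names already in file: %s" % sorted(clash))
    doc = {"awkward0": "0.0.0", "prefix": prefix, "schema": schema}
    zf.writestr(name, json.dumps(doc))
    for key, data in blobs.items():
        zf.writestr(prefix + key, data)

buffer = io.BytesIO()   # stands in for a file on disk
with zipfile.ZipFile(buffer, mode="w") as zf:
    stash(zf, "first", {"data": b"\x01\x02"}, {"id": 0, "read": "data"})
    stash(zf, "second", {"data": b"\x03"}, {"id": 0, "read": "data"})

with zipfile.ZipFile(buffer, mode="r") as zf:
    names = sorted(zf.namelist())
    doc = json.loads(zf.read("first"))
    blob = zf.read(doc["prefix"] + doc["schema"]["read"])
```

Both arrays use the blob name "data" without colliding because each has a different prefix, which is the namespace separation described earlier.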
hdf5(group, **options)
Interpret an HDF5 file or group (from the h5py library) as containing Awkward Arrays, rather than arrays. Low-level binary blobs are hidden in favor of logical Awkward Arrays. This object can be written to or read from as a dict-like object with get-item, set-item, and del-item.