On data model that the Wire-Cell Toolkit (WCT) supports is the tensor data model. This model is factored into two layers:
- The generic tensor data model is transiently represented as a
concrete implementation of
ITensorSet
andITensor
and persisted according to this document. - The specific tensor data model defines conventions on the generic tensor data model in order to map certain other data types to the generic tensor data model.
The generic tensor data model maps between the transient ITensorSet
and ITensor
model and a persistent one. The elements of the tensor
data model are:
- A tensor set is an aggregation of zero or more tensors and a optional metadata object.
- A tensor is the combination of an array and a metadata object.
- A metadata object associates attributes in a structure that follows the JSON data model.
- An array is a contiguous block of memory holding numeric data that is stored with associated shape, layout order and array element type and size. An array may be empty or null.
Two persistent variants are supported:
- serial
- data is sent through WCT iostreams to/from Zip or Tar archive files.
- hierarchical
- data is read/written in random access manner to/from HDF5 files.
In both persistent forms a hiearchy is expressed. In the serial form,
each ITensorSet
or ITensor
representation carries its location in the
hierarchy by a path-like stream in the reserved metadata object
attribute datapath. While, in HDF5 the datapath is inherently stored
as an first class HDF5 attribute of an HDF5 dataset and is not stored
in the HDF5 dataset metadata.
In the serial form, aggregates of tensors are collected by referencing their datapath values in a metadata object attribute. For example, a tensor set metadata object attribute tensors holds an array of datapath to its tensors (see below). Other aggregations are described in the specific tensor data model below. The hierarchy form may utilize HDF5 object or region references.
For I/O optimization, the serial variant will also associate tensor set with its tensors using a file naming and ordering convention as illustrated:
tensorset_0_metadata # the tensor set ident=0 metadata object tensor_0_0_metadata # the first tensor metadata object (no array) tensor_0_1_metadata # the second tensor metadata object tensor_0_1_array # the second tensor array object tensor_0_2_metadata # etc.... tensor_0_2_array tensor_0_3_metadata tensor_0_3_array tensor_0_4_metadata tensor_0_4_array
The specific tensor data model maps meaning to one or more tensors in terms of other WCT data types by defining a number of conventions on top of the generic tensor data model.
Every tensor has an associated datatype attribute of value string. Only values listed may be used:
- pcarray
- a
PointCloud::Array
(pointclouds/<ident>/arrays/<name>
) - pcdataset
- a
PointCloud::Dataset
(pointclouds/<ident>
) - pcgraph
- a
PointGraph
(pointgraphs/<ident>
withpointgraphs/<ident>/{nodes,edges}
) - trace
- one
ITrace
as 1D array or multipleITrace
as 2D array. (frames/<ident>/traces/<number>
) - tracedata
- tagged trace indices and summary data. (
frames/<ident>/tracedata
) - frame
- an
IFrame
as aggregate of traces and/or traceblocks. (frames/<ident>
) - cluster
- an
ICluster
(clusters/<ident>
) - clnodeset
- an array of attributes for set of monotypical
ICluster
graph nodes. - cledgeset
- an array describing a set of
ICluster
graph edges between all nodes of one type to all nodes of another.
Where pertinent, the recommended datapath root path for the type is given in parenthesis.
The tensor set has a datatype of tensorset and is merely a generic container of tensors produced in some context (eg an “event”). The tensorset shall have an attribute tensors of value array of string that provides the datapath of all tensors in the set.
A tensor type may be an aggregate. This is a tensor that represents other tensors by collecting their datapath into an metadata attribute.
The remaining sections describe additional requirements specific to for each datatype.
The datatype of pcarray indicates a tensor representing one
PointCloud::Array
.
The tensor array information shall map directly to that of Array
.
A pcarray places no additional requirements on its tensor MD.
The datatype of pcdataset indicates a tensor representing on
PointCloud::Dataset
.
The tensor array shall be empty.
The tensor MD shall have the following attributes:
- arrays
- an object representing the named arrays. Each attribute name provides the array name and each attribute value provides a datapath to a tensor of type pcarray holding the named array.
Additional user application Dataset
metadata may reside in the tensor
MD.
The datatype of pcgraph indicates a tensor representing a “point cloud graph”. This extends a point cloud to include relationships between pairs of points. The array part of a pcgraph tensor shall be empty. The MD part of a pcgraph tensor shall provide reference to two pcdataset instances with the following MD attributes:
- nodes
- a datapath refering to a pcdataset representing graph vertex features.
- edges
- a datapath refering to a pcdataset representing graph edges and their features.
In addition, the pcdataset referred to by the edges attribute shall provide two arrays of integer type with names tails and heads. Each shall provide indices into the nodes point cloud representing the tail and head endpoint of graph edges. A node or edge dataset may be shared between different pcgraph instances.
The datatype of trace indicates a tensor representing a single ITrace
or a collection of ITrace
which have been combined.
The tensor array shall represent the samples over a contiguous period of time from traces.
The tensor array shall have dimensionality of one when representing a
single ITrace
. A collection of ITrace
shall be represented with a
two-dimensional array with each row representing one or more traces
from a common channel. In such a case, the full trace content
associated with a given channel may be represented by one or more
rows.
The array element type shall be either "i2"
(int16_t
) or "f4"
(float
)
depending on if ADC or signals are represented, respectively.
The tensor MD may include the attribute tbin with integer value and providing the number of sample periods (ticks) between the frame reference time and the first sample (column) in the array.
The datatype of tracedata provides per-trace information for a subset of. It is similar to a pcdataset and in fact may carry that value as the datatype but it requires the following differences.
It defines additional MD attributes:
- tag
- optional, a trace tag. If omitted or empty string, dataset must span total trace ordering.
The following array names are recognized:
- chid
- channel ident numbers for the traces.
- index
- provides indices into the total trace ordering.
- summary
- trace summary values.
A chid value is require for every trace. If the tracedata has no tag then a chid array spanning the total trace ordering must be provided and neither index nor summary is recognized. If the tracedata has a tag it must provide an index array and may provide a summary array and may provide a chid array each corresponding to the traces identified by index.
The datatype of frame represents an IFrame
.
The tensor array shall be empty.
The tensor MD aggregates tensors of datatype trace and tracedata and provides other values as listed;
- ident
- the frame ident number (required)
- tags
- an array of string giving frame tags
- time
- the reference time of the frame (required)
- tick
- the sample period of the traces (required)
- masks
- channel mask map (optional)
- traces
- a sequence of datapath references to tensors of datatype trace. The order of this sequence, along with the order of rows in any 2D trace tensors determines the total order of traces.
- tracedata
- a sequence of datapath references to tensors of datatype tracedata
In converting an IFrame
to a frame tensor the sample values may be
truncated to type "i2"
.
A frame tensor of type "i2"
shall have its sample values inflated to
type float
when converted to an IFrame
.
The datatype of cluster indicates a tensor representing one ICluster
.
The tensor array shall be empty.
The tensor MD shall have the following attributes:
- ident
- the
ICluster::ident()
value. - nodes
- an object with attributes of cluster array schema node type code and values of a datapath of a clnodeset. The node type code is in single-letter string form, not ASCII char value.
- edges
- an object with attributes of cluster array schema edge type code and values of a datapath of a cledgeset. The edge type code is in double-letter string form, not packed short integer.
The cluster tensor MD holds all references required to assemble the nodes and edges into an ICluster
. The nodes and edges tensors hold no identifiers and require the cluster tensor to provide context.
The datatype of clnodeset indicates a tensor representing one type of node array in cluster array schema. The array is of type f8~~ and is 2D with each row representing one node and columns representing node attributes. The tensor MD may be empty.
The datatype of cledgeset indicates a tensor representing an edge array in cluster array schema.
The array is of type i4
and is 2D with each row representing one edge. First column represents edge tail and second column edge head. Values are row indices into a clnodeset array.
The tensor MD may be empty.
WCT provides the DFP graph node components TensorFileSink
and
TensorFileSource
that persist ITensorSet
through an archive file (Zip
or Tar, with optional compression) using WCT iostreams. The archive
file will contain files with names matching these patterns:
<prefix>tensorset_<ident>_metadata.json <prefix>tensor_<ident>_<index>_metadata.npy <prefix>tensor_<ident>_<index>_array.json
The <prefix>
is arbitrary, the <index>
identifies a tensor set and
<index>
identifies a tensor in a set.
Currently, only the serial variant of the persistent data model is implemented. The general data model is intentionally similar to HDF5 and there is a conceptual mapping between the two:
- HDF5 group hierarchy
$↔$ ITensor
metadata attribute providing a hierarchy path as array of string. - HDF5 group
$↔$ No direct equivalent in that datapath patterns do not imply grouping but rather explicit metadata arrays do. - HDF5 references
$↔$ Aggregation through array of datapath in metadata attribute. - HDF5 dataset
$↔$ ITensor
array. - HDF5 dataspace and datatype
$↔$ ITensor
methodsshape()
,dtype()
, etc. - HDF5 group or dataset attribute
$↔$ ITensor
metadata attribute