Tensor Data Model

Introduction

One data model that the Wire-Cell Toolkit (WCT) supports is the tensor data model. This model is factored into two layers:

  • The generic tensor data model is transiently represented as a concrete implementation of ITensorSet and ITensor and persisted according to this document.
  • The specific tensor data model defines conventions on the generic tensor data model in order to map certain other data types to the generic tensor data model.

Generic tensor data model

The generic tensor data model maps between the transient ITensorSet and ITensor model and a persistent one. The elements of the tensor data model are:

  • A tensor set is an aggregation of zero or more tensors and an optional metadata object.
  • A tensor is the combination of an array and a metadata object.
  • A metadata object associates attributes in a structure that follows the JSON data model.
  • An array is a contiguous block of memory holding numeric data that is stored with associated shape, layout order and array element type and size. An array may be empty or null.
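The four elements above can be sketched as plain Python data classes. This is only an illustration of the relationships between them, not the WCT ITensorSet or ITensor interface. The class and field names (Array, Tensor, TensorSet, data, order, and so on) are assumptions made for this example.

```python
import struct
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Array:
    """Contiguous numeric data plus what is needed to interpret it."""
    data: bytes = b""              # raw contiguous block; empty means no elements
    shape: Tuple[int, ...] = ()    # extent of each dimension
    dtype: str = ""                # element type and size code, e.g. "f4" or "i2"
    order: str = "C"               # layout order (hypothetical codes: "C" or "F")

    def is_empty(self) -> bool:
        return len(self.data) == 0


@dataclass
class Tensor:
    """An array combined with a JSON-like metadata object."""
    metadata: Dict[str, Any] = field(default_factory=dict)
    array: Optional[Array] = None  # None models a null array


@dataclass
class TensorSet:
    """Zero or more tensors and an optional metadata object."""
    ident: int = 0
    tensors: List[Tensor] = field(default_factory=list)
    metadata: Optional[Dict[str, Any]] = None


# Four little-endian 32-bit floats as one contiguous block.
samples = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)
with_array = Tensor({"note": "four samples"}, Array(samples, (4,), "f4"))
metadata_only = Tensor({"note": "no array"})
tensor_set = TensorSet(ident=0, tensors=[with_array, metadata_only])
```

The metadata objects are ordinary dictionaries so that they serialize directly to JSON.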

Persistent tensor data models

Two persistent variants are supported:

serial
data is sent through WCT iostreams to/from Zip or Tar archive files.
hierarchical
data is read/written in random access manner to/from HDF5 files.

In both persistent forms a hierarchy is expressed. In the serial form, each ITensorSet or ITensor representation carries its location in the hierarchy as a path-like string in the reserved metadata object attribute datapath. In HDF5, by contrast, the datapath is inherently stored as a first-class HDF5 attribute of an HDF5 dataset and is not stored in the HDF5 dataset metadata.

In the serial form, aggregates of tensors are collected by referencing their datapath values in a metadata object attribute. For example, a tensor set metadata object attribute tensors holds an array of the datapath values of its tensors (see below). Other aggregations are described in the specific tensor data model below. The hierarchical form may utilize HDF5 object or region references.

For I/O optimization, the serial variant will also associate a tensor set with its tensors using a file naming and ordering convention, as illustrated:

tensorset_0_metadata   # the tensor set ident=0 metadata object
tensor_0_0_metadata    # the first tensor metadata object (no array)
tensor_0_1_metadata    # the second tensor metadata object
tensor_0_1_array       # the second tensor array object
tensor_0_2_metadata    # etc....
tensor_0_2_array     
tensor_0_3_metadata  
tensor_0_3_array     
tensor_0_4_metadata  
tensor_0_4_array     
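The listing above can be reproduced with a short Python sketch. The function name member_names and its has_array argument are hypothetical and exist only for this illustration. Prefixes and file extensions are left out, as in the listing.

```python
from typing import List, Sequence


def member_names(ident: int, has_array: Sequence[bool]) -> List[str]:
    """Return archive member names for one tensor set, in stream order.

    has_array[i] says whether tensor i carries an array object.
    """
    names = [f"tensorset_{ident}_metadata"]
    for index, with_array in enumerate(has_array):
        # The metadata object of a tensor always precedes its array object.
        names.append(f"tensor_{ident}_{index}_metadata")
        if with_array:
            names.append(f"tensor_{ident}_{index}_array")
    return names


# Tensor set 0 with five tensors, the first of which has no array.
names = member_names(0, [False, True, True, True, True])
```

Because the tensor set metadata comes first and each tensor's members are adjacent, a reader can rebuild a set in a single pass over the stream.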

Specific tensor data model

The specific tensor data model maps meaning to one or more tensors in terms of other WCT data types by defining a number of conventions on top of the generic tensor data model.

Every tensor has an associated datatype attribute with a string value. Only the values listed below may be used:

pcarray
a PointCloud::Array (pointclouds/<ident>/arrays/<name>)
pcdataset
a PointCloud::Dataset (pointclouds/<ident>)
pcgraph
a PointGraph (pointgraphs/<ident> with pointgraphs/<ident>/{nodes,edges})
trace
one ITrace as 1D array or multiple ITrace as 2D array. (frames/<ident>/traces/<number>)
tracedata
tagged trace indices and summary data. (frames/<ident>/tracedata)
frame
an IFrame as aggregate of traces and/or traceblocks. (frames/<ident>)
cluster
an ICluster (clusters/<ident>)
clnodeset
an array of attributes for set of monotypical ICluster graph nodes.
cledgeset
an array describing a set of ICluster graph edges between all nodes of one type to all nodes of another.

Where pertinent, the recommended datapath root path for the type is given in parentheses.

The tensor set has a datatype of tensorset and is merely a generic container of tensors produced in some context (e.g. an “event”). The tensorset shall have an attribute tensors, whose value is an array of strings, that provides the datapath of all tensors in the set.

A tensor type may be an aggregate. This is a tensor that represents other tensors by collecting their datapath values into a metadata attribute.
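As a minimal sketch of how such datapath references might be followed, the Python below indexes tensor metadata objects by their datapath and then resolves the references held by an aggregate. The helper names index_by_datapath and resolve are hypothetical. The example datapath values follow the recommended patterns listed above.

```python
from typing import Any, Dict, List

Metadata = Dict[str, Any]


def index_by_datapath(tensor_mds: List[Metadata]) -> Dict[str, Metadata]:
    """Map each tensor's datapath to its metadata object."""
    return {md["datapath"]: md for md in tensor_mds}


def resolve(aggregate: Metadata, attribute: str,
            index: Dict[str, Metadata]) -> List[Metadata]:
    """Follow the array of datapath values held in one aggregate attribute."""
    paths = aggregate[attribute]
    missing = [p for p in paths if p not in index]
    if missing:
        raise KeyError(f"unresolved datapath references: {missing}")
    return [index[p] for p in paths]


tensor_mds = [
    {"datatype": "pcarray", "datapath": "pointclouds/0/arrays/x"},
    {"datatype": "pcarray", "datapath": "pointclouds/0/arrays/y"},
]
tensorset_md = {
    "datatype": "tensorset",
    "tensors": ["pointclouds/0/arrays/x", "pointclouds/0/arrays/y"],
}
members = resolve(tensorset_md, "tensors", index_by_datapath(tensor_mds))
```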

The remaining sections describe additional requirements specific to each datatype.

pcarray

The datatype of pcarray indicates a tensor representing one PointCloud::Array. The tensor array information shall map directly to that of Array. A pcarray places no additional requirements on its tensor MD.

pcdataset

The datatype of pcdataset indicates a tensor representing one PointCloud::Dataset. The tensor array shall be empty. The tensor MD shall have the following attributes:

arrays
an object representing the named arrays. Each attribute name provides the array name and each attribute value provides a datapath to a tensor of type pcarray holding the named array.

Additional user application Dataset metadata may reside in the tensor MD.
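For illustration, a pcdataset tensor MD could be built as in the following Python sketch. The function pcdataset_metadata is hypothetical. It assumes the recommended datapath patterns pointclouds/<ident> and pointclouds/<ident>/arrays/<name> are used, which the model recommends but does not require.

```python
from typing import Any, Dict, Iterable, Optional


def pcdataset_metadata(ident: int, array_names: Iterable[str],
                       extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the MD of a pcdataset tensor that refers to its pcarray tensors."""
    root = f"pointclouds/{ident}"
    md: Dict[str, Any] = dict(extra or {})   # user application metadata, if any
    md["datatype"] = "pcdataset"
    md["datapath"] = root
    # Attribute name is the array name, attribute value is the pcarray datapath.
    md["arrays"] = {name: f"{root}/arrays/{name}" for name in array_names}
    return md


md = pcdataset_metadata(0, ["x", "y", "z"], extra={"units": "mm"})
```

The reserved attributes are set after the user metadata is copied, so a user key cannot overwrite datatype, datapath or arrays.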

pcgraph

The datatype of pcgraph indicates a tensor representing a “point cloud graph”. This extends a point cloud to include relationships between pairs of points. The array part of a pcgraph tensor shall be empty. The MD part of a pcgraph tensor shall provide reference to two pcdataset instances with the following MD attributes:

nodes
a datapath referring to a pcdataset representing graph vertex features.
edges
a datapath referring to a pcdataset representing graph edges and their features.

In addition, the pcdataset referred to by the edges attribute shall provide two arrays of integer type with names tails and heads. Each shall provide indices into the nodes point cloud representing the tail and head endpoint of graph edges. A node or edge dataset may be shared between different pcgraph instances.
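The relationship between the tails and heads arrays and the nodes point cloud can be sketched in Python as below. The function edge_pairs is a hypothetical helper. Plain lists stand in for the integer arrays, and the nodes and edges datapath values are made up for the example.

```python
from typing import List, Sequence, Tuple


def edge_pairs(n_nodes: int, tails: Sequence[int],
               heads: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair up tails and heads after checking they index into the nodes."""
    if len(tails) != len(heads):
        raise ValueError("tails and heads must have the same length")
    for name, indices in (("tails", tails), ("heads", heads)):
        for i in indices:
            if not 0 <= i < n_nodes:
                raise IndexError(f"{name} index {i} is outside the nodes point cloud")
    return list(zip(tails, heads))


# MD of a pcgraph tensor; its array part is empty.
pcgraph_md = {
    "datatype": "pcgraph",
    "datapath": "pointgraphs/0",
    "nodes": "pointgraphs/0/nodes",   # datapath of the nodes pcdataset
    "edges": "pointgraphs/0/edges",   # datapath of the edges pcdataset
}

# Three nodes connected in a chain: 0 -> 1 -> 2.
pairs = edge_pairs(3, tails=[0, 1], heads=[1, 2])
```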

trace

The datatype of trace indicates a tensor representing a single ITrace or a collection of ITrace which have been combined.

The tensor array shall represent the samples over a contiguous period of time from traces.

The tensor array shall have dimensionality of one when representing a single ITrace. A collection of ITrace shall be represented with a two-dimensional array with each row representing one or more traces from a common channel. In such a case, the full trace content associated with a given channel may be represented by one or more rows.

The array element type shall be either "i2" (int16_t) or "f4" (float) depending on whether ADC values or signals are represented, respectively.

The tensor MD may include the attribute tbin, an integer value providing the number of sample periods (ticks) between the frame reference time and the first sample (column) in the array.
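The meaning of tbin can be made concrete with a little arithmetic. The Python sketch below computes the time of a given array column from the frame reference time and the sample period, which the frame datatype carries as its time and tick attributes. The function sample_time is hypothetical, and treating a missing tbin as zero is an assumption of this example, not something the model states.

```python
from typing import Any, Dict


def sample_time(frame_time: float, tick: float,
                trace_md: Dict[str, Any], column: int) -> float:
    """Time of one sample (column) of a trace tensor array."""
    tbin = int(trace_md.get("tbin", 0))   # assumption: absent tbin means 0
    # tbin ticks separate the frame time from column 0; each column adds one tick.
    return frame_time + (tbin + column) * tick


# Frame starts at t=0 with a 0.5 tick; the trace array starts 10 ticks later.
t = sample_time(0.0, 0.5, {"datatype": "trace", "tbin": 10}, column=2)
```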

tracedata

The datatype of tracedata provides per-trace information for a subset of traces. It is similar to a pcdataset and in fact may carry that value as the datatype, but it differs as follows.

It defines additional MD attributes:

tag
optional, a trace tag. If omitted or empty string, dataset must span total trace ordering.

The following array names are recognized:

chid
channel ident numbers for the traces.
index
provides indices into the total trace ordering.
summary
trace summary values.

A chid value is required for every trace. If the tracedata has no tag then a chid array spanning the total trace ordering must be provided and neither index nor summary is recognized. If the tracedata has a tag it must provide an index array, may provide a summary array and may provide a chid array, each corresponding to the traces identified by index.
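These rules can be written down as a small checker. The Python sketch below is one reading of them, not WCT code. The function check_tracedata is hypothetical and plain lists stand in for the named arrays. Reporting an index or summary array on an untagged tracedata as a problem, rather than silently ignoring it, is a choice made for this example.

```python
from typing import Any, Dict, List, Sequence


def check_tracedata(md: Dict[str, Any], arrays: Dict[str, Sequence],
                    n_traces: int) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    tag = md.get("tag", "")
    if not tag:
        # Untagged: chid must span the total trace ordering.
        if "chid" not in arrays:
            problems.append("untagged tracedata requires a chid array")
        elif len(arrays["chid"]) != n_traces:
            problems.append("chid must span the total trace ordering")
        for name in ("index", "summary"):
            if name in arrays:
                problems.append(f"{name} is not recognized without a tag")
        return problems
    # Tagged: index is required; summary and chid are optional.
    if "index" not in arrays:
        problems.append("tagged tracedata requires an index array")
        return problems
    index = arrays["index"]
    if any(not 0 <= i < n_traces for i in index):
        problems.append("index refers outside the total trace ordering")
    for name in ("summary", "chid"):
        if name in arrays and len(arrays[name]) != len(index):
            problems.append(f"{name} must have one value per index entry")
    return problems


ok = check_tracedata({"tag": "gauss"}, {"index": [0, 2], "summary": [1.5, 2.5]}, 3)
bad = check_tracedata({}, {"index": [0]}, 3)
```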

frame

The datatype of frame represents an IFrame.

The tensor array shall be empty.

The tensor MD aggregates tensors of datatype trace and tracedata and provides other values as listed:

ident
the frame ident number (required)
tags
an array of string giving frame tags
time
the reference time of the frame (required)
tick
the sample period of the traces (required)
masks
channel mask map (optional)
traces
a sequence of datapath references to tensors of datatype trace. The order of this sequence, along with the order of rows in any 2D trace tensors determines the total order of traces.
tracedata
a sequence of datapath references to tensors of datatype tracedata
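The total order of traces described for the traces attribute can be sketched in Python as follows. The function total_trace_order is hypothetical, and each trace tensor is reduced to its datapath and array shape. The example datapath values follow the recommended frames/<ident>/traces/<number> pattern.

```python
from typing import List, Optional, Sequence, Tuple

TraceRef = Tuple[str, Optional[int]]   # (datapath, row), row is None for a 1D tensor


def total_trace_order(traces: Sequence[Tuple[str, Tuple[int, ...]]]) -> List[TraceRef]:
    """Enumerate traces in total order from (datapath, shape) pairs.

    The pairs must be in the order given by the frame's traces attribute.
    """
    order: List[TraceRef] = []
    for datapath, shape in traces:
        if len(shape) == 1:
            order.append((datapath, None))          # one ITrace
        elif len(shape) == 2:
            # Each row of a 2D trace tensor takes one slot, in row order.
            order.extend((datapath, row) for row in range(shape[0]))
        else:
            raise ValueError(f"{datapath}: trace arrays must be 1D or 2D")
    return order


order = total_trace_order([
    ("frames/0/traces/0", (2, 100)),   # 2D: two rows of 100 samples
    ("frames/0/traces/1", (100,)),     # 1D: a single trace
])
```

A position in this list is what an index array of a tracedata tensor refers to.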

In converting an IFrame to a frame tensor the sample values may be truncated to type "i2".

A frame tensor of type "i2" shall have its sample values inflated to type float when converted to an IFrame.

cluster

The datatype of cluster indicates a tensor representing one ICluster. The tensor array shall be empty. The tensor MD shall have the following attributes:

ident
the ICluster::ident() value.
nodes
an object with attributes of cluster array schema node type code and values of a datapath of a clnodeset. The node type code is in single-letter string form, not ASCII char value.
edges
an object with attributes of cluster array schema edge type code and values of a datapath of a cledgeset. The edge type code is in double-letter string form, not packed short integer.

The cluster tensor MD holds all references required to assemble the nodes and edges into an ICluster. The nodes and edges tensors hold no identifiers and require the cluster tensor to provide context.

clnodeset

The datatype of clnodeset indicates a tensor representing one type of node array in cluster array schema. The array is of type f8 and is 2D with each row representing one node and columns representing node attributes. The tensor MD may be empty.

cledgeset

The datatype of cledgeset indicates a tensor representing an edge array in cluster array schema. The array is of type i4 and is 2D with each row representing one edge. First column represents edge tail and second column edge head. Values are row indices into a clnodeset array. The tensor MD may be empty.
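How a cluster tensor MD gives context to its clnodeset and cledgeset tensors can be sketched in Python as below. This is a simplified illustration with several assumptions: the function typed_edges is hypothetical, nested lists stand in for the 2D arrays, the datapath values are made up, and the two letters of an edge type code are taken to name the tail and head node types in that order.

```python
from typing import Any, Dict, List, Sequence, Tuple

Node = Tuple[str, int]   # (node type code, row index into its clnodeset)


def typed_edges(cluster_md: Dict[str, Any],
                nodesets: Dict[str, Sequence[Sequence[float]]],
                edgesets: Dict[str, Sequence[Sequence[int]]]) -> List[Tuple[Node, Node]]:
    """Resolve cledgeset rows into (tail node, head node) pairs."""
    edges: List[Tuple[Node, Node]] = []
    for code, edge_path in cluster_md["edges"].items():
        tail_type, head_type = code[0], code[1]   # assumed letter order
        n_tail = len(nodesets[cluster_md["nodes"][tail_type]])
        n_head = len(nodesets[cluster_md["nodes"][head_type]])
        for tail, head in edgesets[edge_path]:
            if not (0 <= tail < n_tail and 0 <= head < n_head):
                raise IndexError(f"edge ({tail}, {head}) of type {code} is out of range")
            edges.append(((tail_type, tail), (head_type, head)))
    return edges


cluster_md = {
    "datatype": "cluster", "ident": 0,
    "nodes": {"b": "clusters/0/nodes/b", "w": "clusters/0/nodes/w"},
    "edges": {"bw": "clusters/0/edges/bw"},
}
nodesets = {
    "clusters/0/nodes/b": [[1.0, 2.0]],              # one node of type b
    "clusters/0/nodes/w": [[0.0, 5.0], [1.0, 6.0]],  # two nodes of type w
}
edgesets = {"clusters/0/edges/bw": [[0, 0], [0, 1]]}
edges = typed_edges(cluster_md, nodesets, edgesets)
```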

Tensor archive files

WCT provides the DFP graph node components TensorFileSink and TensorFileSource that persist ITensorSet through an archive file (Zip or Tar, with optional compression) using WCT iostreams. The archive file will contain files with names matching these patterns:

<prefix>tensorset_<ident>_metadata.json
<prefix>tensor_<ident>_<index>_metadata.json
<prefix>tensor_<ident>_<index>_array.npy

The <prefix> is arbitrary, the <ident> identifies a tensor set and the <index> identifies a tensor in a set.
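A reader of such an archive needs to recover the parts of each member name. The following Python sketch does so with a regular expression. The function parse_member_name is hypothetical and not part of WCT. It accepts any file extension and reports it without interpreting it.

```python
import re
from typing import Any, Dict, Optional

_MEMBER = re.compile(
    r"^(?P<prefix>.*?)"                    # arbitrary prefix, as short as possible
    r"(?P<kind>tensorset|tensor)_(?P<ident>\d+)"
    r"(?:_(?P<index>\d+))?"                # present for tensors, absent for tensor sets
    r"_(?P<part>metadata|array)"
    r"(?:\.(?P<ext>\w+))?$"
)


def parse_member_name(name: str) -> Optional[Dict[str, Any]]:
    """Split an archive member name into its parts, or return None if no match."""
    m = _MEMBER.match(name)
    if m is None:
        return None
    parts: Dict[str, Any] = m.groupdict()
    parts["ident"] = int(parts["ident"])
    if parts["index"] is not None:
        parts["index"] = int(parts["index"])
    return parts


ts = parse_member_name("evt-tensorset_7_metadata.json")
tn = parse_member_name("evt-tensor_7_2_array.npy")
```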

Hierarchical files

Currently, only the serial variant of the persistent data model is implemented. The general data model is intentionally similar to HDF5 and there is a conceptual mapping between the two:

  • HDF5 group hierarchy ↔ ITensor metadata attribute providing a hierarchy path as array of string.
  • HDF5 group ↔ No direct equivalent in that datapath patterns do not imply grouping but rather explicit metadata arrays do.
  • HDF5 references ↔ Aggregation through array of datapath in metadata attribute.
  • HDF5 dataset ↔ ITensor array.
  • HDF5 dataspace and datatype ↔ ITensor methods shape(), dtype(), etc.
  • HDF5 group or dataset attribute ↔ ITensor metadata attribute.