ornladios · eisenhauer · Oct 18, 2024 · Oct 17, 2024
diff --git a/source/adios2/toolkit/format/bp5/BP5Base.h b/source/adios2/toolkit/format/bp5/BP5Base.h
@@ -19,6 +19,264 @@
 #pragma warning(disable : 4250)
 #endif
 
+/*
+ *  BP5 Metadata Marshalling is based upon FFS, which provides the
+ *  ability to serialize a C-style pointer-based data structure
+ *  (starting with a base struct) and to deserialize it in-place on
+ *  the receiving side.
+ *
+ *  Normally, in order to use FFS, an application must fully describe
+ *  the base structure using an FMFieldList, where each element
+ *  describes a field in the structure, including the field's name,
+ *  basic type (integer, float, etc.), size and offset from the start
+ *  of the structure.  In "normal" scenarios, like in SST, this is
+ *  straightforward because we're describing a structure that exists
+ *  at compile-time and all of those things are compile-time static.
+ *  However, ADIOS metadata represents information about variables
+ *  that we don't know about until run-time, so if we're going to use
+ *  FFS here, things have to be a bit more dynamic.  In particular,
+ *  we'll represent ADIOS metadata with a "virtual" structure, one
+ *  whose description we'll construct on the fly and which will only
+ *  ever exist virtually, making up offsets as we go.  We just have to
+ *  be careful about keeping things aligned appropriately because we
+ *  want this to land on the receiver and be appropriately aligned
+ *  there.  (Normally the compiler takes care of this, but this
+ *  virtual structure is never seen by a compiler, so we're doing it.)
+ *  The field name that we specify to FFS is also important because we
+ *  use it to communicate a lot of information between writer and
+ *  reader.  While it always contains the variable name, it also
+ *  encodes the variable type (local or global, atomic or array,
+ *  compressed, derived, etc.).  Because the variable name only
+ *  appears in the metametadata (ffs format), this is a great place to
+ *  put more static information about the variable, specifically
+ *  anything that is fixed after definition and doesn't change on a
+ *  per-timestep basis.  More on names later.
+ *
+ *  To accomplish managing the structure on the writer side, we
+ *  principally track two things, the FMFieldList that represents the
+ *  description of the virtual struct, and a malloc'd region where we
+ *  build the virtual struct itself.  While the description is
+ *  interpreted by FFS, the most important thing for BP5 to remember
+ *  is each field's offset because that's where the (meta)data will
+ *  go.  When we Marshal a simple atomic value (local or global), we
+ *  calculate an appropriately aligned new offset in the buffer, add
+ *  a field to the FMFieldList (maintained in Info.MetaFields on the writer),
+ *  and copy the data into the virtual field at that offset in the
+ *  buffer.  On future timesteps, the field already exists, so we just
+ *  use the offset and copy the data into the buffer.  Arrays are a
+ *  bit more complex, but let's start with the simple case.  FFS
+ *  supports substructures, i.e. fields that are themselves
+ *  structures, and we use that feature for all array representations.
+ *  There are several things that may change on a per-timestep basis
+ *  for arrays, including Shape, Count and Offset values (which are
+ *  themselves arrays), and we also need to track the location of the
+ *  related data block (offset in this rank's data segment).  Except
+ *  for Shape (which we assume is set for at least this timestep), all
+ *  of these things are per-block.
+ *
+ *  Back to FFS capabilities for a moment.  FFS's pointer-based
+ *  structures include dynamically-sized arrays, and the size of those
+ *  arrays must be specified by an integer-typed field in that
+ *  structure.  There are three different array lengths required here.
+ *  Shape is of length Dims (how many dimensions the array has),
+ *  DataBlockLocation is of length BlockCount (how many blocks were
+ *  written on this rank), and for Count and Offsets we must have
+ *  those per-block, so the length is Dims*BlockCount.  To satisfy
+ *  FFS's constraints, that means we must have integer fields
+ *  representing all three lengths in the array metadata struct, and
+ *  we need pointers to the dynamic arrays representing Shape, Count,
+ *  Offsets, and DataBlockLocation.  These are the BASE_FIELDS below
+ *  and the FFS FMField entries are BASE_FIELD_ENTRIES in BP5Base.cpp.
+ *  While more complex array metadata entries are necessary, these
+ *  must be the first fields in those structures.  While there can't
+ *  be a static struct declaration for all of the metadata, there is a
+ *  static declaration for the array metadata substructure,
+ *  MetaArrayRec below.  Mostly you'll see this used like this:
+ *
+ *  MetaArrayRec *MetaEntry = (MetaArrayRec *)((char *)(MetadataBuf) +  Rec->MetaOffset);
+ *
+ *  This gives us a nice way of accessing the key fields in an array's
+ *  metadata entry.
+ *
+ *  So, what about more complex arrays?  All of our compression
+ *  operators require the length of the compressed field as input to
+ *  the uncompress operator.  Generally we don't include data block
+ *  length as part of metadata because it's easily calculated from the
+ *  Count values and the length of the data type, but in order to
+ *  support compression we have to communicate it from the writer to
+ *  the reader so we can uncompress.  Therefore every field with an
+ *  operator has as its next field (after BASE_FIELDS) DataBlockSize.
+ *  Like DataBlockLocation, this is per block (and so its FFS
+ *  description also uses BlockCount).  This arrangement is
+ *  represented by the struct MetaArrayRecOperator below.  Note that
+ *  BP5 does not itself use the DataBlockSize in the metadata.  The
+ *  size of the compressed data is returned from the compression
+ *  operator, and is used by BP5 to copy that data into the data
+ *  block, but after that it is only passed to the Uncompress operator
+ *  on the receiving side, so operators like MGard may choose to use
+ *  this differently.
+ *
+ *  The last case is arrays that also have Min/Max stats associated
+ *  with them.  Since this can be combined with operators, that gives
+ *  us two more possible structs for array metadata: a plain array
+ *  with Min/Max or an array with an operator and Min/Max; these are
+ *  represented by the structs MetaArrayRecMM and
+ *  MetaArrayRecOperatorMM below.  Note that MinMax in that struct is
+ *  a char*, but obviously the data type of Min/Max depends upon the
+ *  element type of the array.  How does that work?  The actual size
+ *  in bytes of the MinMax array is BlockCount * sizeof(array element)
+ *  * 2, but in order to avoid introducing yet another integer-typed
+ *  size value into the structure we've gone to some effort in order
+ *  to leverage the existing BlockCount value.  In particular, there
+ *  are a number of FMField lists for the MM and OperatorMM arrays,
+ *  each giving FFS a different element size for the MinMax Array.
+ *  ADIOS types of size 1 use MetaArrayRecMM1List, those of size 2 use
+ *  MetaArrayRecMM2List, etc., up to MetaArrayRecMM16List, which would
+ *  be used by long double.  Note that BP5 doesn't define or support
+ *  MinMax for string, complex, or structure types.
+ *
+ *  For each of the array variations above, when we add the field
+ *  associated with that array to the metadata field list, we specify
+ *  the appropriate FieldList in the FFS "field_type" value, and
+ *  allocate space for the relevant structure in the virtual metadata
+ *  struct we're building.
+ *
+ *  We mentioned field names above; we actually encode a lot of
+ *  information into the FFS field names, including the variable name,
+ *  shape, element_size, ADIOS type, any operator that might be
+ *  applied, the name of the substructure (if the array is a struct
+ *  type), and even the expression that is to be used for derived
+ *  variables.  These are all encoded in different ways; for example,
+ *  the basic shape of the variable is encoded in the three-letter
+ *  prefix of the FFS field name: GlobalValue = "BPg", GlobalArray =
+ *  "BPG", JoinedArray = "BPJ", LocalValue = "BPl", LocalArray = "BPL".
+ *  The details of the encoding are buried in the logic, but the important
+ *  bit is knowing that there's a lot of information there and some of
+ *  it (like the expression) is base64 encoded to avoid having special
+ *  characters in the FFS field name.  From the BP5 point of view,
+ *  anything that can be encoded in the field name is a good thing
+ *  because it travels in the metametadata, not the metadata, so it
+ *  only gets moved around if the field set changes.
+ *
+ *  Speaking of changes, there are some details that are omitted above
+ *  to get the main points across, but let's talk about other details.
+ *  First, when you Put() the first block of an array, we fill out the
+ *  Dims field, init BlockCount to 1, DBCount (the Dims*BlockCount
+ *  value) to Dims and then we malloc memory to hold a copy of the
+ *  Shape, Count and Offset values.  (We need to copy these anyway as
+ *  part of serialization as they must be captured at the time of Put,
+ *  so we can't, say, just reference the values in the VariableBase
+ *  class.)  For LocalArrays, the Shape value stays at a NULL pointer,
+ *  as does the Offset value.  If after the first block there's another
+ *  Put() on that variable, we add 1 to BlockCount, increment DBCount by
+ *  Dims, and realloc() the Count and Offset arrays so that we can add
+ *  the new Count and Offset values after the ones that are already
+ *  there.  This means that the Count values for block 1 start at
+ *  Count[Dims], for block 2 they start at Count[2*Dims], etc.  At the
+ *  end of the timestep after using FFSencode() to serialize the
+ *  metadata, FMfree_var_rec_elements() is used to free() all these
+ *  subarrays that we've malloc'd.  It understands the structure of
+ *  our entire Metadata structure, walks the field list and
+ *  deallocates appropriately.  Once this has been done, we can
+ *  memset() the whole metadata structure back to zeros and we're
+ *  ready to start again.  (All pointers NULL and counts are zero.)
+ *
+ *  When we do start again with the next timestep, we don't start from
+ *  scratch with a new Fieldlist and virtual structure, but instead
+ *  try to reuse the old one.  The anticipation is that step-based HPC
+ *  applications are highly regular and the set of variables that is
+ *  output on step N+1 is likely the same as what was output for
+ *  step N.  So when we get a Put() for a variable, we look up its
+ *  entry in internal bookkeeping and if it has an entry in the
+ *  structure we reuse it, putting the appropriate data in the virtual
+ *  structure as described above.  This is fine if we write the exact
+ *  same set of variables in subsequent steps, but what if we don't?
+ *  Well, if we write a new variable, then the procedure above
+ *  happens, but we also take steps to make sure that we generate new
+ *  MetaMetaData (i.e. re-register the format with FFS).  We do this
+ *  by setting the Info.MetaFormat value to NULL.
+ *
+ *  Handling a non-written variable is done differently. We don't
+ *  really want to bear the cost of new MetaMetaData frequently
+ *  (because MetaMetaData can be big), so instead we're willing to
+ *  bear the costs of not using some of the data in the virtual
+ *  structure.  So if the app Puts an atomic variable on timestep N,
+ *  but skips it on N+1, we essentially leave that fraction of the
+ *  metadata buffer unused in N+1.  It's transmitted or stored, but it
+ *  doesn't contain anything useful.  But the reader still needs to
+ *  know that it wasn't written, so BP5 metadata carries with it a
+ *  bitmap showing if a variable that is part of the metadata has
+ *  actually been written and is valid.  This bitmap, contained in the
+ *  BitField[BitFieldCount] fields in the MetadataFieldList, is the
+ *  ultimate authority as to what has been written.  Variables are
+ *  assigned an index in order when they are first entered into
+ *  metadata and if the bit at that index isn't set, that variable
+ *  wasn't written on that timestep.
+ *
+ *  Now, this does bring up a vulnerability with BP5.  If an
+ *  application were to write a lot of variables on one step and then
+ *  never use them again, we might end up with a big metadata block
+ *  that mostly carried unused (junk) bytes.  We have not yet run into
+ *  this in a real application, so it isn't specifically handled.  In
+ *  an ideal world, one would look at the "occupancy rate" of
+ *  metadata in EndStep() and make a decision that for either this
+ *  timestep or the next, we'd start from scratch with an empty field
+ *  list.  There's a tradeoff here.  Do this too often and we've got
+ *  big MetaMetadata costs; do it too little and our metadata has a
+ *  lot of useless bytes.  Future work.  Note that this is mostly a
+ *  writer-side thing to fix/optimize.  The reader will appropriately
+ *  handle new metadata, including new metametadata.
+ *
+ *  The stuff above applies to ADIOS variables, but attributes are
+ *  always handled separately.  In the initial FFS-marshalling
+ *  implementation, Attributes, while separate, were handled very
+ *  similarly to variables.  That is, there was a field list and
+ *  virtual structure maintained where we entered attributes much like
+ *  the global and local values described above.  There was
+ *  metametadata generated for it and it was moved around like other
+ *  metametadata blocks.  This old way of doing things is still
+ *  present in the code and gets used if MarshalAttribute is called by
+ *  the engine.  Engines that use this marshal all attributes in
+ *  EndStep(), calling MarshalAttribute for all attributes and only
+ *  doing this when some attribute has changed.  The resulting
+ *  Attribute data always contains *all* the current attribute values,
+ *  a situation that works out well for engines like SST where readers
+ *  might join after timestep 0.  The SST writer can save the most
+ *  recent Attribute data block and provide it to a newly-joined
+ *  reader so that it has all available attributes.
+ *
+ *  However, this encoding mechanism has some significant
+ *  disadvantages in almost all situations.  This separation of
+ *  metametadata and metadata was designed for Variables, where the
+ *  set of variables was likely to be reused without changes
+ *  repeatedly.  However, attributes aren't like that, particularly in
+ *  the original situation where attributes once set can never change.
+ *  Then we only marshal when an attribute is added, every such change
+ *  generates new MetaMetadata, and the combined MetaMetadata +
+ *  Metadata size is always going to be bigger than with
+ *  some simpler encoding mechanism.  So, the BP5 file engine now does
+ *  things differently.  It calls OnetimeMarshalAttribute() which uses
+ *  a simpler FFS representation for attributes with the attribute
+ *  "name" being part of the data, not part of the metametadata as it
+ *  is with variables.  This means that the metametadata never
+ *  changes, so we don't have the same issues as with the prior
+ *  approach.  That metametadata struct (BP5AttrStruct) describes a
+ *  relatively simple structure with two lists, one for attributes of
+ *  any non-string type, and the other a list of string and
+ *  array-of-string attributes.  Generally we only want attributes to
+ *  appear here when they change, so the BP5Writer calls
+ *  OnetimeMarshalAttribute whenever it gets the NotifyEngineAttribute
+ *  call (whenever an attribute changes).  However it also gets called
+ *  in BeginStep if that step is the first ever called, because some
+ *  attributes may have been defined before the engine was ever
+ *  created.  In BP5 file, attribute blocks then only ever contain an
+ *  attribute once, unless the attribute changes in which case it will
+ *  appear again.  This is not such a good situation for SST because
+ *  of the late-coming-reader issue, so SST still uses the old
+ *  marshaling mechanism.
+ *
+ */
+
 namespace adios2
 {
 namespace format
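
The writer-side bookkeeping that the comment in this hunk describes (aligned offsets in a virtual struct, per-block Count/Offsets arrays of length Dims*BlockCount, and a "was it written this step" bitmap) can be illustrated with a small standalone sketch. This is not the BP5 implementation and it does not use FFS: every name below (`AlignUp`, `SketchField`, `SketchArrayRec`, `SketchMetadata`, `AppendBlock`) is hypothetical, `std::vector` stands in for the malloc/realloc'd buffers, and aligning each field to its own element size is an assumption made for illustration.

```cpp
// Hypothetical sketch only: none of these names are the real ADIOS2 or FFS API.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Round `offset` up to the next multiple of `align` (must be a power of two).
inline std::size_t AlignUp(std::size_t offset, std::size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

// One entry in the hand-built description of the virtual struct.
struct SketchField
{
    std::string Name;   // illustrative only; real names encode much more
    std::size_t Size;   // element size in bytes
    std::size_t Offset; // aligned offset inside the metadata buffer
};

// Per-array record: Count/Offsets hold Dims * BlockCount entries.
struct SketchArrayRec
{
    std::size_t Dims = 0;
    std::size_t BlockCount = 0;
    std::size_t DBCount = 0; // always Dims * BlockCount
    std::vector<std::size_t> Count;
    std::vector<std::size_t> Offsets;
};

// Append one block; block b's values then start at index b * Dims.
inline void AppendBlock(SketchArrayRec &rec, const std::vector<std::size_t> &count,
                        const std::vector<std::size_t> &offsets)
{
    if (rec.BlockCount == 0)
        rec.Dims = count.size();
    assert(count.size() == rec.Dims && offsets.size() == rec.Dims);
    rec.Count.insert(rec.Count.end(), count.begin(), count.end());
    rec.Offsets.insert(rec.Offsets.end(), offsets.begin(), offsets.end());
    rec.BlockCount += 1;
    rec.DBCount += rec.Dims;
}

class SketchMetadata
{
public:
    // First Put of an atomic value: choose an aligned offset and record the
    // field. The returned index doubles as the field's bit position.
    std::size_t AddField(const std::string &name, std::size_t size)
    {
        const std::size_t offset = AlignUp(m_Buffer.size(), size);
        m_Buffer.resize(offset + size, 0);
        m_Fields.push_back({name, size, offset});
        m_Written.push_back(false);
        return m_Fields.size() - 1;
    }

    // Every Put: copy the value to the field's offset and mark it written.
    void Put(std::size_t index, const void *data)
    {
        const SketchField &f = m_Fields.at(index);
        std::memcpy(m_Buffer.data() + f.Offset, data, f.Size);
        m_Written[index] = true;
    }

    // End of step: zero the buffer and the bits, but keep the field list so
    // the next step can reuse the same layout.
    void ResetForNextStep()
    {
        m_Buffer.assign(m_Buffer.size(), 0);
        m_Written.assign(m_Written.size(), false);
    }

    const SketchField &Field(std::size_t index) const { return m_Fields.at(index); }
    bool WasWritten(std::size_t index) const { return m_Written.at(index); }
    std::size_t FieldCount() const { return m_Fields.size(); }

private:
    std::vector<unsigned char> m_Buffer; // the "virtual struct"
    std::vector<SketchField> m_Fields;   // its description
    std::vector<bool> m_Written;         // per-field written bitmap
};
```

In this sketch, a field that is skipped on a later step keeps its slot in the buffer and simply has its bit left clear, which mirrors the tradeoff the comment describes: no new description is needed, at the cost of carrying unused bytes.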