Preface: purpose of this document:
This document is intended for developers of new (or renovating) IPLD libraries. It contains design suggestions based on the experience of building (and rebuilding) IPLD libraries in various languages, and reflecting on the lessons learned. It also contains both notes on practical limitations we've found for implementations, and reflections on how to express things clearly within the type systems of the host language a library is implemented in (whatever that may be).
Preface: limitations of this document:
Since this document is aimed at new libraries, it's also implicitly expecting that the new library might be in a new language. We can't presume to know precisely what that language will enable or encourage! Therefore, there will be limits to how transferable the advice in this document may be. We do expect that the best way to express IPLD concepts may vary based on the language a library is created in. We accept this and try to write this document anyway, and make it as useful as it can be.
These guidelines are written with particular attention to the limitations that are typical to strongly typed languages. (Some of the phrasing used reflect this -- we refer to "types", "enumerations", "interfaces", "packages", etc. However, these concepts can still translate even to languages with varying amounts of compile-time type checking, and indeed even to those with none. While the concepts are certainly not identical across all languages, we hope that they're close enough to be meaningful to a thoughtful reader.)
We expect that common concepts for IPLD libraries will emerge across many languages, and hope that some vocabulary for these concepts is something we can share. Loosely and untyped languages may need to interpret these guidelines appropriately while extracting the key concepts; but even among languages with stricter concepts of compile-time type checking, the meaning of "interface" can vary greatly -- all readers will need to be ready to use their best judgement.
Your IPLD library should have two cornerstone types:
Node
;Kind
.
Node
should be an interface -- the membership should be open
(aka, it should be possible for other packages to implement it).
Kind
should be an enumeration -- a fixed set of named members,
which should not be extendable.
Kind
maps very directly onto the definition of
Data Model Kinds.
Kind
does not include the Schema layer's concept of "struct", etc.
Kind
must be an enum, and not a sum type. Attempting to implement
kind as a sum type conflates it with Node
.
(This may be tempting to try to combine Kind
and Node
into a single
sum type definition if you're only looking at the Data Model layer,
but it is a mistake: both Schema types and Advanced Layouts require
the ability to add more implementations of Node
, so this conflation
will cause cataclysmic problems and force a painful refactor
as soon as you get to implementing those systems.
See the different implementors of Node
section, later in this document, for more information on this.)
Node
is a monomorphized interface for handling data -- in other words,
we make all data look and act like a Node
, so that we can write all of our
functions against the Node
interface, and have that work for any sort of data.
Node
has functions for examining any of the
Data Model Kinds.
For example, this means Node
must be able to
do a key lookup for a map kind,
provide an iterator for a list kind,
or be convertible to a primitive if it's a integer kind.
Node
is generally implemented by making an interface with the superset of all
these methods needed for the various kinds of data.
Some programming languages may also have a pattern-matching faculty which
may make this nicer; feel free to use it (but mind the caveats issued in the
Kind section above, and the
different implementors of Node section below:
the membership of Node
must remain open;
you do not want to use a sum type with a closed list of concrete members here,
or it will cause other roadblocks later that will force a redesign).
For languages where this is most straightforwardly implemented by a single
interface containing the superset of all necessary methods, many of the methods
will error if the Node
refers to information of the wrong kind for that method;
this is fine.
Node
should be clear about what sets of methods are valid for acting on it.
Typically, this is done by a Node.Kind()
method, which should return
a member of the Kind enum.
This information is useful for anyone writing functions which use the Node
interface, because it's much more pleasant (and fast) to check the Kind and
know which methods can be expected to work than it is to have to probe every
method individually for failure.
(Again, programming languages with pattern-matching faculties may find
a cleverer way for their compiler and type system to support this.)
Though the methods on the Node
interface are defined as those necessary for
examining data of the Data Model Kinds,
Node
is not only implemented by the Data Model:
- Yes,
Node
is implemented by types that just hold basic Data Model info; Node
is also implemented by Advanced Data Layouts --- consider a HAMT that spans many separately-seralized chunks of data; it should still be usable as if it's a regular map.
Node
is also implemented by Schema-typed Nodes --- Both if implemented by a single implementation that evalates rules at runtime (so, finite count of implementing types and known at core library compile time)...
- or if handled by codegen/macros (unknown count / open set of implementors of
Node
; not known at core library compile time; may be created in other packages that import the core, rather than core importing them!).
Even further, some libraries may choose to make even more various
implementations of Node
for optimizing performance of specific tasks:
for example, a Node
which implements basic Data Model "map" semantics,
but using some internal algorithm for memory layout which is known to be
efficient for certain workloads;
or for another example, a Node
which is particularly efficient for handling
data of one particular serialization codec, and keeps a lazy-loading skip-tree
over the serialized bytes.
Clearly, neither of these should be the default implementation a library uses,
but clearly, both of them should be able to be used transparently,
With all seven (?! indeed, seven) of these different stories,
we can consider it conclusive that the Node
interface should be ready
to support many, many diverse implementors.
As an IPLD library author, you may be tempted to make a single, "default"
implementation of Node
.
Feel free to do so; but be cautious of giving it special privileges. Try implementing it in a separate package from your core interfaces: this will be a good exercise to make sure other implementations can later do the same. (Since in the order of things you'll do when implementing a new IPLD library, creating this basic default node implementation is likely quite early, going about it in such a way that it forces design choices you'll need later anyway will save you from potentially discovering the need for some costly refactors later!)
If you choose to pursue a distinction between mutable and immutable data in the design of your library, it may be useful to create two separate interfaces for each phase of the data's lifecycle. These might be called "Node" (for the immutable data) and "NodeBuilder" (for the mutating/building phase of the data's life).
It is not necessary to have distinct interfaces for this; a library can also opt to have a mutable concept of "node". Immutable interfaces can be particularly well-suited to IPLD data, though; it's worth considering them.
Almost all features should be implemented to take Node
arguments,
and return Node
values.
Traversals and walks can be implemented in this way: e.g.
function walk(start Node, vistorFn func(visted Node))
.
Selectors can be implemented in this way. (Continue with the idea above for traversals.)
Transformations can be implemented in this way. (Continue with the idea above for traversals.)
Codecs themselves can be implemented this way:
marshalling is a traverse over nodes, so func marshal(obj Node) -> bytes
,
and unmarshalling is something like func unmarshal(bytes, NodeBuilder) -> Node
.
(Note that if your library has a Node
/NodeBuilder
split for immutability purposes,
then of course any operation that builds new nodes,
such as transformations or codecs during unmarshalling,
will have a NodeBuilder
parameter.
If your library has a mutable Node
, these function signatures might appear differently.)
By defining all these functions in terms of Node
, they can be used the same
in any of the various contexts described in the
different implementors of Node section:
- traversals/selectors/transforms/etc work over various codecs (trivially, by transitive property).
- traversals/selectors/transforms/etc work regardless of in-memory layouts
that may vary per
Node
implementation - traversals/selectors/transforms/etc work transparently over ADLs!
- traversals/selectors/transforms/etc work transparently over schemas!
It is also useful to note that by implementing these features over the Node
interface, rather than in the Node
interface, it becomes much more
possible to implement various kinds of e.g. traversal library
(perhaps you'll discover two different ways to go about it,
one with better ergonomics, and one with better performance?);
and it also requires much less code per Node
implementation if things
like traversals are implemented from the outside.