Universal container format based on progressive specialization #23
Comments
I've made some major changes, especially to generalize the terminology:
and removed
It turned out to be a significant challenge to describe this clearly, so I might come back to polish it a little more. Since I think it has reached a reasonably stable form, I would be interested in getting some feedback. Any questions? Suggestions for improvement? Clarifications?
Is this effort part of cid/ipfs/ipld/multiformats or something different/new?
@rotemdan, it's been a couple of years. Curious if you've continued down this path?
This is an idea I suggested several years ago with the purpose of potentially unifying content identifiers and IPLD documents. Basically, having one highly flexible data format that could describe anything, and that would be compact enough to be transmitted as a link (albeit possibly a long one). This means that reasonably simple/small files would not require fetching an additional metadata (IPLD) file from the network. The link would contain all the hashing information and the extra metadata required to safely retrieve and verify the data (not just from IPFS, but also from http, bittorrent or potentially any other protocol). As long as you have the link, you'd still have a chance of safely acquiring the file from somewhere. This is in contrast to IPFS, where once the IPLD document becomes unavailable, the associated data cannot be retrieved or verified. In a sense, it presents a vision that's quite different from the way IPFS was initially designed. It's not bound to a set of predetermined protocols, and is not "locked" to the IPFS ecosystem. Since I never got any comment about this idea from IPFS team members, I'm assuming it either doesn't fit their business model or it is too much of a departure from the founders' original design of the network, to the point they may feel that going in this particular direction would diminish their sense of "ownership" of their own product.
[This is a work-in-progress draft design which has been heavily edited since it was first published]
This is an attempt at designing a highly flexible, yet compact, multipurpose container format that can function as a content/entity identifier, a file header, a part of a protocol message, or even contain both metadata and data by itself.
Basically there's a very simple underlying concept here: that successive type enumerations can be used to progressively "namespace" into more and more specialized contexts describing more fine-grained information. Note these type enumerations don't have to be limited to built-in fields (like `entity domain` or `schema version`) -- they can be dynamically inferred from fields whose semantics are progressively refined by the schema itself (somewhat like a state machine).
(This is mostly an illustrative example of how such a format could be designed, but I did put a lot of thought into it so I think it's a worthwhile read)
It starts with a message encoding identifier (1 character), which can be any one of `raw-binary`, `base64`, `base32` etc:
Now that we're in binary, a version number for the container format (varint):
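The text doesn't pin down which varint encoding is meant. As a minimal sketch, assuming unsigned LEB128 varints (7 data bits per byte, high bit set while more bytes follow, as in Protocol Buffers), the container version could be written and read like this. The function names are only illustrative:

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed encoding)."""
    if value < 0:
        raise ValueError("varints here are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at buf[pos]."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


# Container format version 2 fits in a single byte.
container_version = encode_varint(2)
```

Small numbers (anything below 128) cost one byte, which is what keeps the fixed part of the header short.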
Now a varint for an entity domain identifier (e.g. `file`, `ipfs`, `ipns`, `https`, `bitcoin`, `ethereum` etc.)
And now a varint version number of the schema for the domain (each domain independently maintains its own schema versioning):
Now the base payload (AKA required fields), whose schema is specialized for the particular domain and version number (note that the total length is included so that a client can segment it even if it is unfamiliar with the particular combination):
And now field data (AKA optional fields), in a simplified, protocol-buffer-like encoding (roughly described below):
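To illustrate why including the length matters, here is a rough sketch of how a client might segment the binary part without understanding the domain or schema. It assumes LEB128 varints and that the total length is a varint placed directly before the base payload, counting only the payload bytes; `Header` and `split_container` are hypothetical names:

```python
from typing import NamedTuple, Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint (assumed encoding); return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


class Header(NamedTuple):
    container_version: int
    domain: int
    schema_version: int
    base_payload: bytes     # opaque unless (domain, schema_version) is known
    optional_fields: bytes  # everything after the base payload


def split_container(binary: bytes) -> Header:
    """Segment the binary part (after the 1-character encoding identifier)."""
    pos = 0
    container_version, pos = decode_varint(binary, pos)
    domain, pos = decode_varint(binary, pos)
    schema_version, pos = decode_varint(binary, pos)
    payload_len, pos = decode_varint(binary, pos)  # assumed: length of payload only
    end = pos + payload_len
    if end > len(binary):
        raise ValueError("base payload is truncated")
    return Header(container_version, domain, schema_version,
                  binary[pos:end], binary[end:])
```

A client that doesn't recognize the domain can still hand `base_payload` along untouched and go on to read the optional fields.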
That's all really. It's not bound to contain a hash of any sort, or to be associated with a particular category within a set of predefined codec types.
Example: say we want to encode `[raw-binary, container version 2, IPFS, schema version 1]`, so the first required field would be `resource type`, say it's `UnixFS File`, which in turn would refine the schema further to expect `<dag hash type [varint]>` and `<dag hash [binary string]>` as the following fields.
The base document would look something like:
(Total length: 1 char + 38 bytes)
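One possible byte layout for this example can be sketched as follows. All numeric IDs are made up for illustration (the domain ID for IPFS, the resource type ID for UnixFS File, and the `"r"` encoding character), the hash type reuses the multihash code `0x12` for sha2-256 purely as an assumption, and the hash is assumed to carry no length prefix because its length follows from the hash type. Under those assumptions the binary part happens to come out at 38 bytes; the intended layout may differ:

```python
import hashlib


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed encoding)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


RAW_BINARY_PREFIX = "r"   # hypothetical 1-character encoding identifier
DOMAIN_IPFS = 1           # hypothetical domain ID
RESOURCE_UNIXFS_FILE = 1  # hypothetical resource type ID
HASH_SHA2_256 = 0x12      # multihash code for sha2-256, reused here as an assumption

# Placeholder 32-byte digest standing in for a real DAG root hash.
dag_hash = hashlib.sha256(b"example DAG root").digest()

base_payload = (
    encode_varint(RESOURCE_UNIXFS_FILE)  # resource type
    + encode_varint(HASH_SHA2_256)       # dag hash type
    + dag_hash                           # dag hash, no length prefix (assumption)
)

binary_part = (
    encode_varint(2)                     # container version
    + encode_varint(DOMAIN_IPFS)         # entity domain
    + encode_varint(1)                   # schema version
    + encode_varint(len(base_payload))   # total length of the base payload
    + base_payload
)

print(RAW_BINARY_PREFIX, binary_part.hex(), len(binary_part))
```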
Optional fields:
Each optional field is structured as:
Where the first bit of `data type and field identifier` represents the type and the rest the field identifier (specific to the particular schema), which can grow indefinitely since it's a varint (fitting into a single byte would allow for 6 bits, which can support up to 64 different field IDs).
Data type can be:
(I'm not sure if there's a need for anything else, since booleans can be contained in bitfields and floats can be stored in binary strings)
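A rough sketch of how a single optional field could be serialized. It assumes two data types (varint and length-prefixed binary string, which is consistent with a one-bit type but is not confirmed by the text), LEB128 varints, and that the type bit is the lowest bit of the key value with the field ID in the remaining bits. All names are illustrative:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed encoding)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


TYPE_VARINT = 0  # assumed data type: value is a varint
TYPE_BINARY = 1  # assumed data type: varint length followed by raw bytes


def encode_key(field_id: int, data_type: int) -> bytes:
    """Pack the 1-bit data type and the field ID into a single varint."""
    return encode_varint((field_id << 1) | data_type)


def encode_varint_field(field_id: int, value: int) -> bytes:
    return encode_key(field_id, TYPE_VARINT) + encode_varint(value)


def encode_binary_field(field_id: int, data: bytes) -> bytes:
    return encode_key(field_id, TYPE_BINARY) + encode_varint(len(data)) + data
```

With this packing, a one-byte key has 7 usable bits: one for the type and six for the field ID, so IDs 0 to 63 fit in a single byte and ID 64 is the first to need two.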
So let's say for the example we wanted to add `file size`, `chunking algorithm` and `max chunk size` optional fields to the base CID:
Totals (`file size`: 7 bytes, `chunking algorithm`: 2 bytes, `max chunk size`: 4 bytes). Of course, if the information cannot be represented here (say, chunking is variable), it may simply not be included at all.
Now let's say the user also wants to add a signature for the hash, and that is not supported in the base schema, so they would need to use their own application-specific field identifier in a reserved range (for this example say 4096+ is reserved [4096 is roughly midway within the range available for 2-byte identifiers]).
Even if the client doesn't understand this field, it can safely ignore and skip it since all the length information is available through the encoding itself.
Note that it's possible to standardize identifiers within the range 4096+ as application reserved globally for all domains. This would mean that application-specific fields could be added to a document even if its schema is not understood by the client.
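The skipping behaviour can be sketched end to end. The field IDs (1 to 3), the sample values (a 2^40-byte file, algorithm ID 1, a 262144-byte max chunk size) and the 64-byte placeholder signature are all hypothetical, chosen so that the field sizes match the totals quoted earlier. The wire layout assumes LEB128 varints, a type bit in the lowest bit of the key, and two data types (varint and length-prefixed binary string):

```python
from typing import Dict, Set, Tuple, Union

TYPE_VARINT, TYPE_BINARY = 0, 1                            # assumed data types
FILE_SIZE, CHUNKING_ALGORITHM, MAX_CHUNK_SIZE = 1, 2, 3    # hypothetical field IDs
APP_SIGNATURE = 4096                                       # first application-reserved ID


def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_field(field_id: int, value: Union[int, bytes]) -> bytes:
    """Integers become varint fields, bytes become length-prefixed binary fields."""
    if isinstance(value, int):
        return encode_varint((field_id << 1) | TYPE_VARINT) + encode_varint(value)
    return (encode_varint((field_id << 1) | TYPE_BINARY)
            + encode_varint(len(value)) + value)


def parse_fields(buf: bytes, known_ids: Set[int]) -> Dict[int, Union[int, bytes]]:
    """Decode every field, keeping known IDs and silently skipping the rest."""
    fields: Dict[int, Union[int, bytes]] = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_id, data_type = key >> 1, key & 1
        if data_type == TYPE_VARINT:
            value, pos = decode_varint(buf, pos)
        else:
            length, pos = decode_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated binary field")
            value = buf[pos:pos + length]
            pos += length
        if field_id in known_ids:
            fields[field_id] = value
    return fields


file_size = encode_field(FILE_SIZE, 2 ** 40)
chunking = encode_field(CHUNKING_ALGORITHM, 1)
chunk_size = encode_field(MAX_CHUNK_SIZE, 262144)
signature = encode_field(APP_SIGNATURE, bytes(64))  # placeholder signature bytes

# A client that only knows the three schema fields skips the signature.
parsed = parse_fields(file_size + chunking + chunk_size + signature,
                      known_ids={FILE_SIZE, CHUNKING_ALGORITHM, MAX_CHUNK_SIZE})
```

The parser never needs a schema to find field boundaries: a varint ends at the first byte without the high bit, and a binary string announces its own length.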