CapTP AST / data representation and serialization #3
Comments
FWIW, here are the full encoding rules (well, in example form) for Syrup's on-the-wire representation, taken from a comment in the Racket implementation.

Integers could be simplified to remove a character, btw. @tonyg and I have discussed that:

```diff
-;; (Signed) integers: i<maybe-sign><int>e
+;; (Signed) integers: <maybe-sign><int>i
```

It hasn't been implemented, though. That's one small change worth considering, however.
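To make the proposed change concrete, here is a minimal JavaScript sketch of just the integer rule; the helper names are mine, and the behavior assumed is only what the diff above states:

```js
// Current Syrup integer encoding (bencode-style): i<maybe-sign><int>e
// e.g. 123n -> "i123e", -42n -> "i-42e"
const encodeIntCurrent = (n) => `i${n}e`;

// Proposed form, one byte shorter: <maybe-sign><int>i
// e.g. 123n -> "123i", -42n -> "-42i"
const encodeIntProposed = (n) => `${n}i`;

console.log(encodeIntCurrent(-42n));  // "i-42e"
console.log(encodeIntProposed(-42n)); // "-42i"
```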
syrup.js already decodes this correctly, though I have yet to add a marshaller that encodes (big)ints to it.
Note that on IRC we discussed yet another approach to this, which people liked: ocapn/syrup#2
Interestingly, the two issues were posted by the same person on the same day, so there was some intention behind keeping them separate. The title suggests #5 was nominally requirements-driven (related to abstract syntax, e.g. Preserves), or "what do we need and not need to transmit", while this one is nominally implementation-driven ('bindings' of abstract syntax to concrete syntax, e.g. Syrup, or "how do we render it"). This seems like a natural division to me, if it can be observed, but I haven't checked to see how well the distinction is reflected in the actual issue comments. TBD me: check the #5 comments to see whether keeping concrete-syntax discussion out of #5 is feasible.

Comments, @cwebber or @tsyesika?

By the way, I don't like 'AST'; it's too suggestive of a data structure. 'Abstract syntax' is fine.
We seem to be close on #5, and I think we're close enough that we can probably unblock talking about concrete serialization.

Agoric folks have specified an embedding of their data model into JSON called "smallcaps," and it seems like we can relatively easily extend that to include the full data model. I like it, at least broadly, and they are apparently stuck supporting it, so I'd suggest having this be our "textual" representation, rather than defining Another Thing.

However, it probably isn't good enough for efficiency reasons, particularly when we look at adding ByteStrings. So we should also specify some binary encoding of the data model with better efficiency. The bridge will likely include a version of the data model encoded in capnp, but that is a lot of machinery to include, so instead we should probably define the protocol to use something else and just let that be a bridge thing.

The data model has diverged somewhat from Preserves, so using Syrup as-is is no longer an option. My thinking is to build on top of CBOR in the same way smallcaps builds on top of JSON. The situation is similar: the data model is close, but not exact. We'd need to concoct ways of encoding e.g. Capabilities and Errors into CBOR, and we'd want to impose some restrictions when decoding. E.g. we could specify that CBOR's map type is decoded as a struct, where non-string keys cause decoding to fail, as sketched below.

Thoughts on that general approach?
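To make that decoding restriction concrete, here is a hedged sketch of the kind of check a decoder could apply. It assumes the CBOR library hands over map entries as [key, value] pairs (so duplicate keys, which raw CBOR permits, are still observable); `entriesToStruct` is an invented helper name, not part of any existing library:

```js
// Post-decode check: structs allow only string keys and no duplicates,
// so a CBOR map violating either rule causes decoding to fail.
function entriesToStruct(entries) {
  const struct = Object.create(null);
  for (const [key, value] of entries) {
    if (typeof key !== "string") {
      throw new TypeError(`struct keys must be strings, got ${typeof key}`);
    }
    if (key in struct) {
      throw new TypeError(`duplicate struct key: ${key}`);
    }
    struct[key] = value;
  }
  return struct;
}

// entriesToStruct([["answer", 42]]) -> { answer: 42 }
// entriesToStruct([[1, "x"]])       -> throws TypeError
```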
I don't have any experience with CBOR; it might be fine. Agoric does a lot of Cosmos-SDK stuff, and Cosmos-SDK uses protobuf, so some protobuf tooling is sunk cost for us. My work in #19 was sort of at the wrong level, but I'm inclined to try a protobuf rendition of the data model and check out the costs and benefits. It's nice to get stubs roughly for free.
Now that I think about it, the existing JS protobuf tooling might not be a good fit here... we might want to use some of their low-level encoding APIs but do the high-level tree-walking by hand. The capnproto schema language is roughly as expressive, so that's worth a try too.
If folks are open to protobufs, maybe I have misjudged the appetite for capnp. It would be nice to use capnp for serialization insofar as, in a bridged environment, it's one less thing in the tech stack. Here's a stab at modeling the current state of #5 using capnp schema:

```capnp
# ocapn.capnp
@0xcd301da1d95b8242;

struct Value {
  union {
    ## Atoms ##
    undefined @0 :Void;
    null @1 :Void;
    bool @2 :Bool;
    float64 @3 :Float64;
    unsignedInt @4 :Data;
    # Non-negative integer, in big-endian format.
    negativeInt @5 :Data;
    # Negative integer. Value is `-1 - n`, where `n`
    # is the data interpreted as an unsigned big-endian integer.
    string @6 :Text;
    byteString @7 :Data;
    symbol @8 :Text;  # Might be removed, pending #46.

    ## Containers ##
    list @9 :List(Value);
    struct @10 :List(StructField);
    # Duplicate keys are not allowed; we will need to enforce this
    # at a higher level of abstraction.
    tagged :group {
      label @11 :Text;
      value @12 :Value;
    }
    capability @13 :Cap;
    error @14 :Error;
  }
}

struct StructField {
  key @0 :Text;
  value @1 :Value;
}

interface Cap {
  # TODO
}

struct Error {
  # TODO, pending #10.
}
```
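To make the self-description property of this schema concrete: struct keys travel as ordinary Text data rather than as schema-defined field names, so the (name, value) pairing stays visible on the wire. A plain JavaScript object mirroring the `Value` shape for the ocapn struct `{"answer": 42}` might look like the following; the layout is illustrative only, not an actual capnp message:

```js
// Illustrative shape only: how {"answer": 42} lands in the generic
// Value schema. The key "answer" is ordinary Text data, so a reader
// without ocapn.capnp can still see which bytes name which field.
const value = {
  struct: [                            // Value.struct :List(StructField)
    {
      key: "answer",                   // StructField.key :Text
      value: {
        unsignedInt: Uint8Array.of(42) // big-endian magnitude bytes
      }
    }
  ]
};
```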
I am in favour of CBOR (or msgpack), but I am vehemently against both protobuf and capnproto, for the following reason: neither is self-descriptive on the wire, and schemas always get lost and/or lose something in translation to stubs.* Plus, both protobuf and capnproto assume a build step or build environment where their schema-language interpreter can run. (@kentonv, have you described capnproto's field-packing algorithm anywhere other than in the C++ implementation yet?) The ocapn protocol described in an RFC-like document should be sufficient for @dckc's stubs for Remotables, say, like for an Agoric Issuer Purse?

(* Having to reverse-engineer MIPS32 firmware to figure out a binary protocol whose schema was lost to the sands of time was not fun.)
Self-description for the purpose of reverse engineering is not really an on-or-off thing; it is a spectrum. Protocols based on JSON, CBOR, or msgpack often (but not always) contain textual field names which might offer clues to a human trying to reverse engineer them; protobuf and Cap'n Proto do not. However, Protobuf and Cap'n Proto both still allow you to determine the "shape" of the message tree without knowing the schema; this is still much more information than a completely arbitrary binary encoding provides.

Conversely, there are many JSON protocols which manage to be inscrutable. (E.g. some people intentionally encode objects as tuples to avoid wasting bytes on field names. Others just choose really terrible field names.) You could require that people using CBOR make sure to use intelligible field names. But similarly, you could of course specify a protocol which uses Protobuf or Cap'n Proto but requires every message to contain a copy of the schema. You could even go further and require each message to contain human-readable documentation explaining how to use it.

Of course, at some point the cost of sending this schema and documentation in every message outweighs the benefits it brings in terms of reverse-engineerability. So this is really an argument about trade-offs: what amount of wasted bytes in every single message is "worth it" to make it easier in the case that someone needs to reverse-engineer the protocol? It sounds like you are arguing that field names are worth it. I might agree in some use cases, but certainly not in all cases.

What if, instead, the RPC layer of the protocol defined a standard way to query a peer for their schemas, which implementations were expected to support by default? Then there's no waste in the common case, but you can still get the info you need for reverse engineering. Both Protobuf and Cap'n Proto could easily support such a requirement; all the pieces are already in place to make schemas available automatically.
No, but nothing is stopping someone from reading the code and translating it to prose if desired. It's really not that complicated. (But I personally have no need or desire to standardize Cap'n Proto, so I haven't done it.)
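The schema-query idea above is concrete enough to sketch. The following is hypothetical JavaScript only; `bootstrap`, `getSchemas`, and every field here are invented names for illustration, not part of any existing OCapN, Protobuf, or Cap'n Proto API:

```js
// Hypothetical: a peer exposes a well-known introspection facet that
// returns the schemas for the interfaces it speaks. Messages stay
// compact on the wire, but the schema is recoverable on demand.
async function describePeer(session) {
  const introspector = session.bootstrap("schemas"); // invented facet name
  const schemas = await introspector.getSchemas();   // invented method
  for (const { name, definition } of schemas) {
    console.log(`${name}:\n${definition}`);
  }
}
```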
Quoting Kenton Varda (2023-05-19 00:26:14):

> What if, instead, the RPC layer of the protocol defined a standard way to query a peer for their schemas, which implementations were expected to support by default? Then there's no waste in the common case but you can still get the info you need for reverse engineering. Both Protobuf and Cap'n Proto could easily support such a requirement; all the pieces are already in place to make schemas available automatically.

Note, this is something that we will in fact need to build in order to adequately bridge capnp and ocapn, regardless of what decisions we make about using or not using capnp (or anything else) for binary serialization. I have most of a design for this sketched out in my head, which can be dumped out when it's enough of a priority.

> > @kentonv, have you yet described capnproto's field-packing algorithm anywhere other than in the C++ implementation?
>
> No, but nothing is stopping someone from reading the code and translating it to prose if desired. It's really not that complicated.

Indeed, if this is the only reason for not using it, I will spend the time to sit down and document it.

I will add to this: what's proposed above is a single schema that would be used to encode the self-describing data model into capnp as a binary encoding. It would not actually be any less self-describing than CBOR, since e.g. ocapn struct fields would be encoded as (Text, Value) pairs, not in the C-like nameless layout that a capnp-defined struct uses. We could copy the information in the output of `capnp compile -ocapnp ocapn.capnp` into the spec, and we'd be on equal footing with CBOR, which similarly documents the meaning of otherwise arbitrary tags for lists, strings, etc. in a spec somewhere.
As to capnproto serialization...

No; what I said about protobuf applies equally.

I spent enough time with CBOR to say it's probably fine. I didn't find a JS API for capnproto at the same level as `encoder._pushString()` and such; that is, the API that the code generated by protobuf tools uses. For example:

```js
// writing, with protobufjs's low-level Writer API
const protobuf = require("protobufjs");

const buffer = protobuf.Writer.create()
  .uint32((1 << 3 | 2) >>> 0) // field id 1, wireType 2 (length-delimited)
  .string("hello world!")
  .finish();
```

P.S. As to self-describing: what @kentonv said, especially:

> Protobuf and Cap'n Proto both still allow you to determine the "shape" of the message tree without knowing the schema

This leads to things like the online Protobuf Decoder, which I find indispensable from time to time.
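For symmetry, a sketch of reading the same bytes back with protobufjs's low-level Reader (assuming the `buffer` produced by the snippet above):

```js
// reading: peel off the tag varint, then the length-delimited string
const reader = protobuf.Reader.create(buffer);
const tag = reader.uint32();
console.log(tag >>> 3, tag & 7); // -> 1 2 (field id, wireType)
console.log(reader.string());    // -> "hello world!"
```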
Is the choice to skip past the code generator and write the marshaling code by hand just to avoid the dependency on the code generator? If not, what's the reason behind that?
Four reasons to avoid the dependency on the schema-to-code generator:

1. Both assume a build step and a build environment in which the schema-compiler binaries can run.
2. It limits where the protocol can be implemented.
3. Schemas get lost, leaving only binaries and wire bytes behind.
4. It enlarges what you must trust in order to implement or hack on the protocol.

More on the first point: one anecdotal experience I had with Corbin Simpson's Monte was that its build environment made quite the assumptions about having binaries available (which were legacy x86 64-bit only) and other such things.

More on the second point: I want to allow for this ocapn protocol to be implemented in weird places, such as Minecraft ComputerCraft in-game networks, whatever Roblox and Second Life are supporting, and any place you can program a general-purpose computer.

More on the third point: you would be surprised how often this kind of thing has happened, and where only the executable binary survived. Knowing only the overall 'shape' of the data structure does not help if there is not even a hint of which binary bits belong to which field, or what its datum type is.

More on the fourth point: I blame the Trusting Trust attack for this, and the whole Underhanded C Code Contest, but in a good way. Requiring peeps to set up, say, a Genode+seL4 system to hack on or implement the ocapn protocol might be too much of an ask.
Either way, please do, because when I tried, I could not make heads or tails of that code within the nenna/gumption quota I was willing to spend on it.
Wouldn't it be nice to have a special place in the AST that says "here, you really have to give this a thought", instead of dealing with generic structure and guessing "is that a descriptor, or are you just happy to send me?" for each node (and perhaps considering its context, up and down, too)? I don't see one here.
You're referring to the grammar in the first comment in this issue? Right, it's missing there. It's present in several other sketches, such as the May 17 capnproto sketch above.
It is notably missing from the messages in the spec, and it also isn't used by the test suite.
Which spec? I didn't know we had a spec covering this issue. I think the test suite uses remote references, for example.
Yes, the test and the spec messages do.
OK, I think I see your point now. I don't have much of an opinion; I'm not sure how relevant the Syrup structure will be in the end.
This was closed in the July 2024 meeting. I think we have broad agreement on these topics.
What is the resolution? What do we have broad agreement on? Is it the current state of that one wiki page? Does this decision mean that method calls do not get first-class treatment in the protocol?
There are really two questions:

1. What concrete serialization do we use on the wire?
2. What abstract datatypes does that serialization encode?

Right now, (1) is handled by Syrup and (2) is handled by the abstract types in Preserves. Technically, (1) is just a very simple (but canonicalized) encoding of (2), simple enough to implement in about 3 hours, but there is also a (lossy, due to floats) textual representation and an alternate binary representation (which @tonyg and I have considered replacing with Syrup).

I propose we stick to representing CapTP's AST in terms of the abstract datatypes of Preserves, no matter which encoding we end up ultimately using. The core datatypes that are then used to compose this AST are:

- Booleans
- Floats (single and double precision)
- (Arbitrary-precision, signed) integers
- Strings
- ByteStrings
- Symbols
- Records (a label plus positional fields)
- Sequences (lists)
- Sets
- Dictionaries

I propose that we stick to this as the foundational abstracted set of types on which we build the AST. One advantage of Preserves' abstract types is that they do not on their own specify an encoding; they are a language-oriented representation. Thus it is easy to switch to something else encoding-wise later, as the sketch below illustrates.

(EDIT: I removed "Pointer", which is on the Preserves page, but I don't think it should be there, and it wasn't there when I wrote Syrup, I think. @tonyg, we should talk!)
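To show how little the abstract types commit us to encoding-wise, here is a hedged JavaScript sketch of a CapTP-ish message composed purely from those datatypes; the constructors, the `op:deliver-only` label, and the field layout are illustrative inventions, not the settled grammar:

```js
// Tagged-union representation of the Preserves-style abstract types.
const Sym = (name) => ({ type: "symbol", value: name });
const Int = (n) => ({ type: "integer", value: BigInt(n) });
const Str = (s) => ({ type: "string", value: s });
const Seq = (...items) => ({ type: "sequence", value: items });
const Rec = (label, ...fields) => ({ type: "record", label, fields });

// A CapTP-ish message as a Record: label + positional fields. Any
// encoder (Syrup, CBOR-based, capnp-based) can serialize this tree.
const message = Rec(
  Sym("op:deliver-only"),
  Int(42),                         // hypothetical target position
  Seq(Sym("greet"), Str("hello"))  // hypothetical method + argument
);
```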