-
Notifications
You must be signed in to change notification settings - Fork 2
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to document just each record, or the journal in general? Or do we do a second ZEP for the journal? Eventually it would be nice to have something describing the journal as a whole.
ZEP-XXX-journal-record-format.md
Outdated
## Journal layout | ||
The journal contains one or more segments. Each segment is a file that contains a sequence of records ordered by index. | ||
|
||
|
||
TODO: JournalSegmentDescriptor - first 64 bytes of a segment | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, the title and description state this is about the journal record format, but here we talk about the journal as a whole. What's the focus on the document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer it to document journal record format. But since we discussed about documenting the whole journal, I added a small section here for the descriptor. If we are planning to have another document, then I would say only record format here.
ZEP-XXX-journal-record-format.md
Outdated
0 : End of File (EOF) | ||
1 : Version 1 | ||
|
||
If the value of the frame indicates EOF, then the following bytes do not contain a valid journal record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little uneasy about having only a single byte. So 0 means there is nothing after, and anything else (1-255) is the version. Is this really resilient? I can't think of any strong arguments against it, however, so take it with a grain of salt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really resilient?
Why not?
ZEP-XXX-journal-record-format.md
Outdated
|
||
``` | ||
+ -------------------------------- | ||
| checksum<int64> | length<int32> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't length unsigned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsigned int is interpreted as long in the generated code. Then we have to cast to int always because everywhere length is expected to be int. Besides, there is not much advantage making it unsigned. 2^31 bytes is already too big for a record.
ZEP-XXX-journal-record-format.md
Outdated
Currently the transport layer uses Kryo for serialization. | ||
So even if we replicate the journal record as it is, we are not getting much benefits from it. | ||
We would still need to copy the record before it is send. | ||
However, if we can make use of operating systems' feature to transfer bytes directly from the file to the network it would make sense to transfer the entire journal record as it is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we will get this any time soon (even with gRPC), but I think it's still laudable to reduce the number of copies when possible. So with gRPC we will be able to cut it down copying to 2 - once from file to the netty send buffer (which is direct memory, so still happening out-of-JVM), then once to the socket. Which in the end is still preferable to the current 4 - once to serializedRaftRecord, once to Kryo, once to Netty send buffer, once to socket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can read more about the issue with FileRegion and gRPC here: grpc/grpc-java#1054
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can revisit it when we move to grpc. It is a breaking change anyway when we change the protocol, so we can also change this format.
ZEP-XXX-journal-record-format.md
Outdated
The raft thread that receives this record uses the term and index to verify the preconditions before writing it to the journal. It should then construct a journal record using `index`, `asqn`, `checksum` and `serializedRaftRecord` which can be appended to the journal. Journal is expected to verify if the index and checksum is correct. | ||
|
||
# Rationale and alternatives | ||
[rationale-and-alternatives]: #rationale-and-alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to add a section about alternatives in terms of serialization format - why zero copy? Why SBE and not Flatbuffers or Cap'n'Proto?
I can do that as I already had a look at these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. That will be good.
Co-authored-by: Nicolas Pepin-Perreault <[email protected]>
Co-authored-by: Nicolas Pepin-Perreault <[email protected]>
Co-authored-by: Nicolas Pepin-Perreault <[email protected]>
@miguelpires @npepinpe Please check. We can merge it imo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pointed out some minor typos but nothing blocking.
Co-authored-by: Miguel Pires <[email protected]>
Co-authored-by: Miguel Pires <[email protected]>
Co-authored-by: Miguel Pires <[email protected]>
Co-authored-by: Miguel Pires <[email protected]>
Co-authored-by: Miguel Pires <[email protected]>
Co-authored-by: Miguel Pires <[email protected]>
edc5041
to
5015ff4
Compare
This ZEP describes the new journal record format.