-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vote For New Feature: Add Extra fields at the bottom of a Message #175
Comments
What is this boilerplate code in option 2? |
To be honest, now that I've thought about it, it's very unreasonable, due to blocks being quite deep within other messages, it would be something like this, producing a stupid amount of garbage: I'd imagine the following proto:
Boilerplate code:
Our current XDR generator allows just ignoring the fields, and we have a helper method which populates offsets on demand, only when they are needed, so the boilerplate code doesn't necessarily need to compute the offset at the point it converts, and could use a helper method for computing them later, but given we're looping through them, and producing a lot of garbage anyway, I just added computation in the boilerplate code. |
Yes that is a lot of boilerplate and a lot of waste. Looks like this would have also been able to help You could put Offset as a field into the WireBlock message, but not send it. So if we have a message option which allows you to add some string at the end of the message fields.
This would generate go code:
But then all other generated methods would ignore these fields. Any more ideas? |
For my specific case nor Compare nor Equal are used, so having something like you suggested would be perfect. |
Ok I don't have time now, but I also want to hear about another use case for this before creating a feature. |
So I am willing to work on this feature, yet some guidance where this could be wedged in would be nice, as this code is still a maze to me. I guess I need three locations: |
I think their might be usecases where we want to specifically specify each extra field, but I don't know. So I am a little hesitant to move forward at this point. |
Right, I'll fork and implement it anyway, as it will save me from a lot of hassle, so if you could give me some guidance that (where it needs to be plugged in) would be nice. I will make a PR proposing the feature, but I do not expect you to merge it, if you feel there are downsides. |
That sounds great :) Ok so this is where you needs to add the option Its also nice to add a little helper function here The options get parsed by protoc, they are extensions on top of the descriptor. Its all legal protocol buffers. Then I would dive into the generator |
Tyvm |
Pull Request #179 Please if anyone likes this feature, vote for it by making a comment. |
If anyone has a comment about a change or extra use case here, please make also make a comment. |
Couldn't most of these use cases be solved through struct embedding? |
I've provided explanation and the boiler plate code involved. Embedding does not work if you struct with extra field has to be embeded in another proto message |
In general, protobuf types don't behave well when mutually embedded but I guess I am not seeing how the example demonstrates that embedding is problematic. To tell you the truth, it seems like this could be solved by simply sending the offset of the block over. Most of the problems goes away when you just send the extra field. Also, the offset can be extremely helpful in identifying corruption or bugs in your chunking algorithm. I've done the analysis and experimentation on such protocols and removing the offset provides a minimal, if not unrealizable, optimization, that trades off a lot of complexity. In this presented case, there is merely a 20% savings in size. When you add compression and varint encoding, this is even less. Either way, this seems like a lot of work for an optimization that will most definitely not be a bottleneck in this protocol. The bottleneck will be calculating hashes. My 2 cents. Sorry for jumping in here. |
We use the same wire format for storing stuff in the database, not only sending it on the write, which means wasting disk space too. We've been doing fine without offsets stored or sent anywhere for 3 years so even though I appreciate your analysis and experimentation, I don't think it has any meaning in our context when we genuinely know we don't need it. Also, your suggested merely 20% saving translates to something less "merely" for the guy with 20TB of data (https://data.syncthing.net/). |
@AudriusButkevicius No need for the dismissive tone. I'm just trying to suggest a simpler solution that may save you time and avoid a solution that looks that can create other, more subtle problems.
You could remove the field before writing to the database. The records could also be compressed and the extra storage requirements may be negligible.
I've been doing this for awhile, as well, so this isn't just "experimentation". In general, I'm sure it is workable, but the current design from this issue has no way to verify the complete file hash or the offsets of the individual blocks, unless you hashed the entire structure. If offset and block layout were incorrect, it would be impossible to detect without comparing to the source file. For example, if File.Size did not match the sum of block.Size, there would be no way to detect which block introduced the error without the offset. I'm not sure how common this is in practice.
This is an upper bound of 20% for metadata, with probably way less than 10% in practice. Once you take into account all data, the savings is negligible (like <0.0001%, with a 1MB block size). This really only provides benefits when total data is at the exabyte scale, but at this scale, the metadata is still massive, even without the offset, so relative benefit may be low. If you have smaller block sizes, removing the offset will provide more and more benefit. Either way, I think relying on protobuf generation for this solution isn't a great approach when a package-local struct would suffice. |
But that would produce tons of garbage, as I'd have to copy every single block, or use The data is already compressed, but you have to understand that these block structs take up probably 99% of what's in the database.
Our block size is 128kb, and we exchange the whole of metadata every time we connect - literally, every time, yes it's not ideal but that's not a topic for discussion here. If you are on a mobile somewhere in southern africa where data rates are crazy expensive, and you happen to move between antennas, every byte counts.
We have no verification. The database we use provides checksumming itself, so we are sure the data that lives there should not be corrupt, as well as TCP checksums should ensure integrity of the packets.
As I already said above, the amount of garbage local structs would produce is unacceptable. Users run our software on stupidly small devices and yet expect acceptable performance (PI's with 256MB of RAM handling 4TB drive arrays). I am sure that by doing this we'd put the entry bar higher in terms of how much memory you need in order to handle a working set of files sized Y. |
Hey where in Southern Africa are you? Sorry, but now I am going to make this a serious suggestion. How about just putting the Offset field in the WireBlock and not populating it? Or you could have two protobuf definitions. |
I am suggesting this because maintaining a fork is not the funnest thing to do and it might be a while before I see more usecases for this feature. |
So I am not in south africa, but it's not a rare occurrence of people with stupidly slow links complaining how much bandwidth the software consumes. As far as I understand not populating a field still consumes at least 3 bytes (2 bytes for the type information, 1 byte for a value of 0 if I understand correctly) on the wire. Regarding having 2 messages, it either becomes a one way trip or produces garbage:
|
Are you using proto3? |
Because then you can have even less bytes for zero populated fields. |
Yes, it is proto3, but I assume zero populated fields are still not free? |
zero populated fields are free in proto3. |
Ok, so I guess I can just zero them out. Case closed then. Also, you've mentioned that I could define two identical structs, one with less fields to perform partial deserialization. Does this work on proto3, and are there any specific requirements (such as matching field names/ids?) |
The fieldnumber and types should be the same. server.proto message Update {
int64 Version = 1;
repeated File Files = 2;
}
message File {
string FileName = 1;
int64 Size = 2;
repeated Block Blocks = 3;
}
message Block {
bytes Hash = 1;
int64 Size = 2;
} client.proto message Update {
int64 Version = 1;
repeated File Files = 2;
}
message File {
string FileName = 1;
int64 Size = 2;
repeated Block Blocks = 3;
}
message Block {
bytes Hash = 1;
int64 Size = 2;
int64 Offset = 3;
} |
I think this is impossible, but just want to double check.
I am interested in generating additional fields on the structs that gogoproto generates, yet I don't want these fields to be serialized into the write protocol, no matter what the value is.
The use case is as follows:
We slice a file into list of blocks, each block is potentially of a variable size.
We'd like to store the offset of the block within the file on the block struct, yet it's not something we'd want to send over the write as this is something that can be easily computed on the other side, especially if we can have 10 million blocks in a single message.
A few hacks I can think of:
The text was updated successfully, but these errors were encountered: