-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON Serialization with default value #861
Comments
The supported JSON encoding options: https://github.com/apple/swift-protobuf/blob/master/Sources/SwiftProtobuf/JSONEncodingOptions.swift |
p.s. - adding this as an option would be tricky since JSON encoding is built on the visitor pattern which is where it is skipping default values. |
@thomasvl thanks for your quick reply. Yes, I realized that default value is skipped in the generated code.
|
Is there a way to do this without breaking the API between generated code and the library? I see how old generated code could continue to work with an updated library, but I don't see how new generated code (with modified default handling) could work with an old library. If we can't preserve this compatibility, we would probably want to increment the major version. It's been a while since 1.0, so this is probably okay. We should certainly consider whether there are other issues that might require API breakage that should be adopted along with this. |
What I found is that current we do the check in codegen xx.pb.swift.
What if we add an option to JSONEncodingOptions, then we change
Then API will remain same, and then we do check inside actual visitSingularUInt32Field or visitSingularUInt64Field or visitSingularStringField |
This is a good idea. I'd need to think about how this works for proto2, though. |
The visitor api shouldn't know about the JSONEncodingOptions, we'd need to make a some other (internal) "options" to expose what ever subset makes sense for all possible visitors. |
I think
|
The call to |
Oh, I see. Yes, the original approach I proposed required all visitor to include default value check, which is not really flexible. Another way in my mind. We add a function into Visitor protocol
We provide default implementations as other methods in Visitor protocol
In JSONEncodingVisitor, we can implement it by checking shouldIncludeDefaultValue from options. The last change will be codegen. All traverse function will require to call
I guess this can be also used by formatTextString. |
@seanliu1's idea would have:
I'll have to think about whether this is enough compatibility to avoid the major version bump. |
@tbkka see this comments. #861 (comment) People wont notice any change, unless shouldIncludeDefaultValue is provided.
|
Likely should make a sweep of the options the other languages provide to TextFormat, JSON, and Binary to double checks if there are potentially any other things we might want to support that also would reach to this level of the Swift impl. Performance wise, we could call this once within each generated visitor method, but it is still a lot of calls if the graph has a lot of sub messages, and the call can't be optimized since it is on a generic Visitor. |
Let me know if I can help anything. |
I managed to make some changes based on my initial idea.
I also have the chance to look at
Python and Java all use descriptor to loop through all fields, and insert default value to the dictionary or map. |
Taking a quick look, the visitor seems to have special cased zero/false as the default. That's only true for a message declared in a file with proto3 syntax. A message in proto2 syntax can have any default the .proto file author chooses, so I'm not sure your flow would work out correctly when a proto2 syntax message is used as a field of a proto3 syntax message. |
Thanks for taking a look. I forgot proto2 can have any default value.
Only JsonEncoderVisitor will call Other callsites will remain same.
|
I also looked at other options provided by Java, Python, C++ JSON
Looks like all languages are on the same page, Swift just does not have includeDefault TextFormat
Binary |
Would love some feedback for this seanliu1#2 Thanks |
Left some quick comments, the repeated calls to the visitor also have the potential to slow things down when most fields are the defaults. It might make sense to fetch it once and reuse it in the conditionals (you could also check the flag first since when it is on, wether the value is set or not, you are calling the visit method). But this approach goes back to @tbkka comments before, there are behavior issues with old generated code against the new runtime (and new generated code against the old). So the question of a version change still has to be sorted out. And if there is a version change, then making this a method on visitor vs. doing something more generic like passing options for it into the generated visited code might make things easier to add others in the future (although the same interaction problems would likely exist). |
@thomasvl Thanks, I addressed your comment on repeated calls to the visitor, currently it will only call once at beginning of traverse. |
I think I might have figured out part of how to solve the versioning issue. We could generate two different traverse methods, one that provides the old behavior for use with the old libraries, and one that supports the new behavior. This might look like:
The expected behavior: If you're using an old library or old generated code, defaults are not visited. Defaults can only be controlled if you have a new library and new generated code. The above solves three of the four cases:
The remaining case is for the new library used with old generated code. This requires a way for the new library to call the old |
Hm, so we'd have to generate a travers shim in ever message class, which on all new library uses is wasted code (and if declared public, won't dead strip). For your last case, could the updated library provide an extension to Message with a default impl of the new traverse method that calls the old? Or would that always get used for any generic operation on Messages directly? |
At this moment, I'm happy if we can find any solution that works. I agree it would be nice to avoid this overhead. Certainly, that extra shim could be omitted the next time we change the major version.
Good idea! I think that would do it. I think generic uses and protocol-typed variables will work correctly as long as we declare both |
Any update on this for 2.0? Thanks |
Same as last check, no one has proposed a PR for this. |
Thanks @thomasvl I've seen @seanliu1 achieved something in here: seanliu1#2 Is there any documentation of how to do the setup to get default values in the JSON? Thanks |
The default values would be on the |
Are we okay with making breaking changes to the |
Actually after playing with this a bit I think an
And have the
This allows us to have a common shared traverse implementation
this allows the library to have more flexibility to change the internals of the visitor pattern regardless of which version of the library was used to generate the type conforming to I'm still pretty new to this project so I'm not sure if there are potential pitfalls with this approach. I'm not sure if what I'm proposing would trigger issues like this one |
Yup, 2.0 is allowed breaking changes. As to your other questions - I'll defer to the Apple folks on the general api changes. @tbkka @FranzBusch @Lukasa I do worry a little about how much the compiler will be able to inline compared to the current code and it also seems like you concern about stack size might be justified. Having said that, I do sorta like the idea of only having to generate something more Array like that might hopefully be less actual code, reducing the overall size of the generated code. |
Sounds good I'll put together a PR this week and we can discuss more there, thanks for the quick reply! |
I developed the current API after performance-testing a lot of different alternatives. But I must confess, I never tried something with quite this shape, so I'm very curious to see how well it would work. Like Thomas, I'd be willing to accept a modest performance loss if this could significantly reduce code size. Certainly worth the experiment! I suspect the first performance challenge will be to ensure the array of nodes is statically allocated just once for each message type. |
I'll have to think about this a bit. On the surface level static allocation doesn't really make sense since each node is capturing the value of a given field. Maybe if nodes capture key paths this could work 🤔
Do you have any recommendations for ways I can capture the relative performance of my solution? Are there some good sample protos I could use or some benchmark test cases I could run? |
Oh. I missed that. Hmmm.... That means this array is getting built every time you try to walk the structure? That's going to be hard for binary serialization, in particular, since we actually walk the structure several times (to compute size information that's used to allocate the necessary buffers). And building this array will require a bunch of allocations: a couple for the array itself, and likely one more for each captured existential. Key paths would be worth experimenting with, I think.
We have a Performance suite that Tony Allevato wrote and which I've relied on a lot in the past. It's a collection of shell scripts that can generate standard protos and then compiles and runs benchmarks for both C++ and Swift. Most importantly, it makes some very pretty graphs of the results. 😁 Pretty handy, though the shell scripting for compiling the C++ benchmarks keeps breaking as the protobuf folks keep churning their build system. (Note: A few years back, Thomas suggested we switch the binary encoder to serialize from back-to-front in the buffer; that would only require a single walk to size the entire output, then a walk to actually write the data into the buffer. I keep meaning to set aside time to try implementing this approach, which would also address a more serious performance cliff we currently have in the binary encoder.) |
I played around with keypaths and I think they should work! I've got a rough prototype have been able to make this work for every field type. Here is the general shape I've got so far struct FieldNode<M: Message> {
private let fieldNumber: Int
private let subtype: Subtype
private enum Subtype {
case singularFloat(keypath: KeyPath<M, Float>, defaultValue: Float)
// ...
}
func traverse<V: Visitor>(message: M, using visitor: inout V, options: VisitorOptions) throws {
switch subtype {
case .singularFloat(let keypath, let defaultValue):
let value = message[keyPath: keypath]
if options.contains(.visitDefaults) || value != defaultValue {
try visitor.visitSingularFloatField(value: value, fieldNumber: fieldNumber)
}
// ...
}
}
// MARK: - Factories
public static func singularFloat(_ keyPath: KeyPath<M, Float>, fieldNumber: Int, defaultValue: Float) -> Self {
Self(fieldNumber: fieldNumber, subtype: .singularFloat(keypath: keyPath, defaultValue: defaultValue))
}
// ...
}
extension FooType: Message {
static let fieldNodes: [FieldNode<Self>] = [
.singularFloat(\.someFloat, fieldNumber: 1, defaultValue: 0)
]
} Message/Enum/Map fields are a little trickier but I do have something working for those too. Accessing the fields through a keypath doesn't seem to have a significant performance impact compared to accessing fields directly. If we push this concept a little further we could actually use this to get rid of static let fieldNodes: [Int: FieldNode<Self>] = [
1: .singularFloat(\.someFloat, defaultValue: 0, name: .same("some_float"))
] We'd probably need to use an ordered dictionary though since I assume its important that visitor visit fields in order and we wouldn't want to do any sorting at runtime. |
This is very promising! We'd need to see some measurements (performance and code size) to see how to best take advantage of this, of course. For example, if it's a big code size win but also a significant performance regression, then we might need to give people the option of generating either old-style (fast) or new-style (compact) output. But we'd have to see some actual numbers to inform any such planning. Of course, we all hope that🤞 it's a code size win without being a performance regression -- that would be truly wonderful! This might also give us a way to drop the generated Equatable conformances, which is another source of code bloat; walking this list of field info would let us compare two messages with a general iterator instead of bespoke code. |
You might not need a dictionary to replace |
Yes exactly, you'd still need to the ability to do a lookup for decoding purposes from what I can tell. |
Next step for me is to get this code out of my custom plugin I've written and into a fork of I'll report back my findings when I have some 😊, thank you for your feedback so far! |
One catch for visit is the intermixing of extension ranges. Not sure if you want markers for that in the list of fields, or if we capture the info in another list and then walk the two in parallel to mix the data? (unknown fields just get put at the end, not in order). As far are regeneration goes, once the generator is updated, the |
Actually, we also need to double check how other languages do this option with respect to field presence. i.e. - if the field has presence, does the flag actually do something, or is the flag only honored when the field doesn't have presence? |
One other through - you might be able to get some generated code size savings by doubling |
I did a bunch of performance experimentation over the weekend and here are some findings:
With these in mind this is what I've pivoted to: internal protocol Field<M> {
associatedtype M: Message
func traverse<V: Visitor>(message: M, visitor: inout V) throws
}
fileprivate struct SingularInt32Field<M: Message>: Field {
private let fieldNumber: Int
private let getValue: (M) -> Int32
func traverse<V: Visitor>(message: M, visitor: inout V) throws {
try visitor.visitSingularInt32Field(value: getValue(message), fieldNumber: fieldNumber)
}
}
public struct FieldNode<M: Message> {
private let field: any Field<M>
private let isDefault: (M) -> Bool
internal func traverse<V: Visitor>(message: M, using visitor: inout V) throws {
if !isDefault(message) {
try field.traverse(message: message, visitor: &visitor)
}
}
public static func singularInt32(_ getValue: @escaping (M) -> Int32, fieldNumber: Int, defaultValue: Int32 = 0) -> Self {
Self(field: SingularInt32Field(fieldNumber: fieldNumber, getValue: getValue), isDefault: { getValue($0) == defaultValue })
}
}
extension Message {
public func traverse<V: Visitor>(visitor inout: V) throws {
for node in Self.nodes {
node.traverse(message: self, visitor: &visitor)
}
}
} Generated code: extension SomeProto {
static let fieldNodes: [FieldNode<Self>] = [
.singularInt32({ $0.someInt32 }, fieldNumber: 1),
]
} PerformanceI've been measuring performance by generating a proto with one of every kind of field and encoding it to both binary and json formats, in a cli I built using the release configuration. This isn't that scientific but it gives us a ballpark estimation of the performance loss
I'm still trying to optimize the "binary encode, all fields are unset" case and I'm open to suggestions. Its kind of hard to compete with the status quo since its an inlined function that effectively no-ops in the status quo while we still need to iterate over the array of nodes and dispatch some calls in my proposed implementation. |
Are those initial numbers debug or release? And how is performance compare in the other? i.e. - how much slower is debug how much slower is release? |
Those measurements are taken in a release build. Here are the same measurements in a debug build:
|
Update: Here is a more accurate performance measurement: Release
Debug
I think this is reasonable point to pause runtime performance optimization, especially since if we put name information into Field we can probably make json encoding faster than status quo. I'm going to shift my focus to measuring the bundle size impact and seeing how much that can be optimized. |
I did some testing last night on the impact of my approach on the size of the binary and unfortunately this approach increases the binary size by about 10% rather than decrease. I think this is happening because |
Do you have any advice for measuring bundle size impact? I've tried a few things now and I'm getting either inconsistent or unexpected results. For example I experimented with dropping the So far what I've been doing is creating a macOS cli that depends on my fork of |
Here is a PR with what I've been working on #1504 |
Based on my understanding of the spec, this flag should be ignored for fields with presence.
These two implementations use
I believe these implementations are equivalent, but it's harder to read this out (proto3 optional is implemented as oneof):
|
Developing using iOS12, Swift 5, proto3 . I am about to add an extension which can support to output fields with their default values. I just want to check whether it is already implemented.
Based on proto doc, it looks like
I wonder does swift version has option to output fileds with their default values. I found python version has it
MessageToJson(message, including_default_value_fields=False)
https://developers.google.com/protocol-buffers/docs/reference/python/google.protobuf.json_format-module
The text was updated successfully, but these errors were encountered: