-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
codec: implement protobuf unknown fields checker #6557
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6557 +/- ##
==========================================
+ Coverage 61.29% 61.35% +0.05%
==========================================
Files 509 510 +1
Lines 31625 31769 +144
==========================================
+ Hits 19386 19492 +106
- Misses 10725 10750 +25
- Partials 1514 1527 +13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a great start @odeke-em! Thank you.
I left several comments. At a high-level:
- we don't need to support golang proto, just gogo proto
- we need to handle nested message including nesting via
Any
andoneof
- it seems like you may have theAny
case sufficiently covered, but it needs to be tested to make sure
Thank you for round 1 of reviews @aaronc, I've addressed your feedback and added an extra check to ensure that even if messages have the same field numbers, if their wireTypes don't match, we should trip out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like good progress @odeke-em. Left some more comments.
I gave some more guidance on how to check for nested fields. Basically we need to have data in nested structs that fails validation which means that CheckMismatchedProtoFields
needs to recusively parse messages. Let me know if it would be helpful to chat through how to approach that.
Thanks for that, my mind was coming from a different direction just basing off proto definitions getting written but I've crafted up nested structs and am working on that. Updates coming up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there are some things that still need to be covered here:
- Nesting is insufficiently covered. The one test for nesting just checks one field's wire type, but our primary aim is not to detect wire type changes (which are covered by the core
Unmarshal
pass) but to test for unknown fields - both critical ones and non-critical ones. Also as far as I can tell, there is no coverage of nesting via:Any
oneof
s- repeated fields that nest simple messages,
oneof
s andAny
s
- It seems like that are some potential nesting complex cases that need to be handled via reflection that aren't which include pointers, arrays of structs and pointers, oneof's, arrays of oneof's
,
Any, etc. The
TODO` that is in the code I believe is actually larger than it seems... I wonder if we should maybe consider a schema based approach as an alternative to reflection to handle all these cases. Obviously I understand changing tactics could be frustrated as I know you've spent a lot of work on this approach. But I do think it's worth having a conversation about how to handle all the additional cases and the best way to approach that.
Maybe let's see if we can find a time to chat live?
Thanks for the round of feedback @aaronc I am going to work on in about 1.5hr :) |
So before you move forward can we find a time to sync? Ping me on discord or email. I've thought through it a bit more and unfortunately, I'm pretty sure the reflection approach won't actually work to cover all the cases. I can explain more but I think it would be easier if we sync live. |
nullable = false produces invalid nested structs eg
type A struct {
Field A
}
Is an invalid recursive type
But
type A struct {
Field *A
}
Isn’t
The ones for timestamp too produce invalid structs.
…On Tue, Jul 14, 2020 at 7:27 AM Aaron Craelius ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In codec/testdata/proto.proto
<#6557 (comment)>:
> +}
+
+message Customer3 {
+ int32 id = 1;
+ string name = 2;
+ float sf = 3;
+ float surcharge = 4;
+ string destination = 5;
+
+ oneof payment {
+ string credit_card_no = 7;
+ string cheque_no = 8;
+ }
+
+ Customer1 original = 9;
+}
Which ones introduce invalid go structs? Maybe you're missing this type
declaration: type TestVersion1s []TestVersion1?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6557 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABFL3V57VVDN6D54OMDQGH3R3RTLNANCNFSM4ONBLLZA>
.
|
If you take a look at the generated proto, the wrappers aren’t
gogoproto.Message nor are there FieldDescriptorProto. The only way to walk
through their definitions is to use the OneofIndex, to get the definition
and then walk that struct, we have to us reflection at this point. I
confirmed that this is also what gogoproto does during Unmarshal.
…On Wed, Jul 15, 2020 at 12:07 PM Aaron Craelius ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In codec/enforceproto/extraneous_fields.go
<#6557 (comment)>:
> + fieldBytes := b[2:n]
+ b = b[n:]
+
+ // An unknown but critical field or just a scalar type (aka not a message).
+ if fieldDescProto == nil || fieldDescProto.IsScalar() {
+ continue
+ }
+
+ fieldGoType, ok := protoNumToGoTypesIndex[tagNum]
+
+ switch {
+ case fieldDescProto.OneofIndex != nil: // Oneof field.
+ // Traverse the oneof field, after locating it by its index!
+ oneOfs := extractOneOfTypes(rt)
+ oneOfSampleType := oneOfs[fieldDescProto.GetOneofIndex()]
+ fieldGoType := reflect.TypeOf(oneOfSampleType).Elem()
Why do we need to use reflection at all at this point? We have the
Descriptor, that should have all the information so we no longer need to
introspect go structs at all at this point afaik.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6557 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABFL3V763V3X5ATTDVNZAF3R3X46ZANCNFSM4ONBLLZA>
.
|
Using You should be able to retrieve the
Basically my read on this is that there really isn't much reason to pull in the go field types at all to do this. We just need the |
@aaronc you were right, thank you for the advocacy and calling out the fact that FieldDescriptorProto always gives us the TypeName and we can use that to retrieve the found messageTypes from gogoproto.MessageType. |
For updated performance numbers: $ benchstat proto_unmarshal.txt check_mismatchedfields.txt
name old time/op new time/op delta
Do_serial-8 263ns ± 3% 269ns ± 3% +2.32% (p=0.010 n=10+10)
Do_parallel-8 147ns ± 3% 153ns ± 7% +4.02% (p=0.007 n=10+10)
name old speed new speed delta
Do_serial-8 358MB/s ± 4% 349MB/s ± 3% -2.38% (p=0.009 n=10+10)
Do_parallel-8 642MB/s ± 3% 616MB/s ± 7% -4.03% (p=0.007 n=10+10)
name old alloc/op new alloc/op delta
Do_serial-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10)
Do_parallel-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Do_serial-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10)
Do_parallel-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10) We have:
|
I just did an algorithmic analysis of the program and noticed a path that is O(n) and sped that up by using a dense and constant time look up in fd2e329, and now this pass is faster in every dimension than invoking proto.Unmarshal, which was the behavior that I had postulated I could achieve even before this project began $ benchstat proto.txt check.txt
name old time/op new time/op delta
Do_serial-8 253ns ± 2% 179ns ± 1% -29.28% (p=0.000 n=10+10)
Do_parallel-8 158ns ± 4% 145ns ± 3% -7.80% (p=0.000 n=10+10)
name old speed new speed delta
Do_serial-8 371MB/s ± 2% 525MB/s ± 1% +41.37% (p=0.000 n=10+10)
Do_parallel-8 596MB/s ± 4% 647MB/s ± 3% +8.56% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
Do_serial-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10)
Do_parallel-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Do_serial-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10)
Do_parallel-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10) |
… of O(n) Noticed from reading through the code and algorithmically analyzing the costs per switch when checking if a wireType matches a descriptor type. The appropriate change is by noticing that wireTypes are very small i.e. 0 to 5 but also we can directly know which descType matches up, thus mapping a dense slice indexed by protowire.Type, and to a map of descriptorField_Type to boolean, hence a constant running time and this change now makes this pass faster than doing proto.Unmarshal in every dimension. * After: ```shell $ benchstat proto.txt check.txt name old time/op new time/op delta Do_serial-8 253ns ± 2% 179ns ± 1% -29.28% (p=0.000 n=10+10) Do_parallel-8 158ns ± 4% 145ns ± 3% -7.80% (p=0.000 n=10+10) name old speed new speed delta Do_serial-8 371MB/s ± 2% 525MB/s ± 1% +41.37% (p=0.000 n=10+10) Do_parallel-8 596MB/s ± 4% 647MB/s ± 3% +8.56% (p=0.000 n=10+10) name old alloc/op new alloc/op delta Do_serial-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10) Do_parallel-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Do_serial-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10) Do_parallel-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10) ``` * Before: ``` $ benchstat proto.txt check.txt name old time/op new time/op delta Do_serial-8 263ns ± 3% 269ns ± 3% +2.32% (p=0.010 n=10+10) Do_parallel-8 147ns ± 3% 153ns ± 7% +4.02% (p=0.007 n=10+10) name old speed new speed delta Do_serial-8 358MB/s ± 4% 349MB/s ± 3% -2.38% (p=0.009 n=10+10) Do_parallel-8 642MB/s ± 3% 616MB/s ± 7% -4.03% (p=0.007 n=10+10) name old alloc/op new alloc/op delta Do_serial-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10) Do_parallel-8 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Do_serial-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10) Do_parallel-8 4.00 ± 0% 2.00 ± 0% -50.00% (p=0.000 n=10+10) ``` which is a big speed up in every dimenion by a considerable amount.
…ng indices Improve the context for the next reader to describe the mapping between indices and their various elements. Brings back comments that previously existed when we had the switch.
Also while here, add an equivalence test for walking slices.
Address @fedekunze's feedback about add docs to the various types. While here, also added a doc.go file which will be our entry point when programmers reference this code; also unexported the errors so that our external API now solely consists of the function: CheckMismatchedProtoFields
Addressing points raised by @aaronc Also added a check for scalarity of TYPE_BYTES which would be equivalent to pretty much TYPE_STRING, whose scalarity is special cases when we can't find the appropriate typename.
Addressing some of @aaronc's feedback.
Addressing @aaronc's feedback.
Addressing @aaronc's feedback on the name bikeshed.
…customizer For security, by default, we'll report every single unknown field regardless of whether it is critical or non-critical. However, provide a bulky option to customize this behavior by providing a Checker type whose AllowUnknownNonCriticals value can be toggled, but that'll be opt-in.
* Renmaed CheckMismatchedFields to RejectUnknownFields * Removed the global default in favor of manually creating a checker and invoking RejectUnkownFields on a receiver * Added tests to check for unknown fields in the midst of repeat fields * Updated docs and some more tests
Addressing @anilcse's review feedback.
Kinda tricky for my fork with the new history being applied to the head of my branch and commits, but this is the last commit anyways given the change from master, so I'll still rebase :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, may be a followup PR for cleaning up tests would be great!
// google.protobuf.Timestamp i = 10; | ||
// google.protobuf.Timestamp j = 11; // [(gogoproto.stdtime) = true]; | ||
Customer1 k = 12 [(gogoproto.embed) = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requires cleanup?
// google.protobuf.Timestamp i = 10; | |
// google.protobuf.Timestamp j = 11; // [(gogoproto.stdtime) = true]; | |
Customer1 k = 12 [(gogoproto.embed) = true]; | |
Customer1 k = 10 [(gogoproto.embed) = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we perhaps do this cleanup in a follow-on PR? I ask because in our discussion we wanted some of these fields for test cases but they couldn't work out quite as planned. Important to keep them first so they don't get lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's address in a follow-up PR. I'm adding a note in #6192
@Mergifyio update |
Command
|
@odeke-em please merge master into this branch, we don't seem to have permissions. (Again no need to rebase - just |
Done with the merge, please take a look! |
@odeke-em you should see an "Update branch" button in the PR, you're going to need to keep hitting it everytime there's a commit until this is merged. |
Gotcha, thanks for the heads up! |
Adds a pass that if given a protobuf serialized byte sequence and a proto.Message that
would have been unmarshaled, the pass will traverse the proto.Message and byte sequence
and report mismatches and extraneous fields that were included in the byte sequence.
For performance numbers, I've included benchmarks that show that the pass is dirt cheap
to use even in the path of invoking proto.Unmarshal (which doesn't report extraneous fields)
ref: #6192
Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.
docs/
) or specification (x/<module>/spec/
)godoc
comments.Unreleased
section inCHANGELOG.md
Files changed
in the Github PR explorerCodecov Report
in the comment section below once CI passes