Protobuf extension #4039
Conversation
@knoguchi Shading would be good, I think, since we've had problems with protobuf version conflicts in the past -- in particular with Hadoop. Also, are you using this in production?
Thanks for the contribution @knoguchi! Looks like a lot of useful changes. Other than the line comments, some general ones:
- Documentation needs an update: the info on protobuf should move to the extensions section, and info on the new capabilities should be added.
- Since protobuf is currently built in, I think it makes more sense as a core extension than as a contrib extension. If other Druid committers agree then let's move it there.
- Code style is a little bit off in many of these files. If you use IntelliJ, could you run them through a style pass using the code style jar we have in GitHub?
import java.util.Arrays;
import java.util.List;

public class ProtoBufExtensionsModule implements DruidModule
Is ProtoBuf really right? I usually see it styled as "Protobuf" with a lowercase b.
Indeed, Protobuf is the correct casing. I followed the naming convention of the old code to minimize potential breakage.
import java.util.Map;

@JsonTypeName("protobuf")
public class ProtoBufInputRowParser implements ByteBufferInputRowParser {
Don't include commented-out code.
Will do.
import java.nio.ByteBuffer;
import java.util.Map;

@JsonTypeName("protobuf")
I think you don't need both this and registration of a NamedType in the Guice module, although I'm not totally sure.
It's from the old code. I will remove it once I've verified that in tests.
try {
  DynamicMessage message = DynamicMessage.parseFrom(descriptor, ByteString.copyFrom(input));
  json = JsonFormat.printer().print(message);
  // log.trace(json);
Don't include commented-out code.
Will do.
  throw new ParseException(e, "Cannot read " + descriptorFilePath);
}

Descriptor desc = dynamicSchema.getMessageDescriptor(protoMessageType);
It looks like the old code would always take the first message type, and you're adding a protoMessageType to control which one is read. For compatibility with the old code, this should read the first message type if protoMessageType is null.
I will allow protoMessageType to be null and read the first message type only when the descriptor file contains exactly one. This is not exactly backward-compatible behavior, but it's necessary in my opinion. If we simply picked the first descriptor, it would cause nondeterministic problems later in unexpected ways, because the order of the descriptors (message types) in the descriptor file is not controllable by us, as far as I know. One day when you recompile the descriptor file with additional .proto files, the parser may pick an unintended descriptor and stop working. A sketch of the fallback follows below.
}

DynamicSchema dynamicSchema = null;
try {
The old code did DescriptorProtos.FileDescriptorSet.parseFrom(fin), and this new code instead uses DynamicSchema from a new protobuf-dynamic library. What's the rationale for this change? Is it compatible with descriptor files people might have been using with the old code?
The descriptor files are what the protoc compiler generates, and both the old code and my code use the same descriptor files. The rationale is that the old code does not work for nested message types. The protobuf-dynamic library takes care of loading descriptors for the nested message types that are subsequently referenced; with protobuf-dynamic, you only have to specify the top-level message type, which is provided by protoMessageType.
I will see if I can implement the same functionality myself. The fewer dependencies, the better.
  throw new ParseException(e, "Protobuf message could not be parsed");
}

Map<String, Object> record = parser.parse(json);
I guess this means that the parseSpec has to be a "json" parseSpec, since we're parsing protobuf by turning it into json first. This isn't great for performance, but the old code wasn't a paragon of efficiency either, so I'm willing to let it slide.
Although: have you looked into whether it'd be possible to apply the flattenSpec to the DynamicMessage directly? If doable, that should be faster than either this code or the old code.
I'm well aware of the performance issue. What I looked at is the JsonProvider and MappingProvider SPI in the JsonPath lib. The benefit of the JsonProvider is that we could add the flattenSpec not only for Protobuf but potentially for any format, as I discussed at #3505. Unfortunately, the provider interface requires a String or an InputStream.
I actually have ProtobufProvider and ProtobufMapper implementations (WIP) that take Protobuf bytes via InputStream, but they are hacky and still not zero-copy, so I did not include them in this PR.
If we modified JsonPath so it takes an Object type, we could implement zero-copy providers. And if we did, we would probably have to maintain the fork. I don't think the JsonPath author would accept a PR for some random object providers that are not JSON ;-)
@@ -0,0 +1,52 @@
<?xml version="1.0" encoding="UTF-8"?>
Missing license header.
Took a quick look. Same issues as what @gianm pointed out, but generally this looks good to me.
👍
I don't know why Travis failed for the last commit. I just let it re-run, and it looks OK this time. I think I addressed all the points. The remaining issue is performance, but that will be another PR. There is a PR #2354 that seems to perform better (though with an uncommon flatten expression syntax). I might reuse some code from that PR and create the JsonPath Protobuf provider.
Travis is having a timeout or deadlock problem. The test passes on my local machine.
@knoguchi can you fix up the conflicts and we'll get this merged before the next release.
ProtoTestEventWrapper.java is a class generated by the protobuf compiler. It seems the file was further modified in the master branch by some kind of tool (Error Prone?), besides the license header. IMO we shouldn't edit generated code unless absolutely necessary. To fix the conflict, I have to apply the same tool, but I don't know how.
@knoguchi I think it'd be fine to re-generate the file using the generator, and add exclusions for it to the errorprone config. For example, see common/pom.xml, where antlr-generated files are excluded from certain checks. The config looks like this:
@knoguchi can you fix the merge conflicts and I'll merge this.
@fjy the conflict has been resolved.
Hi @knoguchi, I wonder how the version migration is done in this patch, because it looks like the dependency is actually updated in #3682. Some users want to know which Druid version starts supporting protobuf v3 (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-user/DaBeEA8Ca50/PXS7l4nUBQAJ).
Looks like Apache Calcite Avatica, introduced by #3682, requires Protobuf 3. I started working on this PR last year, September or October, before the SQL module was introduced. Either way, the Protobuf 3 com.google.protobuf package is shaded in my extension because Hadoop libraries still depend on Protobuf 2; maybe the shading is no longer necessary. Out of curiosity, I grep'd for protobuf-java: hdfs-storage actually "excluded" protobuf-java.
Ah, maybe there was a version conflict problem before this patch. Cool. Thank you for the explanation.
I've created protobuf-extensions per #3508 and #3509.
- Moved the protobuf code from the processing module to extensions-contrib/protobuf-extensions.
- Added protoMessageType.
- The com.google.protobuf package might need shading because some part of Druid is still using protobuf 2.5.0. I will include the shade plugin in pom.xml if necessary.
Here is an example JSON that works with the Kafka indexing service.