
Protobuf extension #4039

Merged: 27 commits merged into apache:master from knoguchi:protobuf on May 30, 2017

Conversation

@knoguchi (Contributor) commented Mar 13, 2017:

I've created protobuf-extensions per #3508 and #3509.

  1. Factor out the ProtoBuf parser from the processing module into extensions-contrib/protobuf-extensions.
  2. Migrate ProtoBuf from v2 to v3.
  3. Add flattenSpec support.
  4. Support composite (nested) message types. The original implementation assumed that the descriptor file contains only one message type. To specify which message type to read, I added a mandatory key, protoMessageType.
  5. Load the descriptor file from a URL as well as from the classpath. The original implementation expected the descriptor file on the classpath, which was inconvenient with the Kafka indexing service, since it can create indexing jobs dynamically.

The com.google.protobuf package might need shading because some parts of Druid are still using protobuf 2.5.0. I will include the shade plugin in pom.xml if necessary.

Here is an example JSON that works with the Kafka indexing service.
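The parser section is of roughly this shape (a sketch; the descriptor location, message type, timestamp column, and field names below are illustrative, not from a real deployment):

    {
      "type": "protobuf",
      "descriptor": "file:///tmp/metrics.desc",
      "protoMessageType": "Metrics",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            {
              "type": "path",
              "name": "someRecord_subInt",
              "expr": "$.someRecord.subInt"
            }
          ]
        },
        "dimensionsSpec": {
          "dimensions": ["unit", "server"]
        }
      }
    }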

@fjy added this to the 0.10.1 milestone on Mar 24, 2017
@fjy added the Feature label on Mar 24, 2017
@gianm (Contributor) commented Mar 26, 2017:

@knoguchi Shading would be good, I think, since we've had problems with protobuf version conflicts in the past -- in particular with Hadoop.

Also, are you using this in production?

@gianm (Contributor) left a comment:

Thanks for the contribution @knoguchi! Looks like a lot of useful changes. Other than the line comments, some general ones:

  • Documentation needs an update: the info on protobuf should move to the extensions section, and info on the new capabilities should be added.
  • Since protobuf is currently built in, I think it makes more sense as a core extension than as a contrib extension. If other Druid committers agree then let's move it there.
  • Code style is a little off in many of these files. If you use IntelliJ, could you run them through a style pass using the code style jar we have in GitHub?

import java.util.Arrays;
import java.util.List;

public class ProtoBufExtensionsModule implements DruidModule
@gianm:

Is ProtoBuf really right? I usually see it styled as "Protobuf" with a lowercase b.

@knoguchi (Author):

Indeed, "Protobuf" is the correct casing. I followed the naming convention of the old code to minimize potential breakage.

import java.util.Map;

@JsonTypeName("protobuf")
public class ProtoBufInputRowParser implements ByteBufferInputRowParser {
@gianm:

Don't include commented-out code.

@knoguchi (Author):

Will do.

import java.nio.ByteBuffer;
import java.util.Map;

@JsonTypeName("protobuf")
@gianm:

I think you don't need both this and registration of a NamedType in the Guice module, although I'm not totally sure.
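Concretely, the two registration paths in question look roughly like this (a sketch, not the exact PR code; Druid's DruidModule and ByteBufferInputRowParser imports are omitted, and the method bodies are abridged):

    import com.fasterxml.jackson.annotation.JsonTypeName;
    import com.fasterxml.jackson.databind.Module;
    import com.fasterxml.jackson.databind.jsontype.NamedType;
    import com.fasterxml.jackson.databind.module.SimpleModule;
    import com.google.inject.Binder;
    import java.util.Collections;
    import java.util.List;

    // Path 1: the Jackson annotation on the parser class itself.
    @JsonTypeName("protobuf")
    class ProtoBufInputRowParser implements ByteBufferInputRowParser { /* ... */ }

    // Path 2: an explicit NamedType registered through the extension's DruidModule.
    class ProtoBufExtensionsModule implements DruidModule
    {
      @Override
      public List<? extends Module> getJacksonModules()
      {
        return Collections.singletonList(
            new SimpleModule("ProtoBufInputRowParserModule")
                .registerSubtypes(new NamedType(ProtoBufInputRowParser.class, "protobuf"))
        );
      }

      @Override
      public void configure(Binder binder) {}
    }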

@knoguchi (Author):

It's from the old code. I will remove it upon testing.

try {
DynamicMessage message = DynamicMessage.parseFrom(descriptor, ByteString.copyFrom(input));
json = JsonFormat.printer().print(message);
// log.trace(json);
@gianm:

Don't include commented-out code.

@knoguchi (Author):

Will do.

throw new ParseException(e, "Cannot read " + descriptorFilePath);
}

Descriptor desc = dynamicSchema.getMessageDescriptor(protoMessageType);
@gianm:

It looks like the old code would always take the first message type, and you're adding a protoMessageType to control which one is read. For compatibility with the old code, this should read the first message type if protoMessageType is null.

@knoguchi (Author, Mar 26, 2017):

I will allow protoMessageType to be null and read the first message type only when the descriptor file contains exactly one. This is not exactly backward-compatible behavior, but it's necessary in my opinion.

If we simply picked the first descriptor, it could fail non-deterministically in the future, because the order of the descriptors (message types) in the descriptor file is not controllable by us, AFAIK. One day, when you recompile the descriptor file with additional .proto files, the parser may pick an unintended descriptor and stop working.
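Roughly, the fallback would look like this (a sketch; getMessageTypes() is protobuf-dynamic's accessor for the type names in the descriptor set, and the exception type is illustrative):

    import com.github.os72.protobuf.dynamic.DynamicSchema;
    import com.google.protobuf.Descriptors.Descriptor;
    import java.util.Set;

    static Descriptor resolveDescriptor(DynamicSchema dynamicSchema, String protoMessageType)
    {
      String messageType = protoMessageType;
      if (messageType == null) {
        Set<String> messageTypes = dynamicSchema.getMessageTypes();
        if (messageTypes.size() != 1) {
          // Refuse to guess: descriptor order is not stable across recompiles.
          throw new IllegalArgumentException(
              "protoMessageType is required when the descriptor defines more than one message type");
        }
        messageType = messageTypes.iterator().next();
      }
      return dynamicSchema.getMessageDescriptor(messageType);
    }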

}

DynamicSchema dynamicSchema = null;
try {
@gianm:

The old code did DescriptorProtos.FileDescriptorSet.parseFrom(fin) and this new code instead uses DynamicSchema from a new protobuf-dynamic library. What's the rationale for this change? Is it compatible with descriptor files people might have been using with the old code?

@knoguchi (Author, Mar 26, 2017):

The descriptor files are what the protoc compiler generates; both the old code and my code use the same descriptor files. The rationale is that the old code does not work for nested message types. protobuf-dynamic takes care of loading descriptors for the nested message types that are subsequently referenced, so you only have to specify the top-level message type, which is provided by protoMessageType.
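For reference, the load-and-lookup flow is roughly the following (a minimal sketch; "Metrics" stands in for whatever protoMessageType is configured, and class/method names follow the protobuf-dynamic library used in the diff):

    import com.github.os72.protobuf.dynamic.DynamicSchema;
    import com.google.protobuf.Descriptors.Descriptor;
    import java.io.InputStream;

    static Descriptor loadDescriptor(InputStream fin, String protoMessageType) throws Exception
    {
      // Parse the protoc-generated descriptor set; protobuf-dynamic registers
      // nested and referenced message types as part of parsing.
      DynamicSchema dynamicSchema = DynamicSchema.parseFrom(fin);
      // Only the top-level type needs to be named, e.g. "Metrics".
      return dynamicSchema.getMessageDescriptor(protoMessageType);
    }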

I will see if I can implement the same functionality without the library; the fewer dependencies, the better.

throw new ParseException(e, "Protobuf message could not be parsed");
}

Map<String, Object> record = parser.parse(json);
@gianm:

I guess this means that the parseSpec has to be a "json" parseSpec, since we're parsing protobuf by turning it into json first. This isn't great for performance, but the old code wasn't a paragon of efficiency either, so I'm willing to let it slide.

Although: have you looked into whether it'd be possible to apply the flattenSpec to the DynamicMessage directly? If doable, that should be faster than either this code or the old code.

@knoguchi (Author, Mar 26, 2017):

I'm well aware of the performance issue. What I looked at is the JsonProvider and MappingProvider SPI in the JsonPath library. The benefit of the JsonProvider approach is that we could add flattenSpec support not only for Protobuf but potentially for any format, as I discussed in #3505.

Unfortunately, the provider interface requires a String or an InputStream.

I actually have ProtobufProvider and ProtobufMapper implementations (WIP) that take Protobuf bytes via an InputStream, but they're hacky and still not zero-copy, so I did not include them in this PR.

If we modified JsonPath to accept an Object type, we could implement zero-copy providers, but then we would probably have to maintain a fork. I don't think the JsonPath author would accept a PR for arbitrary object providers that are not JSON ;-)
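For context, these are the parse entry points that force the serialization step (abridged from com.jayway.jsonpath.spi.json.JsonProvider; navigation methods omitted):

    import com.jayway.jsonpath.InvalidJsonException;
    import java.io.InputStream;

    public interface JsonProvider
    {
      // Both entry points require already-serialized input, hence the extra
      // copy when the source is a Protobuf message rather than JSON text.
      Object parse(String json) throws InvalidJsonException;
      Object parse(InputStream jsonStream, String charset) throws InvalidJsonException;
      // ... navigation methods (getMapValue, getArrayIndex, ...) then operate
      // on the Object these return.
    }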

@@ -0,0 +1,52 @@
<?xml version="1.0" encoding="UTF-8"?>
@gianm:

Missing license header.

@fjy (Contributor) commented Apr 4, 2017:

Took a quick look. Same issues as what @gianm pointed out, but generally this looks good to me.

@fjy (Contributor) commented Apr 5, 2017:

👍

@knoguchi (Author) commented Apr 6, 2017:

I don't know why Travis failed for the last commit. I let it re-run, and it looks OK this time. I think I've addressed all the points. The remaining issue is performance, but that will be another PR.

There is a PR, #2354, that seems to perform better (though with an uncommon flatten-expression syntax). I might reuse some code from that PR to create the JsonPath Protobuf provider.

@knoguchi (Author):

Travis is having a timeout or deadlock problem. The tests pass on my local machine.

@fjy (Contributor) commented May 25, 2017:

@knoguchi can you fix up the conflicts and we'll get this merged before the next release?

@knoguchi (Author) commented May 25, 2017:

ProtoTestEventWrapper.java is a class generated by the protobuf compiler. It seems the file was further modified by some kind of tool (Error Prone?) in the master branch, beyond the license header. IMO we shouldn't edit generated code unless absolutely necessary. To fix the conflict, I'd have to apply the same tool, but I don't know how.

@gianm (Contributor) commented May 25, 2017:

@knoguchi I think it'd be fine to re-generate the file using the generator and add exclusions for it to the errorprone config. For example, see common/pom.xml, where antlr-generated files are excluded from certain checks. The config looks like this:

                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-compiler-plugin</artifactId>
                        <configuration>
                            <compilerArgs>
                                <!-- Antlr-generated classes miss @Override, that is not easy to fix -->
                                <arg>-Xep:MissingOverride:WARN</arg>
                            </compilerArgs>
                        </configuration>
                    </plugin>

@fjy (Contributor) commented May 30, 2017:

@knoguchi can you fix the merge conflicts and I'll merge this?

@knoguchi (Author):

@fjy the conflict has been resolved.

@fjy merged commit 3400f60 into apache:master on May 30, 2017
@jihoonson (Contributor):

> Migrate ProtoBuf from v2 to v3

Hi @knoguchi, I wonder how the version migration was done in this patch, because it looks like the dependency was actually updated in #3682. Some users want to know which Druid version starts supporting protobuf v3 (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-user/DaBeEA8Ca50/PXS7l4nUBQAJ).

@knoguchi (Author) commented Jul 12, 2017:

It looks like Apache Calcite Avatica, introduced by #3682, requires Protobuf 3. I started working on this PR last September or October, before the SQL module was introduced. Either way, the Protobuf 3 com.google.protobuf package is shaded in my extension because the Hadoop libraries still depend on Protobuf 2. Maybe the shading is no longer necessary.
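The relocation is along these lines (a sketch of a maven-shade-plugin config, not necessarily the extension's exact pom; the shadedPattern is illustrative):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- keep Protobuf 3 classes from clashing with Hadoop's Protobuf 2 -->
              <relocation>
                <pattern>com.google.protobuf</pattern>
                <shadedPattern>shaded.com.google.protobuf</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>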

Out of curiosity I grepped for protobuf-java. hdfs-storage actually "excludes" protobuf-java.

$ git grep -A2 protobuf-java
extensions-contrib/orc-extensions/pom.xml:                    <artifactId>protobuf-java</artifactId>
extensions-contrib/orc-extensions/pom.xml-                </exclusion>
extensions-contrib/orc-extensions/pom.xml-                <exclusion>
--
extensions-core/druid-kerberos/pom.xml:          <artifactId>protobuf-java</artifactId>
extensions-core/druid-kerberos/pom.xml-        </exclusion>
extensions-core/druid-kerberos/pom.xml-        <exclusion>
--
extensions-core/hdfs-storage/pom.xml:                <artifactId>protobuf-java</artifactId>
extensions-core/hdfs-storage/pom.xml-              </exclusion>
extensions-core/hdfs-storage/pom.xml-              <exclusion>
--
extensions-core/protobuf-extensions/pom.xml:      <artifactId>protobuf-java</artifactId>
extensions-core/protobuf-extensions/pom.xml-      <version>${protobuf.version}</version>
extensions-core/protobuf-extensions/pom.xml-    </dependency>
--
extensions-core/protobuf-extensions/pom.xml:      <artifactId>protobuf-java-util</artifactId>
extensions-core/protobuf-extensions/pom.xml-      <version>${protobuf.version}</version>
extensions-core/protobuf-extensions/pom.xml-    </dependency>
--
pom.xml:                <artifactId>protobuf-java</artifactId>
pom.xml-                <version>3.1.0</version>
pom.xml-            </dependency>

@jihoonson (Contributor):

Ah, maybe there was a version conflict problem before this patch. Cool. Thank you for the explanation.

@knoguchi deleted the protobuf branch on July 13, 2017 at 04:09
@knoguchi restored the protobuf branch on July 13, 2017 at 04:10
@knoguchi deleted the protobuf branch on May 15, 2019 at 21:31