
Protobuf extension #4039

Merged: 27 commits merged into apache:master from knoguchi:protobuf on May 30, 2017

Conversation

@knoguchi (Contributor) commented Mar 13, 2017:

I've created protobuf-extensions per #3508 and #3509.

  1. Factor out the ProtoBuf parser from the processing module into extensions-contrib/protobuf-extensions.
  2. Migrate ProtoBuf from v2 to v3.
  3. Add flattenSpec support.
  4. Support composite (nested) message types. The original implementation assumed that the descriptor file contains only one message type. To specify which message type to read, I added a mandatory key, protoMessageType.
  5. Load the descriptor file from a URL as well as from the classpath. The original implementation expected the descriptor file on the classpath, which was inconvenient with the Kafka indexing service, since it can create indexing jobs dynamically.

The com.google.protobuf package might need shading because some parts of Druid are still using protobuf 2.5.0. I will include the shade plugin in pom.xml if necessary.

Here is an example JSON that works with the Kafka indexing service.
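The parser section is of roughly this shape (a sketch; the descriptor location, message type, timestamp column, and field names below are illustrative, not from a real deployment):

    {
      "type": "protobuf",
      "descriptor": "file:///tmp/metrics.desc",
      "protoMessageType": "Metrics",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            {
              "type": "path",
              "name": "someRecord_subInt",
              "expr": "$.someRecord.subInt"
            }
          ]
        },
        "dimensionsSpec": {
          "dimensions": ["unit", "server"]
        }
      }
    }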

@fjy added this to the 0.10.1 milestone on Mar 24, 2017
@fjy added the Feature label on Mar 24, 2017
@gianm (Contributor) commented Mar 26, 2017:

@knoguchi Shading would be good, I think, since we've had problems with protobuf version conflicts in the past -- in particular with Hadoop.

Also, are you using this in production?

@gianm (Contributor) left a comment:

Thanks for the contribution @knoguchi! Looks like a lot of useful changes. Other than the line comments, some general ones:

  • Documentation needs an update: the info on protobuf should move to the extensions section, and info on the new capabilities should be added.
  • Since protobuf is currently built in, I think it makes more sense as a core extension than as a contrib extension. If other Druid committers agree then let's move it there.
  • Code style is a little off in many of these files. If you use IntelliJ, could you run them through a style pass using the code style jar we have in GitHub?

import java.util.Arrays;
import java.util.List;

public class ProtoBufExtensionsModule implements DruidModule
@gianm:

Is ProtoBuf really right? I usually see it styled as "Protobuf" with a lowercase b.

@knoguchi (Author):

Indeed, "Protobuf" is the correct casing. I followed the naming convention of the old code to minimize potential breakage.

import java.util.Map;

@JsonTypeName("protobuf")
public class ProtoBufInputRowParser implements ByteBufferInputRowParser {
@gianm:

Don't include commented-out code.

@knoguchi (Author):

Will do.

import java.nio.ByteBuffer;
import java.util.Map;

@JsonTypeName("protobuf")
@gianm:

I think you don't need both this and registration of a NamedType in the Guice module, although I'm not totally sure.
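Concretely, the two registration paths in question look roughly like this (a sketch, not the exact PR code; Druid's DruidModule and ByteBufferInputRowParser imports are omitted, and the method bodies are abridged):

    import com.fasterxml.jackson.annotation.JsonTypeName;
    import com.fasterxml.jackson.databind.Module;
    import com.fasterxml.jackson.databind.jsontype.NamedType;
    import com.fasterxml.jackson.databind.module.SimpleModule;
    import com.google.inject.Binder;
    import java.util.Collections;
    import java.util.List;

    // Path 1: the Jackson annotation on the parser class itself.
    @JsonTypeName("protobuf")
    class ProtoBufInputRowParser implements ByteBufferInputRowParser { /* ... */ }

    // Path 2: an explicit NamedType registered through the extension's DruidModule.
    class ProtoBufExtensionsModule implements DruidModule
    {
      @Override
      public List<? extends Module> getJacksonModules()
      {
        return Collections.singletonList(
            new SimpleModule("ProtoBufInputRowParserModule")
                .registerSubtypes(new NamedType(ProtoBufInputRowParser.class, "protobuf"))
        );
      }

      @Override
      public void configure(Binder binder) {}
    }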

@knoguchi (Author):

It's from the old code. I will remove it upon testing.

try {
DynamicMessage message = DynamicMessage.parseFrom(descriptor, ByteString.copyFrom(input));
json = JsonFormat.printer().print(message);
// log.trace(json);
@gianm:

Don't include commented-out code.

@knoguchi (Author):

Will do.

throw new ParseException(e, "Cannot read " + descriptorFilePath);
}

Descriptor desc = dynamicSchema.getMessageDescriptor(protoMessageType);
@gianm:

It looks like the old code would always take the first message type, and you're adding a protoMessageType to control which one is read. For compatibility with the old code, this should read the first message type if protoMessageType is null.

@knoguchi (Author, Mar 26, 2017):

I will allow protoMessageType to be null and read the first message type only when the descriptor file contains exactly one. This is not exactly backward-compatible behavior, but it's necessary in my opinion.

If we simply picked the first descriptor, it could fail non-deterministically in the future, because the order of the descriptors (message types) in the descriptor file is not controllable by us, AFAIK. One day, when you recompile the descriptor file with additional .proto files, the parser may pick an unintended descriptor and stop working.
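Roughly, the fallback would look like this (a sketch; getMessageTypes() is protobuf-dynamic's accessor for the type names in the descriptor set, and the exception type is illustrative):

    import com.github.os72.protobuf.dynamic.DynamicSchema;
    import com.google.protobuf.Descriptors.Descriptor;
    import java.util.Set;

    static Descriptor resolveDescriptor(DynamicSchema dynamicSchema, String protoMessageType)
    {
      String messageType = protoMessageType;
      if (messageType == null) {
        Set<String> messageTypes = dynamicSchema.getMessageTypes();
        if (messageTypes.size() != 1) {
          // Refuse to guess: descriptor order is not stable across recompiles.
          throw new IllegalArgumentException(
              "protoMessageType is required when the descriptor defines more than one message type");
        }
        messageType = messageTypes.iterator().next();
      }
      return dynamicSchema.getMessageDescriptor(messageType);
    }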

}

DynamicSchema dynamicSchema = null;
try {
@gianm:

The old code did DescriptorProtos.FileDescriptorSet.parseFrom(fin) and this new code instead uses DynamicSchema from a new protobuf-dynamic library. What's the rationale for this change? Is it compatible with descriptor files people might have been using with the old code?

@knoguchi (Author, Mar 26, 2017):

The descriptor files are what the protoc compiler generates; both the old code and my code use the same descriptor files. The rationale is that the old code does not work for nested message types. protobuf-dynamic takes care of loading descriptors for the nested message types that are subsequently referenced, so you only have to specify the top-level message type, which is provided by protoMessageType.
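For reference, the load-and-lookup flow is roughly the following (a minimal sketch; "Metrics" stands in for whatever protoMessageType is configured, and class/method names follow the protobuf-dynamic library used in the diff):

    import com.github.os72.protobuf.dynamic.DynamicSchema;
    import com.google.protobuf.Descriptors.Descriptor;
    import java.io.InputStream;

    static Descriptor loadDescriptor(InputStream fin, String protoMessageType) throws Exception
    {
      // Parse the protoc-generated descriptor set; protobuf-dynamic registers
      // nested and referenced message types as part of parsing.
      DynamicSchema dynamicSchema = DynamicSchema.parseFrom(fin);
      // Only the top-level type needs to be named, e.g. "Metrics".
      return dynamicSchema.getMessageDescriptor(protoMessageType);
    }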

I will see if I can implement the same functionality without the library; the fewer dependencies, the better.

throw new ParseException(e, "Protobuf message could not be parsed");
}

Map<String, Object> record = parser.parse(json);
@gianm:

I guess this means that the parseSpec has to be a "json" parseSpec, since we're parsing protobuf by turning it into json first. This isn't great for performance, but the old code wasn't a paragon of efficiency either, so I'm willing to let it slide.

Although: have you looked into whether it'd be possible to apply the flattenSpec to the DynamicMessage directly? If doable, that should be faster than either this code or the old code.

@knoguchi (Author, Mar 26, 2017):

I'm well aware of the performance issue. What I looked at is the JsonProvider and MappingProvider SPI in the JsonPath library. The benefit of the JsonProvider approach is that we could add flattenSpec support not only for Protobuf but potentially for any format, as I discussed in #3505.

Unfortunately, the provider interface requires a String or an InputStream.

I actually have ProtobufProvider and ProtobufMapper implementations (WIP) that take Protobuf bytes via an InputStream, but they're hacky and still not zero-copy, so I did not include them in this PR.

If we modified JsonPath to accept an Object type, we could implement zero-copy providers, but then we would probably have to maintain a fork. I don't think the JsonPath author would accept a PR for arbitrary object providers that are not JSON ;-)
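For context, these are the parse entry points that force the serialization step (abridged from com.jayway.jsonpath.spi.json.JsonProvider; navigation methods omitted):

    import com.jayway.jsonpath.InvalidJsonException;
    import java.io.InputStream;

    public interface JsonProvider
    {
      // Both entry points require already-serialized input, hence the extra
      // copy when the source is a Protobuf message rather than JSON text.
      Object parse(String json) throws InvalidJsonException;
      Object parse(InputStream jsonStream, String charset) throws InvalidJsonException;
      // ... navigation methods (getMapValue, getArrayIndex, ...) then operate
      // on the Object these return.
    }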

@@ -0,0 +1,52 @@
<?xml version="1.0" encoding="UTF-8"?>
@gianm:

Missing license header.

@fjy (Contributor) commented Apr 4, 2017:

Took a quick look. Same issues as what @gianm pointed out, but generally this looks good to me.

@fjy (Contributor) commented Apr 5, 2017:

👍

@knoguchi (Author) commented Apr 6, 2017:

I don't know why Travis failed for the last commit. I let it re-run, and it looks OK this time. I think I've addressed all the points. The remaining issue is performance, but that will be another PR.

There is a PR, #2354, that seems to perform better (though with an uncommon flatten-expression syntax). I might reuse some code from that PR to create the JsonPath Protobuf provider.

@knoguchi (Author):

Travis is having a timeout or deadlock problem. The tests pass on my local machine.

@fjy (Contributor) commented May 25, 2017:

@knoguchi can you fix up the conflicts and we'll get this merged before the next release?

@knoguchi (Author) commented May 25, 2017:

ProtoTestEventWrapper.java is a class generated by the protobuf compiler. It seems the file was further modified by some kind of tool (Error Prone?) in the master branch, beyond the license header. IMO we shouldn't edit generated code unless absolutely necessary. To fix the conflict, I'd have to apply the same tool, but I don't know how.

@gianm (Contributor) commented May 25, 2017:

@knoguchi I think it'd be fine to re-generate the file using the generator and add exclusions for it to the errorprone config. For example, see common/pom.xml, where antlr-generated files are excluded from certain checks. The config looks like this:

                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-compiler-plugin</artifactId>
                        <configuration>
                            <compilerArgs>
                                <!-- Antlr-generated classes miss @Override, that is not easy to fix -->
                                <arg>-Xep:MissingOverride:WARN</arg>
                            </compilerArgs>
                        </configuration>
                    </plugin>

@fjy (Contributor) commented May 30, 2017:

@knoguchi can you fix the merge conflicts and I'll merge this?

@knoguchi (Author):

@fjy the conflict has been resolved.

@fjy merged commit 3400f60 into apache:master on May 30, 2017
@jihoonson (Contributor):

> Migrate ProtoBuf from v2 to v3

Hi @knoguchi, I wonder how the version migration was done in this patch, because it looks like the dependency was actually updated in #3682. Some users want to know which Druid version starts supporting protobuf v3 (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-user/DaBeEA8Ca50/PXS7l4nUBQAJ).

@knoguchi (Author) commented Jul 12, 2017:

It looks like Apache Calcite Avatica, introduced by #3682, requires Protobuf 3. I started working on this PR last September or October, before the SQL module was introduced. Either way, the Protobuf 3 com.google.protobuf package is shaded in my extension because the Hadoop libraries still depend on Protobuf 2. Maybe the shading is no longer necessary.
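The relocation is along these lines (a sketch of a maven-shade-plugin config, not necessarily the extension's exact pom; the shadedPattern is illustrative):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- keep Protobuf 3 classes from clashing with Hadoop's Protobuf 2 -->
              <relocation>
                <pattern>com.google.protobuf</pattern>
                <shadedPattern>shaded.com.google.protobuf</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>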

Out of curiosity I grepped for protobuf-java. hdfs-storage actually "excludes" protobuf-java.

$ git grep -A2 protobuf-java
extensions-contrib/orc-extensions/pom.xml:                    <artifactId>protobuf-java</artifactId>
extensions-contrib/orc-extensions/pom.xml-                </exclusion>
extensions-contrib/orc-extensions/pom.xml-                <exclusion>
--
extensions-core/druid-kerberos/pom.xml:          <artifactId>protobuf-java</artifactId>
extensions-core/druid-kerberos/pom.xml-        </exclusion>
extensions-core/druid-kerberos/pom.xml-        <exclusion>
--
extensions-core/hdfs-storage/pom.xml:                <artifactId>protobuf-java</artifactId>
extensions-core/hdfs-storage/pom.xml-              </exclusion>
extensions-core/hdfs-storage/pom.xml-              <exclusion>
--
extensions-core/protobuf-extensions/pom.xml:      <artifactId>protobuf-java</artifactId>
extensions-core/protobuf-extensions/pom.xml-      <version>${protobuf.version}</version>
extensions-core/protobuf-extensions/pom.xml-    </dependency>
--
extensions-core/protobuf-extensions/pom.xml:      <artifactId>protobuf-java-util</artifactId>
extensions-core/protobuf-extensions/pom.xml-      <version>${protobuf.version}</version>
extensions-core/protobuf-extensions/pom.xml-    </dependency>
--
pom.xml:                <artifactId>protobuf-java</artifactId>
pom.xml-                <version>3.1.0</version>
pom.xml-            </dependency>

@jihoonson (Contributor):

Ah, maybe there was a version conflict problem before this patch. Cool. Thank you for the explanation.

@knoguchi deleted the protobuf branch on July 13, 2017 at 04:09
@knoguchi restored the protobuf branch on July 13, 2017 at 04:10
@knoguchi deleted the protobuf branch on May 15, 2019 at 21:31