Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Thrift support for realtime and batch ingestion #3418

Merged
merged 3 commits into from
Dec 13, 2016

Conversation

du00cs
Copy link
Contributor

@du00cs du00cs commented Sep 1, 2016

In our company thrift is used heavily, thrift serialized objects are stored to HDFS as SequenceFile and bytes to Kafka. I try to make it easier for thrift data to be ingested into druid without ETL in simple case.
Thanks to the flexible JSONSpec, the Implementation is rather simple, deserialize thrift bytes to thrift object and then serialize it to json. And then parse it with JSONSpec.

Is anybody using thrift too? Hope this useful for somebody like me.

@du00cs du00cs force-pushed the feature-thrift branch 3 times, most recently from 1c216f8 to 205becc Compare September 6, 2016 09:18
@fjy fjy added this to the 0.9.3 milestone Sep 13, 2016
@fjy fjy added the Feature label Sep 13, 2016
{
this.protocol = protocol;

if (protocol == null || protocol.equals("compact")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use Jackson to deserialize the protocol into a TCompactProtocol or TBinaryProtocol? this iwll avoid the if/else check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any example? I am not familiar with Jackson and don't know how to make it after some search.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjy Sorry, but how to do that? Any tips?

@fjy
Copy link
Contributor

fjy commented Nov 8, 2016

@du00cs there's some overlap between this and https://github.com/druid-io/druid/pull/3633/files

Can we come to an agreement around which version to use?

@du00cs
Copy link
Contributor Author

du00cs commented Nov 9, 2016

@fjy Protocol aware is awesome and I missed thread safe problem of deserializer. Something should be checked more carefully like how and when a thrift class is added to classpath. Implement a TBase to Map is too complicated and error prone. I should argue that using JSONPath will make it much simpler and more expressive and be consistent with what JsonSpec can do.

@justsmoke
Copy link

@du00cs I just had a look at your code, specify thrift jar in runtime is really good, and serialize to simple json can solve the problem of nested field, so I think your solution is better. Also, I noticed that your batch ingestion is based on SequenceFileInputFormat, can you help add support for LzoThriftBlockInputFormat (this is implemented in project elephant-bird which is also quite popular)

@du00cs
Copy link
Contributor Author

du00cs commented Nov 10, 2016

@justsmoke I must admit that your protocol aware and taking thread safe into account is much better. We can do it together. Take multiple format into account should be supported.

@du00cs du00cs force-pushed the feature-thrift branch 3 times, most recently from bff869d to d2f43d8 Compare November 15, 2016 10:38
@du00cs
Copy link
Contributor Author

du00cs commented Nov 15, 2016

@justsmoke I have merged your work. Can you help me to test if LzoBlockThriftInputFormat is supported? I tried to set up the environment but failed.

@fjy
Copy link
Contributor

fjy commented Nov 16, 2016

@justsmoke @du00cs thanks, we can focus review on this PR

@du00cs
Copy link
Contributor Author

du00cs commented Nov 16, 2016

@justsmoke update with a change to avoid code style check failure. Thrift generator is platform dependent, I use scrooge-generator as a substitution to generate thrift files. You may need to change the thrift and scrooge version to match your case.

@justsmoke
Copy link

@du00cs ok, I will test the LzoBlockThriftInputFormat and give you feedback in two days

@justsmoke
Copy link

@du00cs When using LzoBlockThriftInputFormat, the thrift jar had to be included in classpath. This is different from other cases, and you may need to add some description in documentation. After the thrift jar is added to classpath, the thrift object is successfully parsed.

@du00cs
Copy link
Contributor Author

du00cs commented Nov 17, 2016

@justsmoke if the thrift jar is specified by tmpjars like

{
"jobProperties" : {
        "tmpjars": "/user/xxx/druid/test/book.jar"
      }
}

the job will succeed?

@justsmoke
Copy link

@du00cs No, using config: "elephantbird.class.for.MultiInputFormat" : "${YOUR_THRIFT_CLASS_NAME}" and include jar in classpath is the only way to get it work. (because thrift deserialize is done within the elephant-bird lib)

@du00cs
Copy link
Contributor Author

du00cs commented Nov 17, 2016

@justsmoke A little confusing

{
"jobProperties" : {
        "tmpjars": "/user/xxx/druid/test/book.jar",
        "elephantbird.class.for.MultiInputFormat" : "your.class.name"
      }
}

fails? I can not understand why. What's more, the path of tmpjars should be a HDFS path. Can you try it once more?

@@ -0,0 +1,107 @@
/**
* Copyright (c) 2016 XiaoMi Inc. All Rights Reserved.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this 'Copyright' is not appropriate here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's just a mistake.

@justsmoke
Copy link

@du00cs Because deserialization is done by elephant-bird (after calling get on thriftWritable object, we get a TBase directly), so the parameter tmpjars is actually not used.

1. thrift binary is platform dependent, use scrooge to generate java files to avoid style check failure
2. stream and hadoop ingesion are both supported, input format can be sequence file and lzo thrift block file.
3. base64 and protocol aware

change header
@fjy
Copy link
Contributor

fjy commented Dec 6, 2016

@du00cs @justsmoke there's a few merge conflicts. Do you guys feel this is ready?

@du00cs
Copy link
Contributor Author

du00cs commented Dec 7, 2016

@justsmoke Can you send me a sample file of LzoBlockThriftInputFormat (and also the thrfit file) ? I want to look more into it.

Copy link
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, seems good to go after conflicts are resolved.

| ----------- | ----------- | ---------------------------------------- | -------- |
| type | String | This should say `thrift` | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a Json parseSpec. | yes |
| thriftJar | String | path of thrift jar, if not provided, it will try to find the thrift class in classpath. Thrift jar in batch ingestion should be uploaded to HDFS first and configure `jobProperties` with `"tmpjars":"/path/to/your/thrift.jar"` | no |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

druid uploads the jars added to classpath of middlemanager to HDFS. can you test by adding the thrift jar to middlemanager classpath work for batch ingestion ?
In that case use may not need to add explicitly upload it to HDFS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding thrift jar to middle manager's class path will indeed work for batch ingestion, but an HDFS or some path is much more flexible. You don't have to put a jar into each nodes. Maybe pull from somewhere by http is the best. In this way, batch and realtime ways are unified. I should try to make it.

@fjy
Copy link
Contributor

fjy commented Dec 9, 2016

@du00cs there's a few conflicts

@fjy
Copy link
Contributor

fjy commented Dec 13, 2016

👍

@fjy fjy merged commit 469ab21 into apache:master Dec 13, 2016
@justsmoke
Copy link

@du00cs I am using this extension with tranquility, but get the following exception: java.lang.IllegalArgumentException: requirement failed: Expected ByteBufferInputRowParser, got[io.druid.data.input.thrift.ThriftInputRowParser]. It seems that tranquility only supports ByteBufferInputRowParser ?

@du00cs
Copy link
Contributor Author

du00cs commented Dec 16, 2016

@justsmoke You are right. I submit a pull request at Sep. [Fix] Remove data type and InputRowParser type bindings , may be some help.

dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017
* Thrift ingestion plugin

1. thrift binary is platform dependent, use scrooge to generate java files to avoid style check failure
2. stream and hadoop ingesion are both supported, input format can be sequence file and lzo thrift block file.
3. base64 and protocol aware

change header

* fix conlicts in pom
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants