-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Thrift support for realtime and batch ingestion #3418
Conversation
1c216f8
to
205becc
Compare
{ | ||
this.protocol = protocol; | ||
|
||
if (protocol == null || protocol.equals("compact")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use Jackson to deserialize the protocol into a TCompactProtocol or TBinaryProtocol? this iwll avoid the if/else check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any example? I am not familiar with Jackson and don't know how to make it after some search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fjy Sorry, but how to do that? Any tips?
@du00cs there's some overlap between this and https://github.com/druid-io/druid/pull/3633/files Can we come to an agreement around which version to use? |
@fjy Protocol aware is awesome and I missed thread safe problem of deserializer. Something should be checked more carefully like how and when a thrift class is added to classpath. Implement a TBase to Map is too complicated and error prone. I should argue that using JSONPath will make it much simpler and more expressive and be consistent with what JsonSpec can do. |
@du00cs I just had a look at your code, specify thrift jar in runtime is really good, and serialize to simple json can solve the problem of nested field, so I think your solution is better. Also, I noticed that your batch ingestion is based on SequenceFileInputFormat, can you help add support for LzoThriftBlockInputFormat (this is implemented in project elephant-bird which is also quite popular) |
@justsmoke I must admit that your protocol aware and taking thread safe into account is much better. We can do it together. Take multiple format into account should be supported. |
bff869d
to
d2f43d8
Compare
@justsmoke I have merged your work. Can you help me to test if LzoBlockThriftInputFormat is supported? I tried to set up the environment but failed. |
@justsmoke @du00cs thanks, we can focus review on this PR |
d2f43d8
to
5e3c4a7
Compare
@justsmoke update with a change to avoid code style check failure. Thrift generator is platform dependent, I use scrooge-generator as a substitution to generate thrift files. You may need to change the thrift and scrooge version to match your case. |
5e3c4a7
to
ace0abe
Compare
@du00cs ok, I will test the LzoBlockThriftInputFormat and give you feedback in two days |
@du00cs When using LzoBlockThriftInputFormat, the thrift jar had to be included in classpath. This is different from other cases, and you may need to add some description in documentation. After the thrift jar is added to classpath, the thrift object is successfully parsed. |
@justsmoke if the thrift jar is specified by {
"jobProperties" : {
"tmpjars": "/user/xxx/druid/test/book.jar"
}
} the job will succeed? |
@du00cs No, using config: "elephantbird.class.for.MultiInputFormat" : "${YOUR_THRIFT_CLASS_NAME}" and include jar in classpath is the only way to get it work. (because thrift deserialize is done within the elephant-bird lib) |
@justsmoke A little confusing {
"jobProperties" : {
"tmpjars": "/user/xxx/druid/test/book.jar",
"elephantbird.class.for.MultiInputFormat" : "your.class.name"
}
} fails? I can not understand why. What's more, the path of |
@@ -0,0 +1,107 @@ | |||
/** | |||
* Copyright (c) 2016 XiaoMi Inc. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this 'Copyright' is not appropriate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's just a mistake.
@du00cs Because deserialization is done by elephant-bird (after calling get on thriftWritable object, we get a TBase directly), so the parameter tmpjars is actually not used. |
1. thrift binary is platform dependent, use scrooge to generate java files to avoid style check failure 2. stream and hadoop ingesion are both supported, input format can be sequence file and lzo thrift block file. 3. base64 and protocol aware change header
8b1b2bb
to
ae3e799
Compare
@du00cs @justsmoke there's a few merge conflicts. Do you guys feel this is ready? |
@justsmoke Can you send me a sample file of LzoBlockThriftInputFormat (and also the thrfit file) ? I want to look more into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, seems good to go after conflicts are resolved.
| ----------- | ----------- | ---------------------------------------- | -------- | | ||
| type | String | This should say `thrift` | yes | | ||
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a Json parseSpec. | yes | | ||
| thriftJar | String | path of thrift jar, if not provided, it will try to find the thrift class in classpath. Thrift jar in batch ingestion should be uploaded to HDFS first and configure `jobProperties` with `"tmpjars":"/path/to/your/thrift.jar"` | no | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
druid uploads the jars added to classpath of middlemanager to HDFS. can you test by adding the thrift jar to middlemanager classpath work for batch ingestion ?
In that case use may not need to add explicitly upload it to HDFS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding thrift jar to middle manager's class path will indeed work for batch ingestion, but an HDFS or some path is much more flexible. You don't have to put a jar into each nodes. Maybe pull from somewhere by http is the best. In this way, batch and realtime ways are unified. I should try to make it.
@du00cs there's a few conflicts |
merge & fix pom conflicts
👍 |
@du00cs I am using this extension with tranquility, but get the following exception: java.lang.IllegalArgumentException: requirement failed: Expected ByteBufferInputRowParser, got[io.druid.data.input.thrift.ThriftInputRowParser]. It seems that tranquility only supports ByteBufferInputRowParser ? |
@justsmoke You are right. I submit a pull request at Sep. [Fix] Remove data type and InputRowParser type bindings , may be some help. |
* Thrift ingestion plugin 1. thrift binary is platform dependent, use scrooge to generate java files to avoid style check failure 2. stream and hadoop ingesion are both supported, input format can be sequence file and lzo thrift block file. 3. base64 and protocol aware change header * fix conlicts in pom
In our company thrift is used heavily, thrift serialized objects are stored to HDFS as SequenceFile and bytes to Kafka. I try to make it easier for thrift data to be ingested into druid without ETL in simple case.
Thanks to the flexible JSONSpec, the Implementation is rather simple, deserialize thrift bytes to thrift object and then serialize it to json. And then parse it with JSONSpec.
Is anybody using thrift too? Hope this useful for somebody like me.