
Auto merging segments created by the Kafka indexing service #4498

Closed
erankor opened this issue Jul 3, 2017 · 10 comments · Fixed by #4815

Comments

@erankor

erankor commented Jul 3, 2017

Hi all,

We're using Druid with the Kafka indexing service and would like to set up a process for merging the small segments generated for each Kafka partition. As far as I can see, the recommended way of doing this (based on the documentation) is to use Hadoop, but that sounds like added complexity and I'd rather avoid it if possible.
From what I understand (and please correct me if I'm wrong), pull request #3611 added support for merging sharded segments to the basic IndexTask. So, it should theoretically be possible to merge the Kafka indexing segments without Hadoop.
However, enabling druid.coordinator.merge.on doesn't work, since it looks only for segments that use NoneShard (https://github.com/druid-io/druid/blob/b77fab8a30eaaaba4e9f5f87a21a8031e3a20f66/server/src/main/java/io/druid/server/coordinator/helper/DruidCoordinatorSegmentMerger.java#L245)
Can this condition be removed following the merge of the pull I referenced above?
Is there a better way to auto-merge segments created by Kafka Indexing service?

Thank you!

Eran

@jihoonson
Contributor

Hi Eran,

Regarding the druid.coordinator.merge.on option, you're correct. The coordinator internally runs merge tasks, not native index tasks, and merge tasks have the limitation that they work only for segments with none shardSpecs.

Unfortunately, the only solution currently possible with Druid's native index task is to manually set up a compaction workflow using a workflow tool like Oozie. For example, you can schedule a compaction task to run every midnight to merge the segments ingested during that day.
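As a concrete sketch of such a scheduled workflow (the host, port, and spec file path below are hypothetical placeholders; `/druid/indexer/v1/task` is the Overlord's task submission endpoint), a crontab entry could submit a prepared compaction task spec every midnight:

```
# Hypothetical crontab entry: POST a prepared compaction task spec to the Overlord at 00:00 daily.
0 0 * * * curl -s -X POST -H 'Content-Type: application/json' -d @/etc/druid/compaction-task.json http://overlord.example.com:8090/druid/indexer/v1/task
```

The spec file itself would be a native index task covering the previous day's interval, regenerated by whatever script drives the workflow.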

FYI, there are two related proposals: automatic background compaction (#4479) and improved segment generation in Kafka indexing tasks (#4178).

@erankor
Author

erankor commented Jul 3, 2017

@jihoonson, thank you for the super fast reply :)
A couple of follow-up questions:

  1. What is the difference between a 'merge task' and an 'index task'?
  2. If I write some script (external to Druid) that looks at the segment layout and creates 'index tasks' to merge segments from the same time interval, is that expected to work OK, or would you still recommend Hadoop?
  3. Both features you referred to sound great, but they also sound like a lot of work to me. Wouldn't it be much easier to just change the coordinator to use index tasks instead of 'merge tasks'? Or is there some reason that can't work?

Thanks again!

Eran

@jihoonson
Contributor

@erankor

here are my answers.

  1. The index task is a general task type that can ingest data from any source, including Druid's other dataSources. So, using index tasks, you can add new dataSources by ingesting external data into Druid, or reindex existing dataSources (http://druid.io/docs/latest/ingestion/update-existing-data.html). The latter can be used effectively for compaction, because index tasks can automatically determine a more optimized shardSpec and segment size. Merge tasks, meanwhile, can only merge existing segments that have a none shardSpec.
  2. I recommend using the Hadoop index task for now, because the native index task has some limitations, like the potential out-of-disk problem (Early publishing segments in IndexTask #4227) and slow ingestion due to its single-threaded processing model. This will be resolved in 0.11.0.
  3. Unfortunately, that's not possible because of a limitation of our current segment versioning system. If you run a compaction task and a Kafka index task at the same time, one of their resulting segment sets will be overshadowed and eventually removed. Please refer to [Proposal] Background compaction #4479 for more details.
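The overshadowing behavior described in item 3 can be illustrated with a toy model (this is not Druid code; the interval, version strings, and segment names are made up): within a given interval, only segments carrying the highest version remain visible, so whichever concurrent task publishes the lower version effectively loses its data.

```python
# Toy illustration of version-based overshadowing (not Druid code).
from collections import defaultdict

def visible_segments(segments):
    """segments: list of (interval, version, segment_id) tuples.
    Returns the segment_ids that remain visible: for each interval,
    only segments with the highest version survive."""
    by_interval = defaultdict(list)
    for interval, version, seg_id in segments:
        by_interval[interval].append((version, seg_id))
    visible = []
    for entries in by_interval.values():
        latest = max(version for version, _ in entries)
        visible.extend(seg_id for version, seg_id in entries if version == latest)
    return visible

# A Kafka task publishes two segments at version v1; a concurrent
# compaction task publishes one merged segment at version v2.
segments = [
    ("2017-07-03/2017-07-04", "v1", "kafka_part0"),
    ("2017-07-03/2017-07-04", "v1", "kafka_part1"),
    ("2017-07-03/2017-07-04", "v2", "compacted"),
]
print(visible_segments(segments))  # the v1 segments are overshadowed
```

The Kafka task's v1 segments are overshadowed by the compacted v2 segment and would eventually be removed, even though they may contain rows the compaction never saw.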

@erankor
Author

erankor commented Jul 4, 2017

@jihoonson, thank you very much for your support! Hadoop it is then... :)

@erankor erankor closed this as completed Jul 4, 2017
@jihoonson
Contributor

@erankor cool! Feel free to further ask if you have more questions.

@l15k4

l15k4 commented Oct 30, 2017

Hey, we use Hadoop index tasks only. Are these compatible with druid.coordinator.merge.on?
Do these segments use the NoneShard spec?

Or is it the other way around, and only index tasks produce segments with the NoneShard spec?

@jihoonson
Contributor

Hi @l15k4, the Hadoop index task does not use the NoneShard spec unless it is forced to, which is not recommended.

The more recommended way is to use the IndexTask (http://druid.io/docs/latest/ingestion/tasks.html#index-task) with the IngestSegmentFirehose (http://druid.io/docs/latest/ingestion/firehose.html#ingestsegmentfirehose). This works for any type of shardSpec.
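As a rough sketch of that combination (abbreviated and hypothetical: the dataSource name and interval are placeholders, and a complete spec also needs a parser, metricsSpec, and tuningConfig), a reindexing task spec looks roughly like:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2017-07-03/2017-07-04"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "my_datasource",
        "interval": "2017-07-03/2017-07-04"
      }
    }
  }
}
```

The ingestSegment firehose reads the existing segments for the given interval back in, and the index task republishes them as fewer, larger segments with a freshly chosen shardSpec.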

I'm currently working on a new compaction task (#4985), which is a simplified version of indexTask + ingestSegmentFirehose. I expect it will be included in 0.11.1.

@l15k4

l15k4 commented Nov 6, 2017

@jihoonson I checked the descriptor.json in S3 deep storage and there is ShardSpec=none, so I probably forced it without knowing.

We're trying to become S3-independent and at the same time leverage druid.coordinator.merge.on, so I'm thinking about using Tranquility. Can segments produced by Tranquility be automatically merged?

@licl2014
Contributor

@l15k4

Can segments produced by tranquility be automatically merged?

Same as with the Kafka indexing service.

@jihoonson
Contributor

@l15k4 @licl2014, I just saw these comments. Sorry for the late response.

NoneShardSpec is not appropriate for every kind of appending. Once a segment with NoneShardSpec is generated, no more segments can be added to the same interval.

Tranquility may be fine because it rejects late data, but then you can't add more data to the same interval with any other task type, including native or Hadoop batch tasks.
