
Auto merging segments created by the Kafka indexing service #4498

Closed
erankor opened this issue Jul 3, 2017 · 10 comments · Fixed by #4815

Comments

@erankor

erankor commented Jul 3, 2017

Hi all,

We're using Druid with the Kafka indexing service and would like to set up a process for merging the small segments generated for each Kafka partition. As far as I can see, the recommended way of doing this (based on the documentation) is to use Hadoop, but that sounds like added complexity and I'd rather avoid it if possible.
From what I understand (and please correct me if I'm wrong), pull request #3611 added support for merging sharded segments to the basic IndexTask. So, it should theoretically be possible to merge the Kafka indexing segments without Hadoop.
However, enabling druid.coordinator.merge.on doesn't work, since it looks only for segments that use NoneShard (https://github.com/druid-io/druid/blob/b77fab8a30eaaaba4e9f5f87a21a8031e3a20f66/server/src/main/java/io/druid/server/coordinator/helper/DruidCoordinatorSegmentMerger.java#L245)
Can this condition be removed following the merge of the pull I referenced above?
Is there a better way to auto-merge segments created by Kafka Indexing service?

Thank you!

Eran

@jihoonson
Contributor

Hi Eran,

Regarding the druid.coordinator.merge.on option, you're correct. The coordinator internally runs merge tasks, not native index tasks, and merge tasks have the limitation that they work only for segments with none shardSpecs.

Unfortunately, the only solution currently possible with Druid's native index task is to manually set up a compaction workflow using a workflow tool like Oozie. For example, you can schedule a compaction task to run every midnight to merge the segments ingested during that day.
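As a concrete sketch of such a scheduled workflow (the host, port, and spec file path below are hypothetical placeholders; `/druid/indexer/v1/task` is the Overlord's task submission endpoint), a crontab entry could submit a prepared compaction task spec every midnight:

```
# Hypothetical crontab entry: POST a prepared compaction task spec to the Overlord at 00:00 daily.
0 0 * * * curl -s -X POST -H 'Content-Type: application/json' -d @/etc/druid/compaction-task.json http://overlord.example.com:8090/druid/indexer/v1/task
```

The spec file itself would be a native index task covering the previous day's interval, regenerated by whatever script drives the workflow.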

FYI, there are two related proposals: automatic background compaction (#4479) and improved segment generation in Kafka indexing tasks (#4178).

@erankor
Author

erankor commented Jul 3, 2017

@jihoonson, thank you for the super fast reply :)
A couple of follow-up questions:

  1. What is the difference between a 'merge task' and an 'index task'?
  2. If I write some script (external to Druid) that looks at the segment layout and creates 'index tasks' to merge segments from the same time interval, is that expected to work OK, or would you still recommend Hadoop?
  3. Both features you referred to sound great, but they also sound like a lot of work to me. Wouldn't it be much easier to just change the coordinator to use index tasks instead of 'merge tasks'? Or is there some reason that can't work?

Thanks again!

Eran

@jihoonson
Contributor

@erankor

here are my answers.

  1. The index task is a general task type that can ingest data from any source, including Druid's other dataSources. So, using index tasks, you can add new dataSources by ingesting external data into Druid, or reindex existing dataSources (http://druid.io/docs/latest/ingestion/update-existing-data.html). The latter can be used effectively for compaction, because index tasks can automatically determine a more optimized shardSpec and segment size. Merge tasks, meanwhile, can only merge existing segments that have a none shardSpec.
  2. I recommend using the Hadoop index task for now, because the native index task has some limitations, like the potential out-of-disk problem (Early publishing segments in IndexTask #4227) and slow ingestion due to its single-threaded processing model. This will be resolved in 0.11.0.
  3. Unfortunately, that's not possible because of a limitation of our current segment versioning system. If you run a compaction task and a Kafka index task at the same time, one of their resulting segment sets will be overshadowed and eventually removed. Please refer to [Proposal] Background compaction #4479 for more details.
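The overshadowing behavior described in item 3 can be illustrated with a toy model (this is not Druid code; the interval, version strings, and segment names are made up): within a given interval, only segments carrying the highest version remain visible, so whichever concurrent task publishes the lower version effectively loses its data.

```python
# Toy illustration of version-based overshadowing (not Druid code).
from collections import defaultdict

def visible_segments(segments):
    """segments: list of (interval, version, segment_id) tuples.
    Returns the segment_ids that remain visible: for each interval,
    only segments with the highest version survive."""
    by_interval = defaultdict(list)
    for interval, version, seg_id in segments:
        by_interval[interval].append((version, seg_id))
    visible = []
    for entries in by_interval.values():
        latest = max(version for version, _ in entries)
        visible.extend(seg_id for version, seg_id in entries if version == latest)
    return visible

# A Kafka task publishes two segments at version v1; a concurrent
# compaction task publishes one merged segment at version v2.
segments = [
    ("2017-07-03/2017-07-04", "v1", "kafka_part0"),
    ("2017-07-03/2017-07-04", "v1", "kafka_part1"),
    ("2017-07-03/2017-07-04", "v2", "compacted"),
]
print(visible_segments(segments))  # the v1 segments are overshadowed
```

The Kafka task's v1 segments are overshadowed by the compacted v2 segment and would eventually be removed, even though they may contain rows the compaction never saw.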

@erankor
Author

erankor commented Jul 4, 2017

@jihoonson, thank you very much for your support! Hadoop it is then... :)

@erankor erankor closed this as completed Jul 4, 2017
@jihoonson
Contributor

@erankor cool! Feel free to further ask if you have more questions.

@l15k4

l15k4 commented Oct 30, 2017

Hey, we use Hadoop index tasks only. Are these compatible with druid.coordinator.merge.on?
Do these segments use the NoneShard spec?

Or is it the other way around, and only index tasks produce segments with the NoneShard spec?

@jihoonson
Contributor

Hi @l15k4, the Hadoop index task does not use the NoneShard spec unless it is forced to, which is not recommended.

The more recommended way is to use the IndexTask (http://druid.io/docs/latest/ingestion/tasks.html#index-task) with the IngestSegmentFirehose (http://druid.io/docs/latest/ingestion/firehose.html#ingestsegmentfirehose). This works for any type of shardSpec.
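As a rough sketch of that combination (abbreviated and hypothetical: the dataSource name and interval are placeholders, and a complete spec also needs a parser, metricsSpec, and tuningConfig), a reindexing task spec looks roughly like:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2017-07-03/2017-07-04"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "my_datasource",
        "interval": "2017-07-03/2017-07-04"
      }
    }
  }
}
```

The ingestSegment firehose reads the existing segments for the given interval back in, and the index task republishes them as fewer, larger segments with a freshly chosen shardSpec.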

I'm currently working on a new compaction task (#4985), which is a simplified version of indexTask + ingestSegmentFirehose. I expect it will be included in 0.11.1.

@l15k4

l15k4 commented Nov 6, 2017

@jihoonson I checked the descriptor.json in S3 deep storage and there is ShardSpec=none, so I probably forced it without knowing.

We're trying to become S3-independent and at the same time leverage druid.coordinator.merge.on, so I'm thinking about using Tranquility. Can segments produced by Tranquility be automatically merged?

@licl2014
Contributor

@l15k4

Can segments produced by tranquility be automatically merged?

Same as with the Kafka indexing service.

@jihoonson
Contributor

@l15k4 @licl2014, I just saw these comments. Sorry for the late response.

NoneShardSpec is not appropriate for every kind of appending. Once a segment with NoneShardSpec is generated, no more segments can be added to the same interval.

Tranquility may be fine because it rejects late data, but then you can't add more data to the same interval with any other task type, including native or Hadoop batch tasks.
