
make the timeout publishing as a failure #4175

Closed

Conversation

firstway (Author) commented

In one Kafka ingestion task, if there is too much data (e.g., after resetting offsets for a datasource), it probably will NOT finish publishing within completionTimeout (default: 30 minutes).
In this case, I found that the task did put the data into HDFS (deep storage), but it lost the metadata because the KafkaIndexTask thread was interrupted. The interrupt is caught at line 538 (540), and that catch block does NOT rethrow the exception in the publishing-timeout case, so the task ultimately finishes as a SUCCESSFUL task even though the metadata for the segments it created is lost.
In this PR, the task determines whether it really published the metadata successfully, and throws an exception if it did NOT.
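
For readers skimming the thread, here is a minimal, self-contained sketch of the failure mode and of the check this PR adds. It is not the actual KafkaIndexTask code; apart from `publishedSuccessfully` (visible in the diff below), every name here is an illustrative assumption.

```java
// Illustrative sketch only, not the real KafkaIndexTask. It shows how an
// interrupt during publishing can be swallowed so the task still reports
// SUCCESS, and the kind of check this PR adds to fail loudly instead.
public class PublishTimeoutSketch
{
  // Set when the task is asked to stop (e.g. by the supervisor after
  // completionTimeout); assumed field, not the real implementation.
  private volatile boolean stopRequested;

  public String run() throws Exception
  {
    boolean publishedSuccessfully = false;
    try {
      publishSegmentsAndMetadata(); // may block past completionTimeout
      publishedSuccessfully = true;
    }
    catch (InterruptedException e) {
      // Pre-PR behavior: when the task was asked to stop, the interrupt is
      // swallowed here and control falls through to a SUCCESS return, even
      // though the segment metadata was never committed (only the segment
      // files reached deep storage).
      if (!stopRequested) {
        throw e;
      }
    }
    // The PR's addition: refuse to report SUCCESS when the metadata commit
    // did not actually complete.
    if (!publishedSuccessfully) {
      throw new IllegalStateException("Task stopped before publishing segment metadata");
    }
    return "SUCCESS";
  }

  private void publishSegmentsAndMetadata() throws InterruptedException
  {
    // Placeholder: push segments to deep storage, then commit their metadata.
  }
}
```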

nishantmonu51 added this to the 0.10.1 milestone Apr 14, 2017
gianm (Contributor) left a comment

Thanks for the contribution @firstway.

It seems like the problem you're having is that it's misleading for tasks that were stopped due to timing out to report a status of SUCCESS. But tasks can be stopped for other reasons too.

I'm wondering if it'd be better to deal with this at the supervisor level. Perhaps it could go straight to a hard kill of timed-out tasks, rather than an orderly shutdown. Then they'd be marked failed.

@dclim any thoughts?
@nishantmonu51 could you please review too, since you added the milestone :)

```diff
@@ -548,6 +550,11 @@ public String apply(DataSegment input)
   throw e;
 }

+if (!publishedSuccessfully){
```
gianm (Contributor) commented on this diff:

I think this will lead to improper behavior when a task is asked to stop for "normal" reasons (such as schema change). Maybe this is better dealt with at the supervisor level.

gianm requested review from dclim and nishantmonu51 Apr 25, 2017 03:44
pjain1 (Member) commented Apr 25, 2017

As far as I understand, currently KafkaSupervisor calls /stop on tasks that fail to complete within completionTimeout. So this is expected behavior: irrespective of whether the segment push to HDFS completes, the task status will be SUCCESS because the task is asked to stop and not actually killed.

Therefore, a correct fix would be to change KafkaSupervisor to actually kill the task after completionTimeout, which would report the task as FAILED. By the way, #4178 changes the behavior to kill instead of stop. Please correct me if I am wrong.
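
A hedged sketch of the supervisor-side fix described above, under the assumption that the supervisor tracks when each task began publishing; the types and method names are illustrative, not the real KafkaSupervisor API:

```java
import java.util.List;

// Sketch only: kill (rather than stop) tasks that exceed completionTimeout,
// so the overlord records them as FAILED. Illustrative names throughout.
class CompletionTimeoutSketch
{
  interface PublishingTask
  {
    long publishStartMillis();

    void stop(); // orderly shutdown: task reports SUCCESS

    void kill(); // forcible termination: task reported as FAILED
  }

  void checkCompletionTimeouts(List<PublishingTask> tasks, long completionTimeoutMillis)
  {
    long now = System.currentTimeMillis();
    for (PublishingTask task : tasks) {
      if (now - task.publishStartMillis() > completionTimeoutMillis) {
        // Killing instead of stopping makes the timeout visible as a
        // failure, which is the behavior change pjain1 says #4178 introduces.
        task.kill();
      }
    }
  }
}
```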

dclim (Contributor) commented Apr 26, 2017

As @gianm and @pjain1 mentioned, there are a number of reasons why a task might be asked to stop before or during publishing in normal operation, the most common being when you have replica tasks. If there are replica tasks, once one of them finishes publishing its segments, the remaining replicas are asked to stop; these tasks haven't failed per se even though they published nothing, and marking them as failed would not lead to a good experience.

Forcibly killing tasks after a timeout so they return a failure status seems like a good solution to me. In the future, having more detailed return/error codes from tasks would be even better, so that you could quickly determine why a task stopped running without having to dig through logs from the task or the overlord.
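
As a rough illustration of the richer completion codes suggested here (purely hypothetical; nothing like this exists in the code under discussion), the scenarios from this thread might map to something like:

```java
// Hypothetical completion reasons drawn from the scenarios discussed in this
// thread; not a real Druid enum.
enum TaskCompletionReason
{
  PUBLISHED,           // segments pushed and metadata committed
  STOPPED_AS_REPLICA,  // another replica finished publishing first
  STOPPED_BY_OPERATOR, // orderly /stop request (e.g. schema change)
  KILLED_ON_TIMEOUT    // exceeded completionTimeout and was killed
}
```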

gianm removed this from the 0.10.1 milestone May 23, 2017
pjain1 (Member) commented May 23, 2017

Will fix this as part of #4178

firstway (Author) commented

@pjain1 thanks

firstway closed this May 24, 2017
firstway deleted the timeout_publishing_as_failure branch May 24, 2017 03:25