
make the timeout publishing as a failure #4175

Closed

Conversation

firstway (Author) commented

In one Kafka ingestion task, if there is too much data (e.g., after resetting offsets for a datasource), it probably will NOT finish publishing within completionTimeout (default: 30 minutes).
In this case, I found that the task did put the data into HDFS (deep storage), but it lost the metadata because the KafkaIndexTask thread was interrupted. The interrupt is caught at line 538 (540), and that catch block does NOT rethrow the exception in the publishing-timeout case, so the task ultimately finishes as a SUCCESSFUL task even though the metadata for the segments it created is lost.
In this PR, the task determines whether it really published the metadata successfully, and throws an exception if it did NOT.
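
For readers skimming the thread, here is a minimal, self-contained sketch of the failure mode and of the check this PR adds. It is not the actual KafkaIndexTask code; apart from `publishedSuccessfully` (visible in the diff below), every name here is an illustrative assumption.

```java
// Illustrative sketch only, not the real KafkaIndexTask. It shows how an
// interrupt during publishing can be swallowed so the task still reports
// SUCCESS, and the kind of check this PR adds to fail loudly instead.
public class PublishTimeoutSketch
{
  // Set when the task is asked to stop (e.g. by the supervisor after
  // completionTimeout); assumed field, not the real implementation.
  private volatile boolean stopRequested;

  public String run() throws Exception
  {
    boolean publishedSuccessfully = false;
    try {
      publishSegmentsAndMetadata(); // may block past completionTimeout
      publishedSuccessfully = true;
    }
    catch (InterruptedException e) {
      // Pre-PR behavior: when the task was asked to stop, the interrupt is
      // swallowed here and control falls through to a SUCCESS return, even
      // though the segment metadata was never committed (only the segment
      // files reached deep storage).
      if (!stopRequested) {
        throw e;
      }
    }
    // The PR's addition: refuse to report SUCCESS when the metadata commit
    // did not actually complete.
    if (!publishedSuccessfully) {
      throw new IllegalStateException("Task stopped before publishing segment metadata");
    }
    return "SUCCESS";
  }

  private void publishSegmentsAndMetadata() throws InterruptedException
  {
    // Placeholder: push segments to deep storage, then commit their metadata.
  }
}
```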

nishantmonu51 added this to the 0.10.1 milestone Apr 14, 2017
gianm (Contributor) left a comment

Thanks for the contribution @firstway.

It seems like the problem you're having is that it's misleading for tasks that were stopped due to timing out to report a status of SUCCESS. But tasks can be stopped for other reasons too.

I'm wondering if it'd be better to deal with this at the supervisor level. Perhaps it could go straight to a hard kill of timed-out tasks, rather than an orderly shutdown. Then they'd be marked failed.

@dclim any thoughts?
@nishantmonu51 could you please review too, since you added the milestone :)

```diff
@@ -548,6 +550,11 @@ public String apply(DataSegment input)
   throw e;
 }

+if (!publishedSuccessfully){
```
gianm (Contributor) commented on this diff:

I think this will lead to improper behavior when a task is asked to stop for "normal" reasons (such as schema change). Maybe this is better dealt with at the supervisor level.

gianm requested review from dclim and nishantmonu51 Apr 25, 2017 03:44
pjain1 (Member) commented Apr 25, 2017

As far as I understand, currently KafkaSupervisor calls /stop on tasks that fail to complete within completionTimeout. So this is expected behavior: irrespective of whether the segment push to HDFS completes, the task status will be SUCCESS because the task is asked to stop and not actually killed.

Therefore, a correct fix would be to change KafkaSupervisor to actually kill the task after completionTimeout, which would report the task as FAILED. By the way, #4178 changes the behavior to kill instead of stop. Please correct me if I am wrong.
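
A hedged sketch of the supervisor-side fix described above, under the assumption that the supervisor tracks when each task began publishing; the types and method names are illustrative, not the real KafkaSupervisor API:

```java
import java.util.List;

// Sketch only: kill (rather than stop) tasks that exceed completionTimeout,
// so the overlord records them as FAILED. Illustrative names throughout.
class CompletionTimeoutSketch
{
  interface PublishingTask
  {
    long publishStartMillis();

    void stop(); // orderly shutdown: task reports SUCCESS

    void kill(); // forcible termination: task reported as FAILED
  }

  void checkCompletionTimeouts(List<PublishingTask> tasks, long completionTimeoutMillis)
  {
    long now = System.currentTimeMillis();
    for (PublishingTask task : tasks) {
      if (now - task.publishStartMillis() > completionTimeoutMillis) {
        // Killing instead of stopping makes the timeout visible as a
        // failure, which is the behavior change pjain1 says #4178 introduces.
        task.kill();
      }
    }
  }
}
```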

dclim (Contributor) commented Apr 26, 2017

As @gianm and @pjain1 mentioned, there are a number of reasons why a task might be asked to stop before or during publishing in normal operation, the most common being when you have replica tasks. If there are replica tasks, once one of them finishes publishing its segments, the remaining replicas are asked to stop; these tasks haven't failed per se even though they published nothing, and marking them as failed would not lead to a good experience.

Forcibly killing tasks after a timeout so they return a failure status seems like a good solution to me. In the future, having more detailed return/error codes from tasks would be even better, so that you could quickly determine why a task stopped running without having to dig through logs from the task or the overlord.
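
As a rough illustration of the richer completion codes suggested here (purely hypothetical; nothing like this exists in the code under discussion), the scenarios from this thread might map to something like:

```java
// Hypothetical completion reasons drawn from the scenarios discussed in this
// thread; not a real Druid enum.
enum TaskCompletionReason
{
  PUBLISHED,           // segments pushed and metadata committed
  STOPPED_AS_REPLICA,  // another replica finished publishing first
  STOPPED_BY_OPERATOR, // orderly /stop request (e.g. schema change)
  KILLED_ON_TIMEOUT    // exceeded completionTimeout and was killed
}
```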

gianm removed this from the 0.10.1 milestone May 23, 2017
pjain1 (Member) commented May 23, 2017

Will fix this as part of #4178

firstway (Author) commented

@pjain1 thanks

firstway closed this May 24, 2017
firstway deleted the timeout_publishing_as_failure branch May 24, 2017 03:25