make the timeout publishing as a failure #4175
Conversation
Thanks for the contribution @firstway.
It seems like the problem you're having is that it's misleading when tasks are stopped due to timing out, and yet their status is SUCCESS. But they could get stopped for other reasons.
I'm wondering if it'd be better to deal with this at the supervisor level. Perhaps it could go straight to a hard kill of timed-out tasks, rather than an orderly shutdown. Then they'd be marked failed.
@dclim any thoughts?
@nishantmonu51 could you please review too, since you added the milestone :)
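For illustration only, here is a minimal sketch of the supervisor-level approach suggested above: hard-kill tasks whose publishing has exceeded the completion timeout rather than asking them for an orderly shutdown. The `TaskController` and `TimeoutEnforcer` names are hypothetical stand-ins, not Druid's actual supervisor API.

```java
// Illustrative only: hypothetical interfaces, not Druid's actual supervisor API.
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

interface TaskController
{
  void stopGracefully(String taskId); // orderly shutdown: task can still finish as SUCCESS
  void kill(String taskId);           // hard kill: task is marked FAILED
}

class TimeoutEnforcer
{
  private final TaskController controller;
  private final Duration completionTimeout;

  TimeoutEnforcer(TaskController controller, Duration completionTimeout)
  {
    this.controller = controller;
    this.completionTimeout = completionTimeout;
  }

  /** Called periodically with each publishing task and the time it started publishing. */
  void checkPublishingTasks(Map<String, Instant> publishingSince)
  {
    final Instant now = Instant.now();
    for (Map.Entry<String, Instant> entry : publishingSince.entrySet()) {
      if (Duration.between(entry.getValue(), now).compareTo(completionTimeout) > 0) {
        // Go straight to a hard kill so the task ends up FAILED,
        // instead of an orderly shutdown that would report SUCCESS.
        controller.kill(entry.getKey());
      }
      // Otherwise the task is still within the allowed window; leave it running.
    }
  }
}
```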
```diff
@@ -548,6 +550,11 @@ public String apply(DataSegment input)
        throw e;
      }

      if (!publishedSuccessfully) {
```
I think this will lead to improper behavior when a task is asked to stop for "normal" reasons (such as schema change). Maybe this is better dealt with at the supervisor level.
As far as I understand, KafkaSupervisor currently asks the task to stop gracefully once the timeout elapses, which is why the task still exits with SUCCESS. Therefore, a correct fix would be to change KafkaSupervisor to actually kill the task after the timeout.
As @gianm and @pjain1 mentioned, there are a number of reasons why a task would be asked to stop before or during publishing in normal operation, the most common being the case where you have replica tasks. If there are replica tasks, once one of them finishes publishing its segments, the remaining replicas are asked to stop. These tasks haven't failed per se even though nothing was published, and marking them as failed would not lead to a good experience. Forcibly killing a task after a timeout so that it returns a failure status seems like a good solution to me. In the future, having more detailed return/error codes from tasks would be even better, so that you could quickly determine why a task stopped running without having to dig through the task or overlord logs.
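As a rough illustration of the more detailed return codes mentioned above, something like a completion-reason enum would let an operator see at a glance why a task stopped. The types below are hypothetical, not Druid's actual TaskStatus.

```java
// Illustrative only: a hypothetical status object, not Druid's actual TaskStatus.
enum CompletionReason
{
  PUBLISHED,            // segments and metadata were committed
  STOPPED_AS_REPLICA,   // another replica already published; nothing left to do
  COMPLETION_TIMEOUT,   // killed because publishing exceeded completionTimeout
  ERROR                 // any other failure
}

final class DetailedTaskStatus
{
  private final String taskId;
  private final boolean success;
  private final CompletionReason reason;

  DetailedTaskStatus(String taskId, boolean success, CompletionReason reason)
  {
    this.taskId = taskId;
    this.success = success;
    this.reason = reason;
  }

  @Override
  public String toString()
  {
    // e.g. "index_kafka_xyz: FAILED (COMPLETION_TIMEOUT)" -- enough to see why a
    // task stopped without digging through task or overlord logs.
    return taskId + ": " + (success ? "SUCCESS" : "FAILED") + " (" + reason + ")";
  }
}
```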
Will fix this as part of #4178
@pjain1 thanks
In a Kafka ingestion task, if there is too much data (e.g. after resetting offsets for a datasource), the task probably will NOT finish publishing within the completionTimeout (default: 30 minutes). In this case, I found that the task did put the data into HDFS (deep storage), but it lost the metadata because the KafkaIndexTask thread was interrupted. The interrupt is caught at line 538 (540), and that catch block does NOT rethrow the exception in the publishing-timeout case, so the task finishes as SUCCESSFUL even though the metadata for the segments it created is lost.
In this PR, the task determines whether it really published that metadata successfully, and throws an exception if it did NOT.
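A minimal sketch of the intended behavior follows. It is simplified, with stand-in names such as `awaitPublishOrFail` and `publishFuture` rather than the actual KafkaIndexTask code: wait for the publish, and refuse to report SUCCESS when the metadata was never committed.

```java
// Illustrative only: simplified shape of the fix, not the actual KafkaIndexTask code.
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class PublishGuard
{
  /**
   * Waits for the segment-metadata publish and refuses to let the task report
   * SUCCESS if the publish never completed (e.g. the thread was interrupted
   * because completionTimeout elapsed).
   */
  static void awaitPublishOrFail(Future<Boolean> publishFuture) throws Exception
  {
    boolean publishedSuccessfully = false;
    try {
      publishedSuccessfully = publishFuture.get();
    }
    catch (InterruptedException e) {
      // Before the fix, a stop request would land here and the task would still
      // return SUCCESS even though no metadata was committed.
      Thread.currentThread().interrupt();
    }
    catch (ExecutionException e) {
      throw e;
    }

    if (!publishedSuccessfully) {
      // Surface the problem so the task status is FAILED instead of SUCCESS.
      throw new IllegalStateException("Segment metadata was not published before the task was stopped");
    }
  }
}
```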