
Kafka indexing service: Better task timeout handling #4693

Closed
gianm opened this issue Aug 16, 2017 · 4 comments

gianm commented Aug 16, 2017

Currently, completionTimeout-based timeouts can cause problems: if the timeout is too short, tasks are killed in the middle of publishing, only to be restarted from the beginning in an eternal Sisyphean struggle. Also, the task's status is SUCCESS when this happens, which is confusing.

I'd suggest the following different behavior.

  • Set the default completionTimeout somewhat longer (right now it's 30 minutes). If it's too long then there might be too many tasks running at once, but I think we could afford 90 minutes (see the example spec at the end of this comment).
  • Mark tasks FAILED if they time out before publishing (upload + commit to metadata store).
  • Mark tasks SUCCESS if they time out after publishing, but before historical handoff (see the sketch after this list).
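
A minimal sketch of that status decision, assuming a simple linear phase ordering (the enum and names here are illustrative, not Druid's actual task lifecycle API):

 // Illustrative only: before publish the work would be redone from scratch,
 // so report FAILED; after publish the segments are durable, so report
 // SUCCESS even if historical handoff hasn't completed yet.
 public class CompletionTimeoutSketch
 {
   enum Phase { READING, PUBLISHING, PUBLISHED, HANDED_OFF }

   static String statusOnTimeout(Phase phase)
   {
     return phase.ordinal() < Phase.PUBLISHED.ordinal() ? "FAILED" : "SUCCESS";
   }

   public static void main(String[] args)
   {
     System.out.println(statusOnTimeout(Phase.PUBLISHING)); // FAILED
     System.out.println(statusOnTimeout(Phase.PUBLISHED));  // SUCCESS
   }
 }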

May be partially obsoleted by #4178.
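
For concreteness, completionTimeout is set in the supervisor spec's ioConfig. A minimal sketch with the proposed 90-minute value (topic and consumerProperties are placeholders; other required fields like dataSchema are elided):

 {
   "type": "kafka",
   "dataSchema": { ... },
   "ioConfig": {
     "topic": "your_topic",
     "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
     "taskDuration": "PT1H",
     "completionTimeout": "PT90M"
   }
 }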

@gianm gianm added this to the 0.11.0 milestone Aug 16, 2017
@gianm gianm changed the title Kafka indexing service: Task timeout handling Kafka indexing service: Better task timeout handling Aug 16, 2017

jkogut commented Sep 10, 2017

Is this somehow related to KIS tasks that end up with status FAILED and duration -1? Their own logs look fine, but on the overlord side we can see an issue like this:

 17:29:46,544 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT4S]
 17:29:50,570 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT8S]
 17:29:58,590 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT10S]
 17:30:08,616 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT10S]
 17:30:18,642 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT10S]
 17:30:28,668 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT10S]
 17:30:38,694 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Still waiting for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] to pause; will try again in [PT10S]
 17:30:48,720 ERROR i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_dataSource_1_events-7] Task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] failed to pause, aborting
 17:30:48,720 WARN i.d.i.k.s.KafkaSupervisor [KafkaSupervisor-prod_dataSource_1_events-Worker-1] Task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] failed to respond to [pause] in a timely manner, killing task
 17:30:48,723 INFO i.d.i.o.RemoteTaskRunner [KafkaSupervisor-prod_dataSource_1_events-Worker-1] Sent shutdown message to worker: druid-mm02.deep.bi:9088, status 200 OK, response: {"task":"index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm"}
 17:30:48,723 INFO i.d.i.o.TaskLockbox [KafkaSupervisor-prod_dataSource_1_events-Worker-1] Removing task[index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] from activeTasks
 17:30:48,723 INFO i.d.i.o.TaskLockbox [KafkaSupervisor-prod_dataSource_1_events-Worker-1] Removing task[index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] from TaskLock[index_kafka_prod_dataSource_1_events]
 17:30:48,906 INFO i.d.i.o.MetadataTaskStorage [KafkaSupervisor-prod_dataSource_1_events-Worker-1] Updating task index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm to status: TaskStatus{id=index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm, status=FAILED, duration=-1}
 17:30:48,965 INFO i.d.i.o.RemoteTaskRunner [Curator-PathChildrenCache-1] Worker[druid-mm02.deep.bi:9088] wrote FAILED status for task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] on [TaskLocation{host='druid-mm02.deep.bi', port=8103}]
 17:30:48,965 INFO i.d.i.o.RemoteTaskRunner [Curator-PathChildrenCache-1] Worker[druid-mm02.deep.bi:9088] completed task[index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] with status[FAILED]
 17:30:48,965 INFO i.d.i.o.TaskQueue [Curator-PathChildrenCache-1] Received FAILED status for task: index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm
 17:30:49,098 INFO i.d.i.o.TaskQueue [KafkaSupervisor-prod_dataSource_1_events-Worker-1] Task done: KafkaIndexTask{id=index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm, type=index_kafka, dataSource=prod_dataSource_1_events}
 17:30:49,098 INFO i.d.i.o.RemoteTaskRunner [Curator-PathChildrenCache-1] Cleaning up task[index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] on worker[druid-mm02.deep.bi:9088]
 17:30:49,101 WARN i.d.i.o.TaskQueue [Curator-PathChildrenCache-1] Unknown task completed: index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm
 17:30:49,101 INFO i.d.i.o.TaskQueue [Curator-PathChildrenCache-1] Task FAILED: KafkaIndexTask{id=index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm, type=index_kafka, dataSource=prod_dataSource_1_events} (1270654 run duration)
 17:30:49,101 INFO i.d.i.o.TaskRunnerUtils [Curator-PathChildrenCache-1] Task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] status changed to [FAILED].
 17:30:49,101 INFO i.d.i.o.RemoteTaskRunner [Curator-PathChildrenCache-1] Task[index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm] went bye bye.

io.druid.java.util.common.ISE: Unable to grant lock to inactive Task [index_kafka_prod_dataSource_1_events_cb10bb47ee9a044_mhdiolnm]
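
As an aside, the PT4S, PT8S, then PT10S cadence in the log above looks like a doubling backoff capped at ten seconds, abandoned after an overall deadline. A minimal sketch of that pattern (illustrative only, not the actual KafkaIndexTaskClient code; the constants are read off the log):

 import java.time.Duration;
 import java.util.function.BooleanSupplier;

 public class PausePollSketch
 {
   // Polls the task until it reports paused, doubling the wait up to a cap;
   // on deadline the caller logs "failed to pause, aborting" and kills the task.
   static boolean waitForPause(BooleanSupplier isPaused, Duration deadline) throws InterruptedException
   {
     final long deadlineNanos = System.nanoTime() + deadline.toNanos();
     final Duration cap = Duration.ofSeconds(10);
     Duration backoff = Duration.ofSeconds(4); // first retry interval, as in the log
     while (System.nanoTime() < deadlineNanos) {
       if (isPaused.getAsBoolean()) {
         return true;
       }
       System.out.printf("Still waiting for task to pause; will try again in [PT%dS]%n", backoff.getSeconds());
       Thread.sleep(backoff.toMillis());
       final Duration doubled = backoff.multipliedBy(2);
       backoff = doubled.compareTo(cap) > 0 ? cap : doubled;
     }
     return false;
   }
 }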


gianm commented Jan 9, 2018

I believe this is still an issue, but it should be less of a problem now that incremental handoffs from #4815 mean there is less work to do at the end of a task.

@gianm gianm removed this from the 0.12.0 milestone Jan 9, 2018

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 5, 2019