Move some lifecycle management from doTask -> shutdown for the mm-less task runner #14895

Merged · 9 commits · Aug 25, 2023

Conversation

georgew5656
Contributor

Description

The mm-less task runner differs in behavior from the other task runners that run on the overlord: it tries to handle the cleanup lifecycle on its own by immediately deleting the K8s job and clearing its tasks map as soon as it has finished running a task.

The HttpRemoteTaskRunner and RemoteTaskRunner don't do this. Instead, they rely on the TaskQueue's handlers for the futures returned by the task runners to call shutdown on a task once it has completed.
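
For reference, that handoff looks roughly like the sketch below (a simplified illustration of the future-callback pattern, not the actual TaskQueue code; notifyStatus here stands in for TaskQueue's real status handling):

    // Simplified sketch: the TaskQueue attaches a callback to the future returned by
    // TaskRunner.run() and only triggers cleanup (via notifyStatus -> shutdown) once
    // the task has completed.
    ListenableFuture<TaskStatus> statusFuture = taskRunner.run(task);
    Futures.addCallback(
        statusFuture,
        new FutureCallback<TaskStatus>()
        {
          @Override
          public void onSuccess(TaskStatus status)
          {
            // Persists the final status and location, then asks the runner to shut the task down.
            notifyStatus(task, status, "Task completed");
          }

          @Override
          public void onFailure(Throwable t)
          {
            notifyStatus(task, TaskStatus.failure(task.getId(), t.getMessage()), "Task failed");
          }
        },
        MoreExecutors.directExecutor()
    );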

Updating the mm-less task runner to use logic more similar to the other task runners has a couple of benefits:

  • The task location (including the k8sPodName) is successfully persisted to taskStorage in TaskQueue.notifyStatus. Currently the mm-less task runner reports no location in this call, because its run lifecycle will have already cleaned up the K8s Job and its tasks map.
  • Currently, when a task completes, the TaskQueue handler tries to call shutdown on the K8s task runner after the runner has already shut the task down. This produces a stream of "Ignoring request to cancel unknown task" logs and in general seems likely to cause unexpected behavior in the future. Changing the logic removes this duplication.

Changes

  • In KubernetesPeonLifecycle, stop calling shutdown (to delete the K8s job) after a job has finished.
  • In KubernetesTaskRunner.doTask, stop removing the taskId from tasks in the logic of the run future.
  • In KubernetesTaskRunner.shutdown, remove the taskId from tasks in addition to calling shutdown on the job. When the TaskQueue calls this shutdown function, everything associated with the task is cleaned up as expected (see the sketch after this list).
  • Remove the shutdownRequested flag from KubernetesWorkItem, since we can now treat the presence (or absence) of the taskId in the tasks map as an indicator of whether the task has been shut down.
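
Taken together, the shutdown path now looks roughly like the following (a hypothetical sketch of the consolidated logic, not the exact diff; names and log messages are simplified):

    @Override
    public void shutdown(String taskid, String reason)
    {
      log.info("Shutdown [%s] because [%s]", taskid, reason);

      // Removing the entry from the tasks map replaces the old shutdownRequested flag:
      // a missing entry now means the task has already been shut down.
      KubernetesWorkItem workItem;
      synchronized (tasks) {
        workItem = tasks.remove(taskid);
      }

      if (workItem == null) {
        log.info("Ignoring request to cancel unknown task [%s]", taskid);
        return;
      }

      // Delegates to the peon lifecycle, which deletes the underlying K8s job.
      workItem.shutdown();
    }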

Release note

Update mm-less task runner lifecycle logic to better match the logic in the HTTP and Zookeeper worker task runners.

Key changed/added classes in this PR
  • KubernetesPeonLifecycle
  • KubernetesTaskRunner
  • KubernetesWorkItem

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


@kfaraz kfaraz left a comment


Looks good. Left a minor comment.

@georgew5656 , what would be the side effect of this if the TaskQueue is hypothetically slow in cleaning up finished tasks?

@@ -188,7 +186,7 @@ protected synchronized TaskStatus join(long timeout) throws IllegalStateExceptio
*/
protected void shutdown()
{
if (State.PENDING.equals(state.get()) || State.RUNNING.equals(state.get())) {
if (State.PENDING.equals(state.get()) || State.RUNNING.equals(state.get()) || State.STOPPED.equals(state.get())) {
Contributor


Why do we need to handle STOPPED state here? Wouldn't the job have already finished?

Contributor Author


This is saying that it's okay to shut down if the state is STOPPED.

@georgew5656
Contributor Author

Looks good. Left a minor comment.

@georgew5656 , what would be the side effect of this if the TaskQueue is hypothetically slow in cleaning up finished tasks?

Two things will hang around for a while:

  • The completed K8s job/pod. I don't think this is a huge issue since it's not actually consuming any resources (basically a key/value entry in etcd).
  • The entry in the tasks map. The API call for listing tasks will still return this value, and some of the task slot metrics will report the task as still running. New tasks will still be able to run, since the thread executing the future for the task will have completed.


@YongGang YongGang left a comment


Nice cleanup, left one minor comment

@@ -271,6 +262,10 @@ public void shutdown(String taskid, String reason)
return;
}

synchronized (tasks) {
Contributor


nit: the code can be simplified:

    KubernetesWorkItem workItem;
    synchronized (tasks) {
      workItem = tasks.remove(taskid);
    }
    if (workItem == null) {
      log.info("Ignoring request to cancel unknown task [%s]", taskid);
      return;
    }

@dclim dclim merged commit 95b0de6 into apache:master Aug 25, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023