Support for middle manager less druid, tasks launch as k8s jobs #13156

Merged 22 commits on Nov 3, 2022

Conversation

Contributor

@churromorales commented Sep 29, 2022

Description

Add an extension to allow tasks to be run as k8s jobs from the overlord, eliminating the need for a middle manager.

The core changes are as follows:

  1. Refactored the arguments to CliPeon to be more generic.
  2. Added setup and cleanup methods to AbstractTask.
    a. Because tasks run on separate pods, each task needs to set up its own filesystem directories.
    b. Again because tasks run on separate pods, the task itself pushes its logs and task reports in the cleanup method.
  3. A few other small changes to core were required for tasks to run independently on their own.

How it works

The KubernetesTaskRunner runs in the overlord process. When it receives a request to launch a task, it goes to the K8s API and grabs its own PodSpec (the overlord's), then modifies the necessary attributes (e.g. command, labels, env variables). It takes the task.json, compresses and base64-encodes it, then launches a K8s Job.

On startup, the K8s Job unwraps the task.json env variable, writes it to the appropriate directory, and runs the task.

The KubernetesTaskRunner monitors the lifecycle of the task, just as the ForkingTaskRunner does, and returns the TaskStatus.
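As a rough illustration of the "compress and base64 encode" step (the class and method names below are hypothetical; the real logic lives in the extension's task adapter), the task.json handling amounts to gzip plus base64:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper sketching the encoding step described above; the peon
// reverses it on startup before writing task.json to its task directory.
public final class TaskSpecEncoder
{
  public static String encode(String taskJson) throws IOException
  {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(baos)) {
      gzip.write(taskJson.getBytes(StandardCharsets.UTF_8));
    }
    // The resulting string is compact enough to ship as an env variable
    // on the Job spec.
    return Base64.getEncoder().encodeToString(baos.toByteArray());
  }
}
```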

What if you are running Sidecars?

The config option druid.indexer.runner.sidecarSupport enables launching sidecars. I utilize kubexit (https://github.com/karlkfi/kubexit) to set up the spec such that when the main container completes, it terminates the sidecars. This is a known issue with K8s Jobs, and this is how I work around it; a sketch of the pattern follows.
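For readers unfamiliar with kubexit, the wiring looks roughly like this (container names and paths here are placeholders, not the extension's literal output): each container's command is wrapped in kubexit, and a shared "graveyard" volume carries tombstones so a sidecar exits when the main container dies.

```yaml
# Hedged sketch of the kubexit pattern, not the extension's exact output.
containers:
  - name: main                       # the peon task container
    command: ["/kubexit/kubexit", "/peon.sh"]
    env:
      - name: KUBEXIT_NAME
        value: main
      - name: KUBEXIT_GRAVEYARD      # shared tombstone directory
        value: /graveyard
  - name: sidecar                    # e.g. istio-proxy, a splunk forwarder
    command: ["/kubexit/kubexit", "/usr/local/bin/sidecar"]
    env:
      - name: KUBEXIT_NAME
        value: sidecar
      - name: KUBEXIT_GRAVEYARD
        value: /graveyard
      - name: KUBEXIT_DEATH_DEPS     # die when "main" writes its tombstone
        value: main
```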

Another nice side-effect

Because the launching of tasks has been decoupled from the service itself, tasks run independently of the state of the overlord process. You can shut down the overlord process, and when it comes back up, it will go to the K8s API, get the status of all peon jobs regardless of phase (in flight, completed, failed, pending), do the proper bookkeeping for completed tasks, and resume monitoring tasks in flight.

To run a middle-manager-less Druid, simply omit the middle managers from your deployment.

Make sure you also set druid.processing.intermediaryData.storage.type=deepStorage

In your overlord config:
1. Add the druid-kubernetes-overlord-extensions to your extensions load list.
2. Set druid.indexer.runner.type=k8s.
3. Set druid.indexer.runner.namespace=<currentNamespace>.
4. druid.indexer.queue.maxSize controls the maximum number of concurrent tasks; you must set it to a value lower than its default of Integer.MAX_VALUE.
5. Set druid.indexer.task.encapsulatedTask=true.
A complete example is sketched below.
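Pulled together, a minimal overlord runtime.properties for this mode might look like the following (the namespace and maxSize values are placeholders to adapt to your cluster):

```properties
# Hedged example; the namespace and maxSize values are illustrative.
druid.extensions.loadList=["druid-kubernetes-overlord-extensions"]
druid.indexer.runner.type=k8s
# The namespace Druid itself runs in:
druid.indexer.runner.namespace=druid
# Maximum number of concurrent tasks (K8s jobs):
druid.indexer.queue.maxSize=10
druid.indexer.task.encapsulatedTask=true
```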
This PR has:

  • [x] been self-reviewed.
    • [x] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [x] added documentation for new or modified features or behaviors.
  • [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [x] added or updated version, license, or notice information in licenses.yaml
  • [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests. (These have been added, but the k8s integration tests only work on a Linux machine, as they use conntrack with minikube. Thus I will have to let Travis run and figure things out from there.)
  • been tested in a test Druid cluster.

lgtm-com bot commented Sep 29, 2022

This pull request introduces 5 alerts when merging 190f8dd into ce5f55e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@@ -0,0 +1,153 @@
#!/bin/sh
Contributor

Meta commentary: it would be cool (and I don't think too crazy) if there was a Cli target that is like "task" which essentially does a remote bootstrap.

I.e. it could start up and the first thing it does is read some config for the overlord process, "phone home" to ask for the task spec and runtime properties, and then use those to bootstrap.

This might open up security holes, so it would probably have to be done with some sort of shared secret or something, maybe? But anyway, might simplify this script into something that is relatively generic and not even k8s-dependent.

Could also be done as an evolution of this change post-merge.

Contributor Author

So when creating this patch, I was hoping to minimize core changes to increase the odds of this patch getting accepted. In all honesty, I believe if this becomes a first-class feature, we can fix some other hacky stuff that was done. I definitely think having a Cli command for the task is a good idea; this is something I could put up a PR for if/when this gets merged.


lgtm-com bot commented Sep 30, 2022

This pull request introduces 5 alerts when merging afff509 into ce5f55e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Sep 30, 2022

This pull request introduces 5 alerts when merging 2f06ec1 into ebfe1c0 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

Contributor

@paul-rogers left a comment

@churromorales, this is a very impressive PR! Getting K8s support into Druid is HUGE! I made a first pass to get the lay of the land: comments are all nits in this pass.

Add configuration option to disable http/https proxy for the k8s client
Update the docs to provide more detail about sidecar support

lgtm-com bot commented Oct 4, 2022

This pull request introduces 5 alerts when merging 2e23952 into 92d2633 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@georgew5656
Contributor

This looks great @churromorales, I've started looking through some of the Kubernetes-related code today; will hopefully finish up by tomorrow.

@abhishekagarwal87 self-requested a review October 12, 2022 09:32
Contributor

@paul-rogers left a comment

Working my way through a second, detailed review. The code looks quite solid so far. Most of the comments are questions, doc enhancement requests, and style suggestions. More to come.

# NOTE: this is a 'run' script for the stock tarball
# It takes 1 required argument (the name of the service,
# e.g. 'broker', 'historical' etc). Any additional arguments
# are passed to that service.
Contributor

This is similar to the standard script. Is it a copy? If a copy, then perhaps we can avoid an actual copy: add a rule to the distribution project (likely in assembly.xml) to copy the existing script to peon.sh where needed.

If this version has changes, can they be applied to the standard script somehow to avoid the copy?

Contributor Author

The peon.sh is copied in the Dockerfile; do we need to add a rule here? I just do it in the Dockerfile itself, so the user won't have to worry: if anything changes with the script, it gets automatically updated. I believe that is the same thing that happens with the druid.sh script.

Contributor

Can we add a comment here specifying that this is similar to druid.sh and is used exclusively for the kubernetes-overlord-extension?

docs/development/extensions-contrib/k8s-jobs.md

## How it works

It takes the podSpec of your `Overlord` pod and creates a kubernetes job from this podSpec. Thus if you have sidecars such as splunk, hubble, istio it can optionally launch a task as a k8s job. All jobs are natively restorable, they are decoupled from the druid deployment, thus restarting pods or doing upgrades has no effect on tasks in flight. They will continue to run and when the overlord comes back up it will start tracking them again.
Contributor

It -> "The K8s extension"

Please capitalize proper nouns.

Contributor

Perhaps provide just a bit more background for newbies. Where does the pod spec come from? Something like:

With this extension, your Overlord tasks run in a Kubernetes pod. You define that pod using a pod spec. The pod spec names a Docker image with Druid installed, identifies K8s settings, provides Druid configuration, passes in secrets, etc. You create the pod spec outside of Druid. We suggest you test the spec directly in K8s before using it with Druid.

I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.

Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?

Contributor

The above says "pod spec". Is that literally what we need? Or, do we provide a pod template? Per the docs,

> Controllers for workload resources create Pods from a pod template and manage those Pods on your behalf.
>
> PodTemplates are specifications for creating Pods, and are included in workload resources such as Deployments, Jobs, and DaemonSets.

Would I be on the right track to think of OL as a controller in this context?

Or, does this extension use the pod spec as the template (because, say, the other template properties don't apply here)?

Contributor

There seems to be one pod spec for a cluster. Should I then ensure that the pod is large enough for my largest task? Would a future enhancement be the ability to override the pod spec per-task, so that a light-weight task (the MSQ controller) takes fewer resources than a heavy-weight task (an MSQ worker)?

Contributor Author

So the reason for the podSpec is that I didn't want to take a brand new docker image and create a job spec for it. I don't know your cluster config. Suppose you have secrets mounted, volumes set up your way, env variables, certs, annotations for things like istio, etc. The parent podSpec logic is there to ensure your peon task has those things and you don't have to worry about it. But I do take that pod spec and massage it. I do things like change the resources required. For CPU we always give the task a single core (just like what Druid was always doing). For memory we take the JVM opts you pass and configure the container resources from that. I believe it's something like (Xmx + dbb) * 1.2. So while the CPU resources are fixed at a core, the memory is deduced from your JVM opts. I want to make sure I answered your question correctly here; please let me know if things don't make sense.
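In code form, the memory rule described here comes out to roughly the following sketch (the constant and names follow the comment above; treat the exact parsing and rounding as assumptions, with the real logic in the K8sTaskAdapter):

```java
// Hedged sketch of deriving container resources from the task's JVM opts.
static long containerMemoryBytes(long heapBytes, long directMemoryBytes)
{
  // limit = (Xmx + MaxDirectMemorySize) * 1.2, per the description above
  return (long) ((heapBytes + directMemoryBytes) * 1.2);
}

static final String CONTAINER_CPU = "1"; // one core per task, as before
```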

Contributor Author

> Perhaps provide just a bit more background for newbies. Where does the pod spec come from? Something like:
>
> With this extension, your Overlord tasks run in a Kubernetes pod. You define that pod using a pod spec. The pod spec names a Docker image with Druid installed, identifies K8s settings, provides Druid configuration, passes in secrets, etc. You create the pod spec outside of Druid. We suggest you test the spec directly in K8s before using it with Druid.
>
> I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
>
> Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?

Maybe I wasn't clear, you don't have to do anything here. The podSpec comes from the overlord pod, the pod that launches the task. The K8sTaskAdapter takes the overlord spec and modifies it to run a peon k8s job. The user has to do nothing here.

Contributor Author

> I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
>
> Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?

Should I just exclude the podSpec part? Users have to do nothing here; I am just explaining how it works at a high level.

Contributor

It makes sense to get the podSpec from the overlord. But I suppose that, as a user, I can override some elements in the spec if I want?

Contributor Author

Yes, definitely. Right now I give you the ability to override annotations and labels. In a later PR we can make this more configurable for users. I have used this patch for a few Druid clusters and did not need to override anything else, but that is just my experience; I think going forward we can make this more configurable. The goal of this patch was to make it as easy as possible for users to switch from using MMs to K8s jobs. Currently it's: add 4 required configurations to the overlord and you are good to go.

Contributor

For sure, I don't think we need those features right away.

The extension uses the task queue to limit how many concurrent tasks (k8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.

Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.enableTaskLevelLogPush: true`
Contributor

From this, can I infer that, with this extension, all tasks run in K8s? There is no mixed mode where I can choose the hosting type per-task? (Not sure mixed usage is useful, just asking.)

If use of this extension is binary, then perhaps note somewhere that the cluster should be configured with no MiddleManagers and no Indexers. (If that is, in fact, what I should do.)

Contributor Author

That is correct; there is no mixed-mode support. If you turn on k8s mode, all tasks are k8s tasks. We have been running this patch and we don't launch any MMs. If you deploy using the operator, this is quite easy to do, as you just remove the middleManager section of the spec completely. I will add that to the docs, good point.


To use this extension please make sure to [include](../extensions.md#loading-extensions)`druid-kubernetes-overlord-extensions` in the extensions load list for your overlord process.

The extension uses the task queue to limit how many concurrent tasks (k8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.
Contributor

In other words, the extension will run K8s jobs up to the limit specified? The assumption is that K8s has at least this much capacity? What happens if I configure the maxSize to be greater than my K8s capacity? Will the "extra" Druid tasks remain queued?

Druid recently introduced MSQ, which launches a controller, then launches a wave of workers. Could we get into a deadlock situation if K8s has enough capacity to launch only half the workers? If so, we should note to users to set maxSize to less than the K8s capacity.

Contributor Author

So this one was a tricky one. Some users have quotas set up for their namespace while others do not, quotas such as how many jobs I can have running concurrently. I believe if you do this, you will just have unscheduled pods while resources are not available, and then when they become available the pods will launch. If you use AWS, you could perhaps tie in the autoscaler with logic stating: if I have X pods unscheduled, then scale up EC2 instances; and scale down with the reverse logic: if I have EC2 instances not using resources, scale things down.

To answer your question: yes, we should always ask users to set maxSize to less than the K8s capacity, unless they have autoscaling configured. In that case it's up to the user to determine what they can potentially scale up to.

// join queries
if (task.supportsQueries()) {
  command.add("--loadBroadcastSegments");
  command.add("true");
Contributor

General question: if a peon supports queries, it will have incoming connections. How do we ensure that the Peon's IP address is visible to the Broker? K8s usually has some kind of overlay network. Are we assuming Druid runs within K8s so it is on that same network? Or, that somewhere we've set up a proxy into the overlay network? It's been a while since I used K8s, perhaps the proxy isn't needed? Or, the networks are merged in AWS? Is that true on plain-Jane K8s?

Contributor Author

Good question. For this patch, I am forcing the tasks to run in the current namespace that Druid is running in. You could easily have the tasks run in a different namespace, but it would require a few changes to this patch (not many, though). When a job gets launched, we wait for the pod to start up; once the pod starts, we grab the IP address of the pod and broadcast that as the TaskLocation (roughly as in the sketch below). There would be no need for a proxy if you wanted to go across namespaces in the same context, but if you had Druid running in namespace A and tasks running in namespace B, you would have to set up your K8s networking to communicate bidirectionally.
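For concreteness, the wait-for-IP step might look like this with the fabric8 Kubernetes client (shown purely as an illustration; the PR's actual client choice and timeouts may differ, and `podName`/`port` are placeholders):

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.concurrent.TimeUnit;
import org.apache.druid.indexer.TaskLocation;

// Hedged sketch: block until the peon pod has an IP, then publish it
// as the task's location so the broker can reach the peon directly.
static TaskLocation waitForPeonLocation(KubernetesClient client, String namespace, String podName, int port)
{
  Pod pod = client.pods()
                  .inNamespace(namespace)
                  .withName(podName)
                  .waitUntilCondition(p -> p.getStatus() != null && p.getStatus().getPodIP() != null,
                                      5, TimeUnit.MINUTES);
  return TaskLocation.create(pod.getStatus().getPodIP(), port, -1);
}
```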

Contributor Author

The goal of this patch is that if you are running Druid on K8s, you shouldn't need to do anything except add the extension and a few configuration options to the overlord, and you are good to go. In the gotchas section of the README there are some things you have to look out for with sidecars, which I mentioned. Also, your service account must have the ability to deal with K8s Jobs; I included a sample Role / RoleBinding as an example in the README.md (the sketch below shows the general shape). We have been using this on some of our clusters for a while now and we haven't really had to add any extra configuration or do anything special in K8s to get this to work.
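For orientation, RBAC of roughly this shape is what such a sample would grant (all names and the namespace here are placeholders, not the README's literal example):

```yaml
# Hedged sketch: let the overlord's service account manage peon Jobs/Pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: druid-k8s-task-runner
  namespace: druid
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: druid-k8s-task-runner
  namespace: druid
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: druid-k8s-task-runner
subjects:
  - kind: ServiceAccount
    name: druid-overlord
    namespace: druid
```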

mainContainer.setPorts(Lists.newArrayList(httpsPort, tcpPort));
}

protected void addEnvironmentVariables(Container mainContainer, PeonCommandContext context, String taskContents)
Contributor

similar to the other comments, I think we want to make additional things configurable here if possible

Contributor Author

Agreed, I will make envs and annotations configurable.

Contributor Author

@churromorales Oct 13, 2022

Actually, one question: for env variables being configurable, do you have a use case? It takes all the env variables of the running overlord pod; would there be a situation where you would need something additional here? The container shouldn't need more than what Druid has to work, right? Annotations I will add, because that could be something you have set up for K8s, like some autoscaling or networking annotation you want to add. Labels are already configurable.

Contributor Author

Annotations and labels are configurable. If you can provide me with an example of where we would need env variables, I would be happy to make them configurable as well. But I personally think that could potentially cause more harm than good.

Contributor

I thought I added it as a comment, but what I actually think is most important to be configurable is nodeSelector/affinity/tolerations, in case people want to configure what node pools to run on.

If nodeSelector is not set, then maybe we can use whatever the overlord uses as its node selector.

Env variables are not as important, but I could see something like DD_TRACE_AGENT_URL for enabling tracing with Datadog.

Contributor

I also like the idea of using a custom pod spec, but that seems like a bigger change that we could do as a follow-up.

Contributor Author

You have a valid point; maybe in a future PR we could add some ConfigMap where you can define a podSpec template. That should be pretty easy to do.

Contributor

This may not be an issue, but is there a limit in K8s on how large an env variable value can be? The task JSON could be large.

Contributor Author

There is, and it's no different from the ConfigMap max size either. That is why I compress and base64-encode the task.json file when I pass it in the spec.


lgtm-com bot commented Oct 13, 2022

This pull request introduces 5 alerts when merging 6c13bf0 into 45dfd67 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@churromorales
Contributor Author

churromorales commented Oct 13, 2022

I noticed a slight issue when testing on heavily loaded K8s clusters, those that take a while to spin pods up and down.

Currently, the way things work:
1. The overlord launches a task and waits for the pod to come up; it not only waits for a pod IP to be available, but also opens a socket connection to the peon process's web server. Once you have a socket connection, the task is considered started.
2. Then it monitors the job to completion, but this is a blocking call; after the job completes, it pushes task reports and logs and then deletes the peon K8s job.

What was problematic here is that I noticed the process was complete and the task itself was finished, but the K8s operations were slow on my cluster. The task itself took less than a minute to complete, but the entire K8s lifecycle was taking much longer, around 10 minutes (a very heavily loaded cluster).

Thus the task status only updates in the finally block of K8sTaskRunner.run(), so the task lock is only released when the task finishes and the K8s job is deleted. That is technically correct, but in reality the task JVM process itself could have completed long before this. I have another commit that I will test on our clusters; a brief description of the patch is this (see the sketch after the list):

  1. The overlord launches a task and waits for the pod to come up. No more opening a socket to the web server; instead I added 2 TaskActions: one to update the TaskStatus, one to update the TaskLocation.
  2. Now, in the AbstractTask, the setup method updates its own location before the run method is called. If the overlord goes down for whatever reason, that is fine; we don't lose anything, really. The call fails and the task itself dies.
  3. In the cleanup method of the AbstractTask we also update the TaskLocation to unknown and update the status to success or failure. So when the process exits, we can give up the lock and things won't be blocked. In case the pod goes away during the cleanup, that is okay; the overlord will still monitor the job and report the correct results, albeit a bit slower.
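In Druid terms, the flow being described is roughly the following (the action names follow this description; the exact signatures and error handling are illustrative, not the PR's literal code):

```java
// Hedged sketch of the encapsulated-task lifecycle inside AbstractTask.run().
public TaskStatus run(TaskToolbox toolbox) throws Exception
{
  // setup(): the task announces its own location up front, instead of the
  // overlord opening a socket to the peon's web server.
  toolbox.getTaskActionClient().submit(new UpdateLocationAction(myLocation));

  TaskStatus status;
  try {
    status = runTask(toolbox);
  }
  catch (Exception e) {
    status = TaskStatus.failure(getId(), e.getMessage());
  }

  // cleanup(): push logs/reports, clear the location, and report the final
  // status so the lock is released even if deleting the K8s job is slow.
  toolbox.getTaskActionClient().submit(new UpdateLocationAction(TaskLocation.unknown()));
  toolbox.getTaskActionClient().submit(new UpdateStatusAction(status.isSuccess() ? "success" : "failure"));
  return status;
}
```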

I thought about this approach originally but didn't want to make too many changes to core. After running this patch on our clusters, though, I do think this is the best approach. I will call it out in a separate commit so you can review this work; it doesn't seem too invasive to me, but we should focus on correctness here and not hold the locks longer than necessary, as we were doing in the ForkingTaskRunner.

I will test this out over the next day or so and then add the commit to this PR.

@Fryuni

Fryuni commented Oct 14, 2022

Hey, this is an awesome PR, thanks for all the work!
I am concerned about this part, though:

> The KubernetesTaskRunner runs in the overlord process. When it receives a request to launch a task, it goes to the K8s API and grabs its own PodSpec (the overlord's), then modifies the necessary attributes (e.g. command, labels, env variables). It takes the task.json, compresses and base64-encodes it, then launches a K8s Job.

The resource use for overlords and tasks is vastly different; using the same podSpec for both would require reserving much more resources for the overlords and the tasks than necessary.

One idea is that instead of reading from the overlord pod itself, it would be better to provide the name of a config map containing the podSpec.

Another idea: the middle-managers have tiers, which we can use to assign tasks to middle-managers with different resource allocations. Those tiers could be used to define different podSpecs for different tasks without changing the tasks API.

Those changes can be done on a later PR. All in all, this PR is a huge step forward in making Druid more scalable for cloud environments.

@churromorales
Contributor Author

churromorales commented Oct 14, 2022

Thank you for your kind words. I hope this patch will be useful to folks upstream.

> The resource use for overlords and tasks is vastly different; using the same podSpec for both would require reserving much more resources for the overlords and the tasks than necessary.

This doesn't use the resources of the overlord; we overwrite the resources in the podSpec. The memory is derived from your JAVA_OPTS (for your task), basically 1.2 * (Xmx + Dbb) for the memory limit, and the CPU is always 1. You get a core per task, just like we did before. We only grab certain items from the parent podSpec. You can see the logic in the K8sTaskAdapter.

I am open to whatever changes you guys think are necessary. I have a few more ideas for subsequent PRs and I do like the configmap pod spec for users that want more control. That might be a configurable option we can provide in another PR.

@Fryuni

Fryuni commented Oct 14, 2022

> This doesn't use the resources of the overlord; we overwrite the resources in the podSpec. The memory is derived from your JAVA_OPTS (for your task), basically 1.2 * (Xmx + Dbb) for the memory limit, and the CPU is always 1.

That is super great news on the memory side, but not so great on the CPU side for us. We have 23 continuously running ingestion tasks and they only use a tenth of a core (100m) each, so fixing the resource request to one core is a 9x overprovision totaling 21 cores. A full core works for our compaction tasks, but those run for about 10 minutes every 4 hours.

We'll explore some things with Druid in the near future, so we might use this PR before it lands and give some feedback.

… / location, for slower k8s clusters, this reduces locking time significantly


lgtm-com bot commented Oct 28, 2022

This pull request introduces 5 alerts when merging b12de57 into 4f0145f - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

Member

@clintropolis left a comment

very nice! 🚀

i only reviewed this PR at a high level to view changes to core druid stuffs and didn't look too closely at the kubernetes-specific stuff because that isn't my area of expertise, but everything seems overall fine to me, and the kubernetes area does look like it has had some people looking at it, so I'll go ahead and leave an approval.

I think the discussions around supporting different podSpec/nodeSelector/(other kubernetes words I vaguely understand) are interesting; when we get there we should think about whether we can overhaul the existing worker select strategy stuff to accommodate this, or whether it needs new, similar machinery. Being able to control which tasks run in which places is generally useful for mixed workloads, so it would definitely be a good investigation for follow-up.

Longer term it would be nice to eventually get this combined into the druid-kubernetes-extensions core extension so that operators need to only load a single extension to have a Zookeeper-free Druid experience 🎉

Comment on lines -32 to -34
/**
* This test verifies injection for {@link ServerRunnable}s which are discoverable Druid servers.
*/
Member

nit: intended change?

Comment on lines +225 to +228
if ("k8s".equals(properties.getProperty("druid.indexer.runner.type", null))) {
log.info("Running peon in k8s mode");
executorLifecycleConfig.setParentStreamDefined(false);
}
Member

nit: this also seems a bit strange and special caseish for a specific thing, though i'm not sure introducing another property here is the solution...

The extension uses the task queue to limit how many concurrent tasks (K8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.

Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.encapsulatedTask: true`
Member

druid.indexer.task.encapsulatedTask doesn't seem to be documented anywhere (or in javadocs); do we need to elaborate on what it does?

/**
 * Because the k8s task runner is an extension, we need to know the task runner type in the overlord resource
 */
default boolean isK8sTaskRunner()
Member

since this class is marked @PublicApi, I think we should consider naming this something less specific, like isDirectTaskRunner or isWorkerFreeTaskRunner, something to indicate that it is worker-free but can still run up to getTotalTaskSlotCount tasks (based on how it's being used in OverlordResource)

Contributor

@paul-rogers left a comment

@churromorales, thanks for the fantastic contribution!

I expect that, once this is merged into master, and we all can play with it, we'll find additional things to improve. K8s is so complex, and there are so many different ways of using it, that it would be surprising if there were not opportunities for enhancements. But, the fastest way to find those improvements is by trying it out, as you've been doing.

Also, it is important to note that a) this is an extension, and b) it is marked experimental. That means it doesn't have to cover every possible use case from the get-go; it means folks should try it and see how it works.

Given all that, LGTM. There are a few active comments, and the build is still in progress. Once those are resolved, it has my +1.

@churromorales
Contributor Author

> * Avoid creating TaskLocations with `null` host and non-`-1` values for `port` and `tlsPort`. This may be happening in `K8sWorkItem` if `mainPod.getStatus().getPodIP()` is `null`. Do you think this is possible? To keep things coherent I suggest enforcing it in the TaskLocation constructor, which could throw an exception if host is `null` and `port` and `tlsPort` are anything other than `-1`. (This would ensure that `null` host always compares equal to `TaskLocation.unknown()`.)

@gianm that's absolutely the problem, I believe; thank you for pointing me in the right direction, I couldn't tell from the trace where the issue was. I'll fix that in a follow-up PR so we can get these two features working together. Thank you again for getting back to me about this.


lgtm-com bot commented Oct 31, 2022

This pull request introduces 5 alerts when merging fb6609e into 675fd98 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 1, 2022

This pull request introduces 5 alerts when merging a4b4cdb into fd7864a - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 1, 2022

This pull request introduces 5 alerts when merging ab612d8 into 176934e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 2, 2022

This pull request introduces 5 alerts when merging 9ebadf4 into 176934e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 2, 2022

This pull request introduces 5 alerts when merging d40db8f into 176934e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@a2l007
Contributor

a2l007 commented Nov 3, 2022

Thanks for the contribution @churromorales!
Looking forward to the follow-up PRs to handle the outstanding items identified from this PR.

@a2l007 merged commit e5ad24f into apache:master Nov 3, 2022
@abhishekagarwal87
Contributor

Thank you for the contribution @churromorales. This will be a major feature in the upcoming 25.0.0 release.
