Support for middle manager less druid, tasks launch as k8s jobs #13156
Conversation
This pull request introduces 5 alerts when merging 190f8dd into ce5f55e - view the new alerts on LGTM.com.
@@ -0,0 +1,153 @@
#!/bin/sh
Meta commentary: it would be cool (and I don't think too crazy) if there was a Cli target that is like "task" which essentially does a remote bootstrap.
I.e. it could start up and the first thing it does is read some config for the overlord process, "phone home" to ask for the task spec and runtime properties, and then use those to bootstrap.
This might open up security holes, so it would probably have to be done with some sort of shared secret or something, maybe? But anyway, might simplify this script into something that is relatively generic and not even k8s-dependent.
Could also be done as an evolution of this change post-merge.
So when creating this patch, I was hoping to minimize core changes to increase the odds of this patch getting accepted. In all honesty, I believe that if this becomes a first-class feature, we can fix some other hacky stuff that was done. I definitely think having a Cli command for the task is a good idea; it's something I could put up a PR for if/when this gets merged.
This pull request introduces 5 alerts when merging afff509 into ce5f55e - view the new alerts on LGTM.com.
This pull request introduces 5 alerts when merging 2f06ec1 into ebfe1c0 - view the new alerts on LGTM.com.
@churromorales, this is a very impressive PR! Getting K8s support into Druid is HUGE! I made a first pass to get the lay of the land: comments are all nits in this pass.
Add configuration option to disable http/https proxy for the k8s client.
Update the docs to provide more detail about sidecar support.
This pull request introduces 5 alerts when merging 2e23952 into 92d2633 - view the new alerts on LGTM.com.
This looks great @churromorales, I've started looking through some of the kubernetes-related code today; will hopefully finish up by tomorrow.
Working my way through a second, detailed review. The code looks quite solid so far. Most of the comments are questions, suggested documentation enhancements, and style suggestions. More to come.
# NOTE: this is a 'run' script for the stock tarball
# It takes 1 required argument (the name of the service,
# e.g. 'broker', 'historical' etc). Any additional arguments
# are passed to that service.
This is similar to the standard script. Is it a copy? If a copy, then perhaps we can avoid an actual copy: add a rule to the distribution project (likely in assembly.xml) to copy the existing script to peon.sh where needed.
If this version has changes, can they be applied to the standard script somehow to avoid the copy?
The peon.sh is copied in the Dockerfile; do we need to add a rule here? I just do it in the Dockerfile itself, so the user won't have to worry, and if anything changes with the script it gets automatically updated. I believe that is the same thing that happens with the druid.sh script.
Can we add a comment here specifying that this is similar to druid.sh and is used exclusively for the kubernetes-overlord-extension?
## How it works
It takes the podSpec of your `Overlord` pod and creates a kubernetes job from this podSpec. Thus if you have sidecars such as splunk, hubble, istio it can optionally launch a task as a k8s job. All jobs are natively restorable, they are decoupled from the druid deployment, thus restarting pods or doing upgrades has no effect on tasks in flight. They will continue to run and when the overlord comes back up it will start tracking them again.
It -> "The K8s extension"
Please capitalize proper nouns.
Perhaps provide just a bit more background for newbies. Where does the pod spec come from? Something like:
With this extension, your Overlord tasks run in a Kubernetes pod. You define that pod using a pod spec. The pod spec names a Docker image with Druid installed, identifies K8s settings, provides Druid configuration, passes in secrets, etc. You create the pod spec outside of Druid. We suggest you test the spec directly in K8s before using it with Druid.
I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?
The above says "pod spec". Is that literally what we need? Or, do we provide a pod template? Per the docs,
Controllers for workload resources create Pods from a pod template and manage those Pods on your behalf.
PodTemplates are specifications for creating Pods, and are included in workload resources such as Deployments, Jobs, and DaemonSets.
Would I be on the right track to think of OL as a controller in this context?
Or, does this extension use the pod spec as the template (because, say, the other template properties don't apply here)?
There seems to be one pod spec for a cluster. Should I then ensure that the pod is large enough for my largest task? Would a future enhancement be the ability to override the pod spec per-task so that a light-weight task (the MSQ controller) takes fewer resources than a heavy-weight task (an MSQ worker)?
So the reason for the podSpec is that I didn't want to take a brand new docker image and create a job spec for it. I don't know your cluster config. Suppose you have secrets mounted, volumes set up your way, env variables, certs, annotations for things like istio, etc. The parent pod spec logic is to ensure you have those things for your peon task and you don't have to worry about it. But I do take that pod spec and massage it. I do things like change the resources required. For CPU we always give the task a single core (just like what druid was always doing). For memory we take the jvm opts you pass and configure the container resources from that. I believe it's something like (Xmx + dbb)*1.2. So while the cpu resources are fixed at a core, the memory is deduced from your jvm opts. I want to make sure I answered your question correctly here, please let me know if anything doesn't make sense.
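For illustration, here is a minimal sketch of that memory rule (1.2 * (Xmx + direct memory)); this is not the PR's actual code, and the class name, method names, and parsing details are assumptions:

```java
import java.util.List;

// Illustrative sketch only: derive a container memory limit from the task's JVM options.
public class PeonMemoryEstimate
{
  // Parses values like "2g", "512m", "1024k", or a raw byte count.
  static long parseBytes(String value)
  {
    char unit = Character.toLowerCase(value.charAt(value.length() - 1));
    long multiplier = unit == 'g' ? 1L << 30 : unit == 'm' ? 1L << 20 : unit == 'k' ? 1L << 10 : 1L;
    String digits = Character.isDigit(unit) ? value : value.substring(0, value.length() - 1);
    return Long.parseLong(digits) * multiplier;
  }

  // 1.2 * (Xmx + MaxDirectMemorySize), as described in the comment above.
  static long containerMemoryLimit(List<String> javaOpts)
  {
    long heap = 0;
    long direct = 0;
    for (String opt : javaOpts) {
      if (opt.startsWith("-Xmx")) {
        heap = parseBytes(opt.substring("-Xmx".length()));
      } else if (opt.startsWith("-XX:MaxDirectMemorySize=")) {
        direct = parseBytes(opt.substring("-XX:MaxDirectMemorySize=".length()));
      }
    }
    return (long) ((heap + direct) * 1.2);
  }

  public static void main(String[] args)
  {
    // e.g. -Xmx2g plus 1g of direct memory -> roughly a 3.6 GiB container limit
    System.out.println(containerMemoryLimit(List.of("-Xmx2g", "-XX:MaxDirectMemorySize=1g")));
  }
}
```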
Perhaps provide just a bit more background for newbies. Where does the pod spec come from? Something like:
With this extension, your Overlord tasks run in a Kubernetes pod. You define that pod using a pod spec. The pod spec names a Docker image with Druid installed, identifies K8s settings, provides Druid configuration, passes in secrets, etc. You create the pod spec outside of Druid. We suggest you test the spec directly in K8s before using it with Druid.
I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?
Maybe I wasn't clear, you don't have to do anything here. The podSpec comes from the overlord pod, the pod that launches the task. The K8sTaskAdapter takes the overlord spec and modifies it to run a peon k8s job. The user has to do nothing here.
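To make the shape of that concrete, here is a rough sketch of wrapping an existing pod spec in a Job, assuming the fabric8 Kubernetes client; the names and structure are illustrative, not the actual K8sTaskAdapter code:

```java
import io.fabric8.kubernetes.api.model.PodSpec;
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;

// Illustrative sketch: reuse the overlord's pod spec (volumes, secrets,
// annotations, sidecars, ...) as the template for a peon Job.
public class TaskJobSketch
{
  static Job fromOverlordPodSpec(PodSpec overlordPodSpec, String taskId, String namespace)
  {
    // The real adapter "massages" the spec first: command, labels, env
    // variables, and container resources are rewritten for the peon.
    return new JobBuilder()
        .withNewMetadata()
          .withName(taskId)
          .withNamespace(namespace)
        .endMetadata()
        .withNewSpec()
          .withBackoffLimit(0) // a failed task should not be retried by k8s
          .withNewTemplate()
            .withSpec(overlordPodSpec)
          .endTemplate()
        .endSpec()
        .build();
  }
}
```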
I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?
Should I just exclude the podSpec part? Users don't have to do anything here; I am just explaining how it works at a high level.
It makes sense to get the podSpec from the overlord. But I suppose that, as a user, I can override some elements in the spec if I want?
Yes, definitely. Right now I give you the ability to override annotations and labels. In a later PR we can make this more configurable for users. I have used this patch for a few druid clusters and did not need to override anything else, but that is just my experience; I think we can make this more configurable going forward. The goal of this patch was to make it as easy as possible for users to switch from using MMs to k8s jobs. Currently it's adding 4 required configurations to the overlord and you are good to go.
For sure, I don't think we need those features right away.
The extension uses the task queue to limit how many concurrent tasks (k8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.

Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.enableTaskLevelLogPush: true`
From this, can I infer that, with this extension, all tasks run in K8s? There is no mixed mode where I can choose the hosting type per-task? (Not sure mixed usage is useful, just asking.)
If use of this extension is binary, then perhaps note somewhere that the cluster should be configured with no MiddleManagers and no Indexers. (If that is, in fact, what I should do.)
That is correct, there is no mixed mode support. If you turn on k8s mode all tasks are k8s tasks. We have been running this patch and we don't launch any MM's. If you deploy using the operator, this is quite easy to do as you just remove the middleManager section of the spec completely. I will add that to the docs, good point.
To use this extension please make sure to [include](../extensions.md#loading-extensions) `druid-kubernetes-overlord-extensions` in the extensions load list for your overlord process.

The extension uses the task queue to limit how many concurrent tasks (k8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.
In other words, the extension will run K8s jobs up to the limit specified? The assumption is that K8s has at least this much capacity? What happens if I configure the maxSize to be greater than my K8s capacity? Will the "extra" Druid tasks remain queued?
Druid recently introduced MSQ, which launches a controller, then launches a wave of workers. Could we get into a deadlock situation if K8s has enough capacity to launch only half the workers? If so, we should note to users to set maxSize to less than the K8s capacity.
So this one was a tricky one. Some users have quotas set up for their namespace while others do not, quotas such as how many jobs can be running concurrently. I believe if you do this, you will just have unscheduled pods while resources are not available, and then when they become available the pods will launch. If you use AWS, you could perhaps tie in the autoscaler and have some logic stating that if I have x pods unscheduled then scale up EC2 instances, and scale down with the reverse logic: if I have EC2 instances not using resources, scale things down.
To answer your question, yes we should always ask users to set maxSize less than the k8s capacity, unless they have autoscaling configured. In that case it's up to the user to determine what they can potentially scale up to.
// join queries
if (task.supportsQueries()) {
  command.add("--loadBroadcastSegments");
  command.add("true");
General question: if a peon supports queries, it will have incoming connections. How do we ensure that the Peon's IP address is visible to the Broker? K8s usually has some kind of overlay network. Are we assuming Druid runs within K8s so it is on that same network? Or, that somewhere we've set up a proxy into the overlay network? It's been a while since I used K8s, perhaps the proxy isn't needed? Or, the networks are merged in AWS? Is that true on plain-Jane K8s?
Good question. For this patch, I am forcing the tasks to run in the current namespace that druid is running in. You could easily have the tasks run in a different namespace, but it would require a bit of changes to this patch (not much though). So when a job gets launched, we wait for the pod to start up; once the pod starts, we grab the IP address of the pod and broadcast that as the TaskLocation. There would be no need for a proxy if you wanted to go across namespaces in the same context, but if you had druid running in namespace A and tasks running in namespace B you would have to set up your k8s to be able to communicate bidirectionally.
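A minimal sketch of that "broadcast the pod IP" step, assuming the fabric8 client and Druid's TaskLocation; the wait-for-pod-ready handling is omitted and the method names are illustrative:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import org.apache.druid.indexer.TaskLocation;

// Illustrative sketch: once the peon pod is running, report its IP as the task location.
public class TaskLocationSketch
{
  static TaskLocation locationForPeon(KubernetesClient client, String namespace, String podName, int port, int tlsPort)
  {
    Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    return TaskLocation.create(pod.getStatus().getPodIP(), port, tlsPort);
  }
}
```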
The goal of this patch is that if you are running druid on k8s, you shouldn't need to do anything except add the extension and a few configuration options to the overlord and you are good to go. In the gotchas section of the readme, there are some things you have to look out for with sidecars, which I mentioned. Also your service account must have the ability to deal with k8s jobs; I included a sample role / rolebinding as an example in the readme.md. We have been using this on some of our clusters for a while now and we haven't really had to add any extra configuration or do anything special in k8s to get this to work.
  mainContainer.setPorts(Lists.newArrayList(httpsPort, tcpPort));
}

protected void addEnvironmentVariables(Container mainContainer, PeonCommandContext context, String taskContents)
similar to the other comments, I think we want to make additional things configurable here if possible
Agreed, I will make envs and annotations configurable.
Actually one question: for env variables being configurable, do you have a use case? It takes all the env variables of the running overlord pod; would there be a situation where you would need something additional here? The container shouldn't need more than what druid has to work, right? Annotations I will add, because that could be something you have set up for k8s, like some autoscaling annotation you want to add, or networking. Labels are already configurable.
Annotations and labels are configurable. If you can provide me with an example of where we would need env variables, I would be happy to make them configurable as well, but I personally think that could potentially cause more harm than good.
I thought I added it as a comment, but what I actually think is most important to be configurable is nodeSelector/affinity/tolerations, in case people want to configure what node pools to run on.
If nodeSelector is not set then maybe we can use whatever the overlord uses as its node selector.
Env variables are not as important, but I could see something like DD_TRACE_AGENT_URL for enabling tracing with datadog.
I also like the idea of using a custom pod spec, but that seems like a bigger change that we could do as a follow-up.
You have a valid point; maybe in a future PR we could add some config map where you can define a podSpec template. That should be pretty easy to do.
May not be an issue, but is there a limit in k8s on how large the env variable value can be? The task JSON could be large.
There is, and it's no different from a configmap max size either. That is why I compress and base64 encode the task.json file when I pass it in the spec.
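For reference, the compress-and-encode step can be done with the plain JDK; this is a hedged sketch of the idea, not the extension's exact code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: gzip + base64 the task spec so it fits in an env variable on the Job spec.
public class TaskSpecEncoder
{
  static String encodeTaskJson(String taskJson) throws IOException
  {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(baos)) {
      gzip.write(taskJson.getBytes(StandardCharsets.UTF_8));
    }
    return Base64.getEncoder().encodeToString(baos.toByteArray());
  }
}
```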
This pull request introduces 5 alerts when merging 6c13bf0 into 45dfd67 - view the new alerts on LGTM.com.
I noticed a slight issue when testing on heavily loaded k8s clusters, those that take a while to spin up / down pods. What was problematic here is that the process was complete, the task itself was finished, but the k8s operations were slow on my cluster. The task itself took less than a minute to complete, but the entire k8s lifecycle was taking much longer, around 10 minutes (a very heavily loaded cluster). Thus the task status only updates in the finally block.
I thought about this approach originally, but didn't want to make too many changes. I will test this out over the next day or so and then add the commit to this PR.
Hey, this is an awesome PR, thanks for all the work
The resource use for overlords and tasks is vastly different; using the same podSpec for both would require reserving far more resources for the overlords and the tasks than necessary. One idea: instead of reading from the overlord pod itself, it would be better to provide the name of a config map containing the podSpec. Another idea: the middle managers have tiers, which we can use to assign tasks to middle managers with different resource allocations. Those tiers could be used to define different podSpecs for different tasks without changing the task API. Those changes can be done in a later PR. All in all, this PR is a huge step forward in making Druid more scalable for cloud environments.
Thank you for your kind words. I hope this patch will be useful to folks upstream.
This doesn't use the resources of the overlord; we overwrite the resources for the podSpec. The memory is derived from your JAVA_OPTS (for your task), basically 1.2*(Xmx + Dbb) for the memory limit, and the cpu is always 1. You get a core per task just like we did before. We only grab certain items from the parent podSpec; you can see the logic in the adapter.
I am open to whatever changes you guys think are necessary. I have a few more ideas for subsequent PRs and I do like the configmap pod spec for users that want more control. That might be a configurable option we can provide in another PR.
That is super great news on the memory side, but not so great on the CPU side for us. We have 23 continuously running ingestion tasks and they only use a tenth of a core (100m) each, so fixing the resource request to one core is a 9x overprovision totaling 21 cores. A full core works for our compaction tasks, but those run for about 10 minutes every 4 hours. We'll explore some things with Druid in the near future, so we might use this PR before it lands and give some feedback.
… / location, for slower k8s clusters, this reduces locking time significantly
This pull request introduces 5 alerts when merging b12de57 into 4f0145f - view the new alerts on LGTM.com.
very nice! 🚀
I only reviewed this PR at a high level to view changes to core druid stuff and didn't look too closely at the kubernetes-specific parts because that isn't my area of expertise, but everything seems overall fine to me, and the kubernetes area does look like it has had some people looking at it, so I'll go ahead and leave an approval.
I think the discussions around supporting different podSpec/nodeSelector/(other kubernetes words I vaguely understand) are interesting. When we get there we should be thinking about whether we can overhaul the existing worker select strategy stuff to accommodate this or if it needs new, similar machinery, but being able to control which tasks run in which places is generally useful for mixed workloads, so it would definitely be a good investigation for follow-up.
Longer term it would be nice to eventually get this combined into the druid-kubernetes-extensions core extension so that operators only need to load a single extension to have a ZooKeeper-free Druid experience 🎉
/**
 * This test verifies injection for {@link ServerRunnable}s which are discoverable Druid servers.
 */
nit: intended change?
if ("k8s".equals(properties.getProperty("druid.indexer.runner.type", null))) { | ||
log.info("Running peon in k8s mode"); | ||
executorLifecycleConfig.setParentStreamDefined(false); | ||
} |
nit: this also seems a bit strange and special-case-ish for a specific thing, though I'm not sure introducing another property here is the solution...
The extension uses the task queue to limit how many concurrent tasks (K8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.

Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.encapsulatedTask: true`
`druid.indexer.task.encapsulatedTask` doesn't seem documented anywhere (or in javadocs); do we need to elaborate on what it does?
/**
 * Because the k8s task runner is an extension, we need to know the task runner type in the overlord resource
 */
default boolean isK8sTaskRunner()
Since this class is marked `@PublicApi`, I think we should consider naming this something less specific, like `isDirectTaskRunner` or `isWorkerFreeTaskRunner`, something to indicate that it is worker-free but can still run up to `getTotalTaskSlotCount` tasks (based on how it's being used in `OverlordResource`).
@churromorales, thanks for the fantastic contribution!
I expect that, once this is merged into master, and we all can play with it, we'll find additional things to improve. K8s is so complex, and there are so many different ways of using it, that it would be surprising if there were not opportunities for enhancements. But, the fastest way to find those improvements is by trying it out, as you've been doing.
Also, it is important to note that a) this is an extension, and b) it is marked experimental. That means it doesn't have to cover every possible use case from the get-go; it means folks should try it and see how it works.
Given all that, LGTM. There are a few active comments, and the build is still in progress. Once those are resolved, it has my +1.
@gianm that's absolutely the problem, I believe; thank you for pointing me in the right direction, I couldn't tell from the trace where the issue was. I'll fix that in a follow-up PR so we can get these two features working together. Thank you again for getting back to me about this.
This pull request introduces 5 alerts when merging fb6609e into 675fd98 - view the new alerts on LGTM.com.
This pull request introduces 5 alerts when merging a4b4cdb into fd7864a - view the new alerts on LGTM.com.
This pull request introduces 5 alerts when merging ab612d8 into 176934e - view the new alerts on LGTM.com.
This pull request introduces 5 alerts when merging 9ebadf4 into 176934e - view the new alerts on LGTM.com.
This pull request introduces 5 alerts when merging d40db8f into 176934e - view the new alerts on LGTM.com.
Thanks for the contribution @churromorales!
Thank you for the contribution @churromorales. This will be a major feature in the upcoming 25.0.0 release.
Description
Add an extension to allow tasks to be run as k8s jobs from the overlord, eliminating the need for a middle manager.
The core changes are as follows:
a. Because tasks run on separate pods, the task needs to set up its own filesystem directories.
b. Again because the tasks run on separate pods, we push the task logs and task reports from the task itself, in the cleanup method.
How it works
The KubernetesTaskRunner runs in the overlord process. When it has a request to launch a task, it goes to the K8s API and grabs its own podSpec (the overlord's). It takes that podSpec, modifies the necessary attributes (e.g. command, labels, env variables, etc.), takes the task.json, compresses and base64 encodes it, and then launches a K8s Job.
On startup, the K8s Job will unwrap the task.json env variable, write it to the appropriate directory, and run the task.
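A sketch of that unwrap step on the peon side; the env variable name and target path below are placeholders, not necessarily what the extension uses:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

// Illustrative sketch: base64-decode and gunzip the env variable, then write task.json to disk.
public class TaskSpecDecoder
{
  public static void main(String[] args) throws IOException
  {
    byte[] compressed = Base64.getDecoder().decode(System.getenv("TASK_JSON")); // placeholder name
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      Path taskFile = Path.of("/tmp/druid/task/task.json"); // placeholder path
      Files.createDirectories(taskFile.getParent());
      Files.copy(gzip, taskFile, StandardCopyOption.REPLACE_EXISTING);
    }
  }
}
```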
The KubernetesTaskRunner monitors the lifecycle of the task, just as the ForkingTaskRunner does, and returns the TaskStatus.
What if you are running Sidecars?
The config option `druid.indexer.runner.sidecarSupport` will support launching sidecars. I utilize kubexit (https://github.com/karlkfi/kubexit) to set up the spec such that when the main container completes, it terminates the sidecars. This is a known issue with k8s jobs and this is how I work around it.

Another nice side-effect
Because the launching of tasks has been decoupled from the service itself, the tasks run independently regardless of the state of the overlord process. You can shut down the overlord process, and when it comes back, it will go to the k8s api and get the status of all peon jobs regardless of phase (in flight, completed, failed, pending), do the proper bookkeeping for completed tasks, and resume monitoring tasks in flight.
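A rough sketch of that rediscovery step, assuming the fabric8 client; the label selector here is purely illustrative, not the extension's actual label:

```java
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.List;

// Illustrative sketch: on overlord restart, list peon jobs in every phase and resume tracking them.
public class PeonJobDiscovery
{
  static List<Job> listPeonJobs(KubernetesClient client, String namespace)
  {
    return client.batch().v1().jobs()
                 .inNamespace(namespace)
                 .withLabel("druid.task", "true") // illustrative label
                 .list()
                 .getItems();
  }
}
```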
To run a middle-manager-less druid, simply omit the middle manager from your deployment.
Make sure you also change `druid.processing.intermediaryData.storage.type=deepStorage`.
In your overlord config:
1. Add the `druid-kubernetes-overlord-extensions` to your extensions load list.
2. `druid.indexer.runner.type=k8s`
3. `druid.indexer.runner.namespace=<currentNamespace>`
4. `druid.indexer.queue.maxSize` controls max concurrent tasks; you must set it to a value less than the default of `Integer.MAX_VALUE`.
5. `druid.indexer.task.encapsulatedTask=true`
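Putting steps 1-5 together, a rough sketch of the overlord configuration might look like the following; the namespace and maxSize values are placeholders:

```properties
# Appended to your existing extensions load list
druid.extensions.loadList=["druid-kubernetes-overlord-extensions"]

druid.indexer.runner.type=k8s
druid.indexer.runner.namespace=druid
# Max concurrent tasks (k8s jobs); must be lower than the Integer.MAX_VALUE default
druid.indexer.queue.maxSize=30
druid.indexer.task.encapsulatedTask=true
```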
This PR has:
conntrack with minikube. Thus I will have to let travis run and figure things out from there.