Support for middle manager less druid, tasks launch as k8s jobs #13156

Merged 22 commits on Nov 3, 2022

Conversation

Contributor

@churromorales commented Sep 29, 2022

Description

Add an extension to allow tasks to be run as k8s jobs from the overlord, eliminating the need for a middle manager.

The core changes are as follows:

  1. Refactored the arguments to CliPeon to be more generic.
  2. Added setup and cleanup methods to AbstractTask.
    a. Because tasks run on separate pods, each task needs to set up its own filesystem directories.
    b. Again because tasks run on separate pods, the task itself pushes its logs and task reports in the cleanup method.
  3. A few other small changes to core were required for tasks to run independently on their own.

How it works

The KubernetesTaskRunner runs in the overlord process. When it receives a request to launch a task, it goes to the K8s API and grabs its own PodSpec (the overlord's), then modifies the necessary attributes (e.g. command, labels, env variables). It takes the task.json, compresses and base64-encodes it, then launches a K8s Job.

On startup, the K8s Job unwraps the task.json env variable, writes it to the appropriate directory, and runs the task.

The KubernetesTaskRunner monitors the lifecycle of the task, just as the ForkingTaskRunner does, and returns the TaskStatus.
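As a rough illustration of the "compress and base64 encode" step (the class and method names below are hypothetical; the real logic lives in the extension's task adapter), the task.json handling amounts to gzip plus base64:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper sketching the encoding step described above; the peon
// reverses it on startup before writing task.json to its task directory.
public final class TaskSpecEncoder
{
  public static String encode(String taskJson) throws IOException
  {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(baos)) {
      gzip.write(taskJson.getBytes(StandardCharsets.UTF_8));
    }
    // The resulting string is compact enough to ship as an env variable
    // on the Job spec.
    return Base64.getEncoder().encodeToString(baos.toByteArray());
  }
}
```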

What if you are running Sidecars?

The config option druid.indexer.runner.sidecarSupport enables launching sidecars. I utilize kubexit (https://github.com/karlkfi/kubexit) to set up the spec such that when the main container completes, it terminates the sidecars. This is a known issue with K8s Jobs, and this is how I work around it; a sketch of the pattern follows.
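For readers unfamiliar with kubexit, the wiring looks roughly like this (container names and paths here are placeholders, not the extension's literal output): each container's command is wrapped in kubexit, and a shared "graveyard" volume carries tombstones so a sidecar exits when the main container dies.

```yaml
# Hedged sketch of the kubexit pattern, not the extension's exact output.
containers:
  - name: main                       # the peon task container
    command: ["/kubexit/kubexit", "/peon.sh"]
    env:
      - name: KUBEXIT_NAME
        value: main
      - name: KUBEXIT_GRAVEYARD      # shared tombstone directory
        value: /graveyard
  - name: sidecar                    # e.g. istio-proxy, a splunk forwarder
    command: ["/kubexit/kubexit", "/usr/local/bin/sidecar"]
    env:
      - name: KUBEXIT_NAME
        value: sidecar
      - name: KUBEXIT_GRAVEYARD
        value: /graveyard
      - name: KUBEXIT_DEATH_DEPS     # die when "main" writes its tombstone
        value: main
```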

Another nice side-effect

Because the launching of tasks has been decoupled from the service itself, tasks run independently of the state of the overlord process. You can shut down the overlord process, and when it comes back up, it will go to the K8s API, get the status of all peon jobs regardless of phase (in flight, completed, failed, pending), do the proper bookkeeping for completed tasks, and resume monitoring tasks in flight.

To run a middle-manager-less Druid, simply omit the middle managers from your deployment.

Make sure you also set druid.processing.intermediaryData.storage.type=deepStorage

In your overlord config:
1. Add the druid-kubernetes-overlord-extensions to your extensions load list.
2. Set druid.indexer.runner.type=k8s.
3. Set druid.indexer.runner.namespace=<currentNamespace>.
4. druid.indexer.queue.maxSize controls the maximum number of concurrent tasks; you must set it to a value lower than its default of Integer.MAX_VALUE.
5. Set druid.indexer.task.encapsulatedTask=true.
A complete example is sketched below.
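Pulled together, a minimal overlord runtime.properties for this mode might look like the following (the namespace and maxSize values are placeholders to adapt to your cluster):

```properties
# Hedged example; the namespace and maxSize values are illustrative.
druid.extensions.loadList=["druid-kubernetes-overlord-extensions"]
druid.indexer.runner.type=k8s
# The namespace Druid itself runs in:
druid.indexer.runner.namespace=druid
# Maximum number of concurrent tasks (K8s jobs):
druid.indexer.queue.maxSize=10
druid.indexer.task.encapsulatedTask=true
```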
This PR has:

  • [x] been self-reviewed.
    • [x] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [x] added documentation for new or modified features or behaviors.
  • [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [x] added or updated version, license, or notice information in licenses.yaml
  • [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests. (These have been added, but the k8s integration tests only work on a Linux machine, as they use conntrack with minikube. Thus I will have to let Travis run and figure things out from there.)
  • been tested in a test Druid cluster.

lgtm-com bot commented Sep 29, 2022

This pull request introduces 5 alerts when merging 190f8dd into ce5f55e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@@ -0,0 +1,153 @@
#!/bin/sh
Contributor

Meta commentary: it would be cool (and I don't think too crazy) if there was a Cli target that is like "task" which essentially does a remote bootstrap.

I.e. it could start up and the first thing it does is read some config for the overlord process, "phone home" to ask for the task spec and runtime properties, and then use those to bootstrap.

This might open up security holes, so it would probably have to be done with some sort of shared secret or something, maybe? But anyway, might simplify this script into something that is relatively generic and not even k8s-dependent.

Could also be done as an evolution of this change post-merge.

Contributor Author

So when creating this patch, I was hoping to minimize core changes to increase the odds of this patch getting accepted. In all honesty, I believe if this becomes a first-class feature, we can fix some other hacky stuff that was done. I definitely think having a Cli command for the task is a good idea; this is something I could put up a PR for if/when this gets merged.


lgtm-com bot commented Sep 30, 2022

This pull request introduces 5 alerts when merging afff509 into ce5f55e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Sep 30, 2022

This pull request introduces 5 alerts when merging 2f06ec1 into ebfe1c0 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

Contributor

@paul-rogers left a comment

@churromorales, this is a very impressive PR! Getting K8s support into Druid is HUGE! I made a first pass to get the lay of the land: comments are all nits in this pass.

Add configuration option to disable http/https proxy for the k8s client
Update the docs to provide more detail about sidecar support

lgtm-com bot commented Oct 4, 2022

This pull request introduces 5 alerts when merging 2e23952 into 92d2633 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@georgew5656
Contributor

This looks great @churromorales, I've started looking through some of the Kubernetes-related code today; will hopefully finish up by tomorrow.

@abhishekagarwal87 self-requested a review October 12, 2022 09:32
Contributor

@paul-rogers left a comment

Working my way through a second, detailed review. The code looks quite solid so far. Most of the comments are questions, doc enhancement requests, and style suggestions. More to come.

# NOTE: this is a 'run' script for the stock tarball
# It takes 1 required argument (the name of the service,
# e.g. 'broker', 'historical' etc). Any additional arguments
# are passed to that service.
Contributor

This is similar to the standard script. Is it a copy? If a copy, then perhaps we can avoid an actual copy: add a rule to the distribution project (likely in assembly.xml) to copy the existing script to peon.sh where needed.

If this version has changes, can they be applied to the standard script somehow to avoid the copy?

Contributor Author

The peon.sh is copied in the Dockerfile; do we need to add a rule here? I just do it in the Dockerfile itself, so the user won't have to worry: if anything changes with the script, it gets automatically updated. I believe that is the same thing that happens with the druid.sh script.

Contributor

Can we add a comment here specifying that this is similar to druid.sh and is used exclusively for the kubernetes-overlord-extension?

docs/development/extensions-contrib/k8s-jobs.md

## How it works

It takes the podSpec of your `Overlord` pod and creates a kubernetes job from this podSpec. Thus if you have sidecars such as splunk, hubble, istio it can optionally launch a task as a k8s job. All jobs are natively restorable, they are decoupled from the druid deployment, thus restarting pods or doing upgrades has no effect on tasks in flight. They will continue to run and when the overlord comes back up it will start tracking them again.
Contributor

It -> "The K8s extension"

Please capitalize proper nouns.

Contributor

Perhaps provide just a bit more background for newbies. Where does the pod spec come from? Something like:

With this extension, your Overlord tasks run in a Kubernetes pod. You define that pod using a pod spec. The pod spec names a Docker image with Druid installed, identifies K8s settings, provides Druid configuration, passes in secrets, etc. You create the pod spec outside of Druid. We suggest you test the spec directly in K8s before using it with Druid.

I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.

Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?

Contributor

The above says "pod spec". Is that literally what we need? Or, do we provide a pod template? Per the docs,

> Controllers for workload resources create Pods from a pod template and manage those Pods on your behalf.
>
> PodTemplates are specifications for creating Pods, and are included in workload resources such as Deployments, Jobs, and DaemonSets.

Would I be on the right track to think of OL as a controller in this context?

Or, does this extension use the pod spec as the template (because, say, the other template properties don't apply here)?

Contributor

There seems to be one pod spec for a cluster. Should I then ensure that the pod is large enough for my largest task? Would a future enhancement be the ability to override the pod spec per-task, so that a light-weight task (the MSQ controller) takes fewer resources than a heavy-weight task (an MSQ worker)?

Contributor Author

So the reason for the podSpec is that I didn't want to take a brand new docker image and create a job spec for it. I don't know your cluster config. Suppose you have secrets mounted, volumes set up your way, env variables, certs, annotations for things like istio, etc. The parent podSpec logic is there to ensure your peon task has those things and you don't have to worry about it. But I do take that pod spec and massage it. I do things like change the resources required. For CPU we always give the task a single core (just like what Druid was always doing). For memory we take the JVM opts you pass and configure the container resources from that. I believe it's something like (Xmx + dbb) * 1.2. So while the CPU resources are fixed at a core, the memory is deduced from your JVM opts. I want to make sure I answered your question correctly here; please let me know if things don't make sense.
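In code form, the memory rule described here comes out to roughly the following sketch (the constant and names follow the comment above; treat the exact parsing and rounding as assumptions, with the real logic in the K8sTaskAdapter):

```java
// Hedged sketch of deriving container resources from the task's JVM opts.
static long containerMemoryBytes(long heapBytes, long directMemoryBytes)
{
  // limit = (Xmx + MaxDirectMemorySize) * 1.2, per the description above
  return (long) ((heapBytes + directMemoryBytes) * 1.2);
}

static final String CONTAINER_CPU = "1"; // one core per task, as before
```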

Contributor Author

> Perhaps provide just a bit more background for newbies. Where does the pod spec come from? Something like:
>
> With this extension, your Overlord tasks run in a Kubernetes pod. You define that pod using a pod spec. The pod spec names a Docker image with Druid installed, identifies K8s settings, provides Druid configuration, passes in secrets, etc. You create the pod spec outside of Druid. We suggest you test the spec directly in K8s before using it with Druid.
>
> I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
>
> Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?

Maybe I wasn't clear, you don't have to do anything here. The podSpec comes from the overlord pod, the pod that launches the task. The K8sTaskAdapter takes the overlord spec and modifies it to run a peon k8s job. The user has to do nothing here.

Contributor Author

> I'm sure some (all?) of the above is wrong: it just identifies what I, as a newbie to this extension, would need to know. Focus in particular on anything Druid-specific I need to provide in the pod spec.
>
> Also, where do I put the pod spec? In K8s somewhere? On the local disk of each Druid node?

Should I just exclude the podSpec part? Users have to do nothing here; I am just explaining how it works at a high level.

Contributor

It makes sense to get the podSpec from the overlord. But I suppose that, as a user, I can override some elements in the spec if I want?

Contributor Author

Yes, definitely. Right now I give you the ability to override annotations and labels. In a later PR we can make this more configurable for users. I have used this patch for a few Druid clusters and did not need to override anything else, but that is just my experience; I think going forward we can make this more configurable. The goal of this patch was to make it as easy as possible for users to switch from using MMs to K8s jobs. Currently it's: add 4 required configurations to the overlord and you are good to go.

Contributor

For sure, I don't think we need those features right away.

The extension uses the task queue to limit how many concurrent tasks (k8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.

Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.enableTaskLevelLogPush: true`
Contributor

From this, can I infer that, with this extension, all tasks run in K8s? There is no mixed mode where I can choose the hosting type per-task? (Not sure mixed usage is useful, just asking.)

If use of this extension is binary, then perhaps note somewhere that the cluster should be configured with no MiddleManagers and no Indexers. (If that is, in fact, what I should do.)

Contributor Author

That is correct; there is no mixed-mode support. If you turn on k8s mode, all tasks are k8s tasks. We have been running this patch and we don't launch any MMs. If you deploy using the operator, this is quite easy to do, as you just remove the middleManager section of the spec completely. I will add that to the docs, good point.


To use this extension please make sure to [include](../extensions.md#loading-extensions)`druid-kubernetes-overlord-extensions` in the extensions load list for your overlord process.

The extension uses the task queue to limit how many concurrent tasks (k8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.
Contributor

In other words, the extension will run K8s jobs up to the limit specified? The assumption is that K8s has at least this much capacity? What happens if I configure the maxSize to be greater than my K8s capacity? Will the "extra" Druid tasks remain queued?

Druid recently introduced MSQ, which launches a controller, then launches a wave of workers. Could we get into a deadlock situation if K8s has enough capacity to launch only half the workers? If so, we should note to users to set maxSize to less than the K8s capacity.

Contributor Author

So this one was a tricky one. Some users have quotas set up for their namespace while others do not, quotas such as how many jobs I can have running concurrently. I believe if you do this, you will just have unscheduled pods while resources are not available, and then when they become available the pods will launch. If you use AWS, you could perhaps tie in the autoscaler with logic stating: if I have X pods unscheduled, then scale up EC2 instances; and scale down with the reverse logic: if I have EC2 instances not using resources, scale things down.

To answer your question: yes, we should always ask users to set maxSize to less than the K8s capacity, unless they have autoscaling configured. In that case it's up to the user to determine what they can potentially scale up to.

// join queries
if (task.supportsQueries()) {
  command.add("--loadBroadcastSegments");
  command.add("true");
Contributor

General question: if a peon supports queries, it will have incoming connections. How do we ensure that the Peon's IP address is visible to the Broker? K8s usually has some kind of overlay network. Are we assuming Druid runs within K8s so it is on that same network? Or, that somewhere we've set up a proxy into the overlay network? It's been a while since I used K8s, perhaps the proxy isn't needed? Or, the networks are merged in AWS? Is that true on plain-Jane K8s?

Contributor Author

Good question. For this patch, I am forcing the tasks to run in the current namespace that Druid is running in. You could easily have the tasks run in a different namespace, but it would require a few changes to this patch (not many, though). When a job gets launched, we wait for the pod to start up; once the pod starts, we grab the IP address of the pod and broadcast that as the TaskLocation (roughly as in the sketch below). There would be no need for a proxy if you wanted to go across namespaces in the same context, but if you had Druid running in namespace A and tasks running in namespace B, you would have to set up your K8s networking to communicate bidirectionally.
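For concreteness, the wait-for-IP step might look like this with the fabric8 Kubernetes client (shown purely as an illustration; the PR's actual client choice and timeouts may differ, and `podName`/`port` are placeholders):

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.concurrent.TimeUnit;
import org.apache.druid.indexer.TaskLocation;

// Hedged sketch: block until the peon pod has an IP, then publish it
// as the task's location so the broker can reach the peon directly.
static TaskLocation waitForPeonLocation(KubernetesClient client, String namespace, String podName, int port)
{
  Pod pod = client.pods()
                  .inNamespace(namespace)
                  .withName(podName)
                  .waitUntilCondition(p -> p.getStatus() != null && p.getStatus().getPodIP() != null,
                                      5, TimeUnit.MINUTES);
  return TaskLocation.create(pod.getStatus().getPodIP(), port, -1);
}
```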

Contributor Author

The goal of this patch is that if you are running Druid on K8s, you shouldn't need to do anything except add the extension and a few configuration options to the overlord, and you are good to go. In the gotchas section of the README there are some things you have to look out for with sidecars, which I mentioned. Also, your service account must have the ability to deal with K8s Jobs; I included a sample Role / RoleBinding as an example in the README.md (the sketch below shows the general shape). We have been using this on some of our clusters for a while now and we haven't really had to add any extra configuration or do anything special in K8s to get this to work.
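For orientation, RBAC of roughly this shape is what such a sample would grant (all names and the namespace here are placeholders, not the README's literal example):

```yaml
# Hedged sketch: let the overlord's service account manage peon Jobs/Pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: druid-k8s-task-runner
  namespace: druid
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: druid-k8s-task-runner
  namespace: druid
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: druid-k8s-task-runner
subjects:
  - kind: ServiceAccount
    name: druid-overlord
    namespace: druid
```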

mainContainer.setPorts(Lists.newArrayList(httpsPort, tcpPort));
}

protected void addEnvironmentVariables(Container mainContainer, PeonCommandContext context, String taskContents)
Contributor

similar to the other comments, I think we want to make additional things configurable here if possible

Contributor Author

Agreed, I will make envs and annotations configurable.

Contributor Author

@churromorales Oct 13, 2022

Actually, one question: for env variables being configurable, do you have a use case? It takes all the env variables of the running overlord pod; would there be a situation where you would need something additional here? The container shouldn't need more than what Druid has to work, right? Annotations I will add, because that could be something you have set up for K8s, like some autoscaling or networking annotation you want to add. Labels are already configurable.

Contributor Author

Annotations and labels are configurable. If you can provide me with an example of where we would need env variables, I would be happy to make them configurable as well. But I personally think that could potentially cause more harm than good.

Contributor

I thought I added it as a comment, but what I actually think is most important to be configurable is nodeSelector/affinity/tolerations, in case people want to configure what node pools to run on.

If nodeSelector is not set, then maybe we can use whatever the overlord uses as its node selector.

Env variables are not as important, but I could see something like DD_TRACE_AGENT_URL for enabling tracing with Datadog.

Contributor

I also like the idea of using a custom pod spec, but that seems like a bigger change that we could do as a follow-up.

Contributor Author

You have a valid point; maybe in a future PR we could add some ConfigMap where you can define a podSpec template. That should be pretty easy to do.

Contributor

This may not be an issue, but is there a limit in K8s on how large an env variable value can be? The task JSON could be large.

Contributor Author

There is, and it's no different from the ConfigMap max size either. That is why I compress and base64-encode the task.json file when I pass it in the spec.


lgtm-com bot commented Oct 13, 2022

This pull request introduces 5 alerts when merging 6c13bf0 into 45dfd67 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@churromorales
Contributor Author

churromorales commented Oct 13, 2022

I noticed a slight issue when testing on heavily loaded K8s clusters, those that take a while to spin pods up and down.

Currently, the way things work:
1. The overlord launches a task and waits for the pod to come up; it not only waits for a pod IP to be available, but also opens a socket connection to the peon process's web server. Once you have a socket connection, the task is considered started.
2. Then it monitors the job to completion, but this is a blocking call; after the job completes, it pushes task reports and logs and then deletes the peon K8s job.

What was problematic here is that I noticed the process was complete and the task itself was finished, but the K8s operations were slow on my cluster. The task itself took less than a minute to complete, but the entire K8s lifecycle was taking much longer, around 10 minutes (a very heavily loaded cluster).

Thus the task status only updates in the finally block of K8sTaskRunner.run(), so the task lock is only released when the task finishes and the K8s job is deleted. That is technically correct, but in reality the task JVM process itself could have completed long before this. I have another commit that I will test on our clusters; a brief description of the patch is this (see the sketch after the list):

  1. The overlord launches a task and waits for the pod to come up. No more opening a socket to the web server; instead I added 2 TaskActions: one to update the TaskStatus, one to update the TaskLocation.
  2. Now, in the AbstractTask, the setup method updates its own location before the run method is called. If the overlord goes down for whatever reason, that is fine; we don't lose anything, really. The call fails and the task itself dies.
  3. In the cleanup method of the AbstractTask we also update the TaskLocation to unknown and update the status to success or failure. So when the process exits, we can give up the lock and things won't be blocked. In case the pod goes away during the cleanup, that is okay; the overlord will still monitor the job and report the correct results, albeit a bit slower.
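In Druid terms, the flow being described is roughly the following (the action names follow this description; the exact signatures and error handling are illustrative, not the PR's literal code):

```java
// Hedged sketch of the encapsulated-task lifecycle inside AbstractTask.run().
public TaskStatus run(TaskToolbox toolbox) throws Exception
{
  // setup(): the task announces its own location up front, instead of the
  // overlord opening a socket to the peon's web server.
  toolbox.getTaskActionClient().submit(new UpdateLocationAction(myLocation));

  TaskStatus status;
  try {
    status = runTask(toolbox);
  }
  catch (Exception e) {
    status = TaskStatus.failure(getId(), e.getMessage());
  }

  // cleanup(): push logs/reports, clear the location, and report the final
  // status so the lock is released even if deleting the K8s job is slow.
  toolbox.getTaskActionClient().submit(new UpdateLocationAction(TaskLocation.unknown()));
  toolbox.getTaskActionClient().submit(new UpdateStatusAction(status.isSuccess() ? "success" : "failure"));
  return status;
}
```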

I thought about this approach originally but didn't want to make too many changes to core. After running this patch on our clusters, though, I do think this is the best approach. I will call it out in a separate commit so you can review this work; it doesn't seem too invasive to me, but we should focus on correctness here and not hold the locks longer than necessary, as we were doing in the ForkingTaskRunner.

I will test this out over the next day or so and then add the commit to this PR.

@Fryuni

Fryuni commented Oct 14, 2022

Hey, this is an awesome PR, thanks for all the work!
I am concerned about this part, though:

> The KubernetesTaskRunner runs in the overlord process. When it receives a request to launch a task, it goes to the K8s API and grabs its own PodSpec (the overlord's), then modifies the necessary attributes (e.g. command, labels, env variables). It takes the task.json, compresses and base64-encodes it, then launches a K8s Job.

The resource use for overlords and tasks is vastly different; using the same podSpec for both would require reserving much more resources for the overlords and the tasks than necessary.

One idea is that instead of reading from the overlord pod itself, it would be better to provide the name of a config map containing the podSpec.

Another idea: the middle-managers have tiers, which we can use to assign tasks to middle-managers with different resource allocations. Those tiers could be used to define different podSpecs for different tasks without changing the tasks API.

Those changes can be done on a later PR. All in all, this PR is a huge step forward in making Druid more scalable for cloud environments.

@churromorales
Contributor Author

churromorales commented Oct 14, 2022

Thank you for your kind words. I hope this patch will be useful to folks upstream.

> The resource use for overlords and tasks is vastly different; using the same podSpec for both would require reserving much more resources for the overlords and the tasks than necessary.

This doesn't use the resources of the overlord; we overwrite the resources in the podSpec. The memory is derived from your JAVA_OPTS (for your task), basically 1.2 * (Xmx + Dbb) for the memory limit, and the CPU is always 1. You get a core per task, just like we did before. We only grab certain items from the parent podSpec. You can see the logic in the K8sTaskAdapter.

I am open to whatever changes you guys think are necessary. I have a few more ideas for subsequent PRs and I do like the configmap pod spec for users that want more control. That might be a configurable option we can provide in another PR.

@Fryuni

Fryuni commented Oct 14, 2022

> This doesn't use the resources of the overlord; we overwrite the resources in the podSpec. The memory is derived from your JAVA_OPTS (for your task), basically 1.2 * (Xmx + Dbb) for the memory limit, and the CPU is always 1.

That is super great news on the memory side, but not so great on the CPU side for us. We have 23 continuously running ingestion tasks and they only use a tenth of a core (100m) each, so fixing the resource request to one core is a 9x overprovision totaling 21 cores. A full core works for our compaction tasks, but those run for about 10 minutes every 4 hours.

We'll explore some things with Druid in the near future, so we might use this PR before it lands and give some feedback.

… / location, for slower k8s clusters, this reduces locking time significantly


lgtm-com bot commented Oct 28, 2022

This pull request introduces 5 alerts when merging b12de57 into 4f0145f - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

Member

@clintropolis left a comment

very nice! 🚀

i only reviewed this PR at a high level to view changes to core druid stuffs and didn't look too closely at the kubernetes-specific stuff because that isn't my area of expertise, but everything seems overall fine to me, and the kubernetes area does look like it has had some people looking at it, so I'll go ahead and leave an approval.

I think the discussions around supporting different podSpec/nodeSelector/(other kubernetes words I vaguely understand) are interesting; when we get there we should think about whether we can overhaul the existing worker select strategy stuff to accommodate this, or whether it needs new, similar machinery. Being able to control which tasks run in which places is generally useful for mixed workloads, so it would definitely be a good investigation for follow-up.

Longer term it would be nice to eventually get this combined into the druid-kubernetes-extensions core extension so that operators need to only load a single extension to have a Zookeeper-free Druid experience 🎉

Comment on lines -32 to -34
/**
* This test verifies injection for {@link ServerRunnable}s which are discoverable Druid servers.
*/
Member

nit: intended change?

Comment on lines +225 to +228
if ("k8s".equals(properties.getProperty("druid.indexer.runner.type", null))) {
log.info("Running peon in k8s mode");
executorLifecycleConfig.setParentStreamDefined(false);
}
Member

nit: this also seems a bit strange and special caseish for a specific thing, though i'm not sure introducing another property here is the solution...

The extension uses the task queue to limit how many concurrent tasks (K8s jobs) are in flight so it is required you have a reasonable value for `druid.indexer.queue.maxSize`. Additionally set the variable `druid.indexer.runner.namespace` to the namespace in which you are running druid.

Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.encapsulatedTask: true`
Member

druid.indexer.task.encapsulatedTask doesn't seem to be documented anywhere (or in javadocs); do we need to elaborate on what it does?

/**
 * Because the k8s task runner is an extension, we need to know the task runner type in the overlord resource
 */
default boolean isK8sTaskRunner()
Member

since this class is marked @PublicApi, I think we should consider naming this something less specific, like isDirectTaskRunner or isWorkerFreeTaskRunner, something to indicate that it is worker-free but can still run up to getTotalTaskSlotCount tasks (based on how it's being used in OverlordResource)

Contributor

@paul-rogers left a comment

@churromorales, thanks for the fantastic contribution!

I expect that, once this is merged into master, and we all can play with it, we'll find additional things to improve. K8s is so complex, and there are so many different ways of using it, that it would be surprising if there were not opportunities for enhancements. But, the fastest way to find those improvements is by trying it out, as you've been doing.

Also, it is important to note that a) this is an extension, and b) it is marked experimental. That means it doesn't have to cover every possible use case from the get-go; it means folks should try it and see how it works.

Given all that, LGTM. There are a few active comments, and the build is still in progress. Once those are resolved, it has my +1.

@churromorales
Contributor Author

> * Avoid creating TaskLocations with `null` host and non-`-1` values for `port` and `tlsPort`. This may be happening in `K8sWorkItem` if `mainPod.getStatus().getPodIP()` is `null`. Do you think this is possible? To keep things coherent I suggest enforcing it in the TaskLocation constructor, which could throw an exception if host is `null` and `port` and `tlsPort` are anything other than `-1`. (This would ensure that `null` host always compares equal to `TaskLocation.unknown()`.)

@gianm that's absolutely the problem, I believe; thank you for pointing me in the right direction, I couldn't tell from the trace where the issue was. I'll fix that in a follow-up PR so we can get these two features working together. Thank you again for getting back to me about this.


lgtm-com bot commented Oct 31, 2022

This pull request introduces 5 alerts when merging fb6609e into 675fd98 - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 1, 2022

This pull request introduces 5 alerts when merging a4b4cdb into fd7864a - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 1, 2022

This pull request introduces 5 alerts when merging ab612d8 into 176934e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 2, 2022

This pull request introduces 5 alerts when merging 9ebadf4 into 176934e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument


lgtm-com bot commented Nov 2, 2022

This pull request introduces 5 alerts when merging d40db8f into 176934e - view on LGTM.com

new alerts:

  • 3 for Uncontrolled data used in path expression
  • 2 for Unused format argument

@a2l007
Contributor

a2l007 commented Nov 3, 2022

Thanks for the contribution @churromorales!
Looking forward to the follow-up PRs to handle the outstanding items identified from this PR.

@a2l007 merged commit e5ad24f into apache:master Nov 3, 2022
@abhishekagarwal87
Contributor

Thank you for the contribution @churromorales. This will be a major feature in the upcoming 25.0.0 release.
