
A new Druid MiddleManager Resource Scheduling Model Based on K8s (MOK) #10910

Conversation

@zhangyue19921010 (Contributor) commented Feb 20, 2021

Fixes #10824.

Description

Please read #10824 for details.

This PR adds a new extension named druid-kubernetes-middlemanager-extensions in extensions-contrib, so there is no impact on Druid core. It has already been tested on a DEV Druid cluster running on K8s, covering index_kafka, index, index_parallel, and compact tasks.

Now we can use a single 2-core, 2Gi-memory MiddleManager pod to control dozens or even hundreds of Peon pods, and there is no need for the MiddleManager to reserve a lot of resources in advance.

Also, different kinds of tasks can now use different configurations, including CPU and memory resources.

As for autoscaling: there may be no need for a MiddleManager autoscaler in this scenario, because the resources occupied by the MiddleManager are small enough. Combined with #10524, Druid has the (tested) ability to create Peon pods and automatically scale the number of pods.
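For illustration only (this is not the PR's actual implementation), here is a minimal Java sketch of how a runner might resolve per-task-type pod resources. The base property names druid.peon.pod.cpu and druid.peon.pod.memory appear in the diff discussed later in this thread; the per-task-type suffix convention shown here is hypothetical.

import java.util.Properties;

// Illustration only: resolve a pod resource setting for a task type, falling back to a
// global default. The base property names mirror constants visible in the diff below;
// the per-task-type suffix (e.g. druid.peon.pod.memory.index_parallel) is hypothetical.
public final class PeonPodResourceResolver
{
  private PeonPodResourceResolver() {}

  public static String resolve(Properties props, String baseKey, String taskType, String defaultValue)
  {
    String perTaskType = props.getProperty(baseKey + "." + taskType);
    return perTaskType != null ? perTaskType : props.getProperty(baseKey, defaultValue);
  }

  public static void main(String[] args)
  {
    Properties props = new Properties();
    props.setProperty("druid.peon.pod.memory", "2Gi");
    props.setProperty("druid.peon.pod.memory.index_parallel", "6Gi");
    // Prints 6Gi for index_parallel; any other task type would fall back to 2Gi.
    System.out.println(resolve(props, "druid.peon.pod.memory", "index_parallel", "1Gi"));
  }
}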

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zhangyue19921010 (Contributor, Author)

All the CI jobs have passed now, except https://travis-ci.com/github/apache/druid/jobs/484742979, which failed due to a network issue. A retry will probably succeed.

@zhangyue19921010 (Contributor, Author)

Will add documentation for this feature soon, explaining how to enable it.

@zhangyue19921010 (Contributor, Author) commented Feb 26, 2021

Modified the current K8s-related IT job (job 76). Now this job:

  1. Builds a Druid cluster on K8s.
  2. Has no ZooKeeper dependency.
  3. Has the MiddleManager launch Peon pods to ingest data.

Also, CI passes now, which means this PR is ready for review!

Users can also run the following commands in their local environment to try this new feature:

export CONFIG_FILE='k8s_run_config_file.json'
export IT_TEST='-Dit.test=ITNestedQueryPushDownTest'
export POD_NAME=int-test
export POD_NAMESPACE=default
export BUILD_DRUID_CLUSTER=true
export MAVEN_SKIP="-Pskip-static-checks -Ddruid.console.skip=true -Dmaven.javadoc.skip=true"
mvn -B clean install -q -ff -Pskip-static-checks -Ddruid.console.skip=true -Dmaven.javadoc.skip=true -Pskip-tests -T1C
mvn verify -pl integration-tests -P int-tests-config-file ${IT_TEST} ${MAVEN_SKIP} -Dpod.name=${POD_NAME} -Dpod.namespace=${POD_NAMESPACE} -Dbuild.druid.cluster=${BUILD_DRUID_CLUSTER}

@glasser (Contributor) commented Feb 26, 2021

We'd love to be able to run peons as k8s pods.

It does kind of raise the question of why we would want middlemanagers at all rather than just letting the overlord create k8s pods... But if it's a lot easier to build this way, then that's reasonable.

@zhangyue19921010 (Contributor, Author)


Hi @glasser,
Letting the Overlord create K8s pods is a huge change, touching the API, the task scheduling model, etc.
Letting the MiddleManager create Peon pods and control their lifecycle is much easier. Maybe we can let the Overlord create pods directly and disable the MM in the future.

@gianm (Contributor) commented Mar 2, 2021

> Letting the Overlord create K8s pods is a huge change, touching the API, the task scheduling model, etc.

That's too bad, since one of the original goals of the TaskRunner interface is that it could be used to allow the Overlord to launch tasks directly as YARN applications. (At the time, YARN was more popular than K8S 🙂)

I guess it didn't live up to this goal, if you found it easier to have the MMs launch K8S pods.

Anyway, I'm not a K8S expert, but this seems like a very interesting idea.

@zhangyue19921010 (Contributor, Author) commented Mar 3, 2021

Hi @gianm, thanks for your attention. Maybe I didn't make my point clear.
In my opinion, I just follow the current workflow: Overlord -> MiddleManager -> Peon.
The difference between ForkingTaskRunner and the new K8sForkingTaskRunner (something like ThreadingTaskRunner for CliIndexer) is that the MiddleManager launches the Peon as a pod in Kubernetes rather than as a child process on the same machine.

Because the workflow stays the same, we don't need to consider changes to the API, task lifecycle control, or anything else. That's why I believe it's easier to have the MMs launch K8s pods :)
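As a rough illustration of that difference (not the code in this PR, and assuming the fabric8 Kubernetes client with its 5.x-style API): instead of building a ProcessBuilder for a local child process, the runner builds a Pod spec carrying the same peon command and submits it to the API server. The image name and namespace below are placeholders; the label key is taken from the diff later in this thread.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.List;

// Sketch only: launch a peon as a K8s pod instead of a local child process.
// The client library, image, and namespace are assumptions made for illustration.
public class PeonPodLauncherSketch
{
  public Pod launchPeonPod(String taskId, List<String> peonCommand)
  {
    // A real implementation would sanitize the task id before using it in pod names/labels.
    Pod pod = new PodBuilder()
        .withNewMetadata()
          .withName("peon-" + taskId)
          .addToLabels("druid.ingest.task.id", taskId) // label key taken from the diff below
        .endMetadata()
        .withNewSpec()
          .withRestartPolicy("Never")
          .addNewContainer()
            .withName("peon")
            .withImage("example/druid:latest")         // hypothetical image
            .withCommand(peonCommand)                  // the same "internal peon ..." command a ForkingTaskRunner would build
          .endContainer()
        .endSpec()
        .build();

    try (KubernetesClient client = new DefaultKubernetesClient()) {
      return client.pods().inNamespace("default").create(pod);
    }
  }
}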

@nishantmonu51 (Member)


I think it's possible, and a valid use case, to use ForkingTaskRunner/RemoteTaskRunner/K8sForkingTaskRunner directly on the Overlord; it's just a configuration change on the Overlord, unless the K8sForkingTaskRunner depends on something specific from the MiddleManager.

@nishantmonu51 (Member)

@zhangyue19921010: Can you add some user docs on how to use the new runner and the configurations required to get started?

@zhangyue19921010 (Contributor, Author)


Sure, will add docs ASAP.

private static final String DRUID_PEON_JAVA_OPTS = "druid.peon.javaOpts";
private static final String DRUID_PEON_POD_MEMORY = "druid.peon.pod.memory";
private static final String DRUID_PEON_POD_CPU = "druid.peon.pod.cpu";
private static final String LABEL_KEY = "druid.ingest.task.id";

Review comment from a Member:

The task runner seems to hardcode most of the parameters/configs it uses. Is it possible to use something like Helm charts and templates? The goal would be for a DevOps person to maintain those templates, while on the Druid side the task runner would use those templates to create a Job.

":", ""),
".", ""),
"-", ""));
String label_value;

Review comment from a Member:

Nit: refactor and extract this into a sanitizeName method.
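A minimal sketch of what the suggested sanitizeName extraction might look like; the replacement rules simply mirror the chained calls visible in the fragment above and may not match the PR exactly.

public final class LabelNames
{
  private LabelNames() {}

  // Strip the characters that the chained replacements in the fragment above remove;
  // this is illustrative and may not match the PR's exact sanitization rules.
  static String sanitizeName(String name)
  {
    return name
        .replace(":", "")
        .replace(".", "")
        .replace("-", "");
  }
}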

command.add(StringUtils.format("-Ddruid.tlsPort=%d", tlsChildPort));

// Let tasks know where they are running on.
// This information is used in native parallel indexing with shuffle.

Review comment from a Member:

Curious: how does shuffle work with the K8s task runner? IIRC the shuffle implementation relies on the MM being available to serve data processed by a task even after the task finishes. In this case, since the task pod is deleted after the task finishes, it won't work with shuffle enabled. This expected behavior needs to be verified and documented.


Reply from a Contributor:

Yes, this is exactly the question I have. In a K8s world, we would prefer not to have PVCs for the task or keep any state. Do you know if the deep-storage shuffle implementation could work here? I don't know enough about what has been done recently to have an opinion. I would think the support would need to be there, or native batch wouldn't work, which would be concerning.


while (!phase.equalsIgnoreCase("Failed") && !phase.equalsIgnoreCase("Succeeded")) {
LOGGER.info("Still waiting for peon pod [%s/%s] to finish; current phase is [%s]", namespace, name, phase);
Thread.sleep(3 * 1000);

Review comment from a Member:

Is there a better alternative here? Does K8s support listening/watching for pod lifecycle events?
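For reference, if the fabric8 client is in use (an assumption here), it can block until a pod reaches a terminal phase instead of sleep-polling; a hedged sketch:

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.concurrent.TimeUnit;

public final class PeonPodWaiter
{
  private PeonPodWaiter() {}

  // Sketch only: block until the pod reaches a terminal phase using the client's
  // built-in condition waiting rather than a sleep/poll loop.
  public static Pod waitForTerminalPhase(KubernetesClient client, String namespace, String name)
  {
    return client.pods()
        .inNamespace(namespace)
        .withName(name)
        .waitUntilCondition(
            pod -> pod != null
                && pod.getStatus() != null
                && ("Succeeded".equals(pod.getStatus().getPhase())
                    || "Failed".equals(pod.getStatus().getPhase())),
            6, TimeUnit.HOURS // illustrative timeout
        );
  }
}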

@harinirajendran (Contributor)

@zhangyue19921010 are there any plans to merge this PR to master soon?

@marcospassos

Same issue here. Looking forward to this PR getting merged.

@churromorales (Contributor)

This is awesome, but I see that the parallel index task now uses shuffle, which relies on the MiddleManager for more than just launching and managing tasks.

I don't know enough about the new feature, but I saw there was a deep-storage implementation for shuffle? Will this work without the MM? If so, I am happy to take a stab at using this PR as a base and trying to remove the MM entirely in a K8s world.

@abhishekagarwal87 (Contributor)

@zhangyue19921010 are you still working on this PR? Happy to help with the reviewing if needed.

@churromorales (Contributor) commented Aug 12, 2022


@abhishekagarwal87
I am happy to pick this up. I have reviewed the code and there are a few things missing.

  1. Let's make this work with K8s Jobs, not Pods, and let K8s manage lifecycles rather than the MiddleManager. Also simplify the configuration; from a user perspective I am thinking of a single configuration option to enable this feature, e.g. druid.peon.k8s.tasks=true (a rough Job sketch follows below).
  2. Make the task itself push reports.json; right now it seems to be ignored.
  3. There is no checkpointing for K8s tasks. We need the tasks themselves to be able to push checkpoints and recover from them.

Just to note: making this patch work the same way Druid does today requires core changes; it can no longer be isolated to just an extension. If that seems like a good plan, I am happy to finish up this work and try to get it into Druid. Let me know your thoughts.
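For what it's worth, a rough sketch of what launching the peon as a K8s Job could look like, assuming the fabric8 client (5.x-style API); the image, namespace, TTL, and any wiring to a druid.peon.k8s.tasks flag are illustrative rather than a final design:

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.List;

// Sketch only: a Job gives K8s ownership of retries and cleanup instead of the MiddleManager.
public class PeonJobLauncherSketch
{
  public Job launchPeonJob(String taskId, List<String> peonCommand)
  {
    Job job = new JobBuilder()
        .withNewMetadata()
          .withName("peon-" + taskId)                 // a real implementation would sanitize the task id
        .endMetadata()
        .withNewSpec()
          .withBackoffLimit(0)                        // do not let K8s silently retry a failed task
          .withTtlSecondsAfterFinished(3600)          // let K8s garbage-collect finished Jobs
          .withNewTemplate()
            .withNewSpec()
              .withRestartPolicy("Never")
              .addNewContainer()
                .withName("peon")
                .withImage("example/druid:latest")    // hypothetical image
                .withCommand(peonCommand)
              .endContainer()
            .endSpec()
          .endTemplate()
        .endSpec()
        .build();

    try (KubernetesClient client = new DefaultKubernetesClient()) {
      return client.batch().v1().jobs().inNamespace("default").create(job);
    }
  }
}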

@abhishekagarwal87 (Contributor)

Thank you for taking this up @churromorales. It would be nice to keep this as an extension. You can certainly add extension points in the core if you need them, or some code can go into the core itself. In any case, that's not a blocker for starting the implementation.

@churromorales (Contributor) commented Sep 29, 2022


@abhishekagarwal87 sorry it took a while: #13156
I have tested this on our Druid clusters for various tasks: ingestion, parallel indexing, compaction, etc. I used the operator to deploy and removed the MiddleManager from the deployment spec altogether. The code is totally different from this patch and should be feature-complete relative to the way Druid works today with the current task runner implementations.

Let me know what you think.

@abhishekagarwal87 (Contributor)

@churromorales - This is really great. I will take a look at your PR very soon.

@abhishekagarwal87 (Contributor)

Closing this PR since the feature is merged in another PR.
