
A new Druid MiddleManager Resource Scheduling Model Based on K8s (MOK) #10910

Conversation

@zhangyue19921010 (Contributor) commented Feb 20, 2021

Fixes #10824.

Description

Please read #10824 for details.

This PR adds a new extension named druid-kubernetes-middlemanager-extensions in extensions-contrib, so there is no impact on Druid core. It has already been tested on a DEV Druid cluster running on K8s, covering index_kafka, index, index_parallel, and compact tasks.

Now we can use a single 2-core, 2Gi-memory MiddleManager pod to control dozens or even hundreds of Peon pods, and there is no need for the MiddleManager to reserve a lot of resources in advance.

Also, different kinds of tasks can now use different configurations, including CPU and memory resources.

As for autoscaling: there may be no need for a MiddleManager autoscaler in this scenario, because the resources occupied by the MiddleManager are small enough. Combined with #10524, Druid has the (tested) ability to create Peon pods and automatically scale the number of pods.
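For illustration only (this is not the PR's actual implementation), here is a minimal Java sketch of how a runner might resolve per-task-type pod resources. The base property names druid.peon.pod.cpu and druid.peon.pod.memory appear in the diff discussed later in this thread; the per-task-type suffix convention shown here is hypothetical.

import java.util.Properties;

// Illustration only: resolve a pod resource setting for a task type, falling back to a
// global default. The base property names mirror constants visible in the diff below;
// the per-task-type suffix (e.g. druid.peon.pod.memory.index_parallel) is hypothetical.
public final class PeonPodResourceResolver
{
  private PeonPodResourceResolver() {}

  public static String resolve(Properties props, String baseKey, String taskType, String defaultValue)
  {
    String perTaskType = props.getProperty(baseKey + "." + taskType);
    return perTaskType != null ? perTaskType : props.getProperty(baseKey, defaultValue);
  }

  public static void main(String[] args)
  {
    Properties props = new Properties();
    props.setProperty("druid.peon.pod.memory", "2Gi");
    props.setProperty("druid.peon.pod.memory.index_parallel", "6Gi");
    // Prints 6Gi for index_parallel; any other task type would fall back to 2Gi.
    System.out.println(resolve(props, "druid.peon.pod.memory", "index_parallel", "1Gi"));
  }
}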

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zhangyue19921010 (Contributor, Author)

All the CI jobs have passed now, except https://travis-ci.com/github/apache/druid/jobs/484742979, which failed due to a network issue. A retry will probably succeed.

@zhangyue19921010 (Contributor, Author)

Will add documentation for this feature soon, explaining how to enable it.

@zhangyue19921010 (Contributor, Author) commented Feb 26, 2021

Modified the current K8s-related IT job (job 76). Now this job:

  1. Builds a Druid cluster on K8s.
  2. Has no ZooKeeper dependency.
  3. Has the MiddleManager launch Peon pods to ingest data.

Also, CI passes now, which means this PR is ready for review!

Users can also run the following commands in their local environment to try this new feature:

export CONFIG_FILE='k8s_run_config_file.json'
export IT_TEST='-Dit.test=ITNestedQueryPushDownTest'
export POD_NAME=int-test
export POD_NAMESPACE=default
export BUILD_DRUID_CLUSTER=true
export MAVEN_SKIP="-Pskip-static-checks -Ddruid.console.skip=true -Dmaven.javadoc.skip=true"
mvn -B clean install -q -ff -Pskip-static-checks -Ddruid.console.skip=true -Dmaven.javadoc.skip=true -Pskip-tests -T1C
mvn verify -pl integration-tests -P int-tests-config-file ${IT_TEST} ${MAVEN_SKIP} -Dpod.name=${POD_NAME} -Dpod.namespace=${POD_NAMESPACE} -Dbuild.druid.cluster=${BUILD_DRUID_CLUSTER}

@glasser (Contributor) commented Feb 26, 2021

We'd love to be able to run peons as k8s pods.

It does kind of raise the question of why we would want middlemanagers at all rather than just letting the overlord create k8s pods... But if it's a lot easier to build this way, then that's reasonable.

@zhangyue19921010 (Contributor, Author)


Hi @glasser,
Letting the Overlord create K8s pods is a huge change, touching the API, the task scheduling model, etc.
Letting the MiddleManager create Peon pods and control their lifecycle is much easier. Maybe we can let the Overlord create pods directly and disable the MM in the future.

@gianm (Contributor) commented Mar 2, 2021

> Letting the Overlord create K8s pods is a huge change, touching the API, the task scheduling model, etc.

That's too bad, since one of the original goals of the TaskRunner interface is that it could be used to allow the Overlord to launch tasks directly as YARN applications. (At the time, YARN was more popular than K8S 🙂)

I guess it didn't live up to this goal, if you found it easier to have the MMs launch K8S pods.

Anyway, I'm not a K8S expert, but this seems like a very interesting idea.

@zhangyue19921010 (Contributor, Author) commented Mar 3, 2021

Hi @gianm, thanks for your attention. Maybe I didn't make my point clear.
In my opinion, I just follow the current workflow: Overlord -> MiddleManager -> Peon.
The difference between ForkingTaskRunner and the new K8sForkingTaskRunner (something like ThreadingTaskRunner for CliIndexer) is that the MiddleManager launches the Peon as a pod in Kubernetes rather than as a child process on the same machine.

Because the workflow stays the same, we don't need to consider changes to the API, task lifecycle control, or anything else. That's why I believe it's easier to have the MMs launch K8s pods :)
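As a rough illustration of that difference (not the code in this PR, and assuming the fabric8 Kubernetes client with its 5.x-style API): instead of building a ProcessBuilder for a local child process, the runner builds a Pod spec carrying the same peon command and submits it to the API server. The image name and namespace below are placeholders; the label key is taken from the diff later in this thread.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.List;

// Sketch only: launch a peon as a K8s pod instead of a local child process.
// The client library, image, and namespace are assumptions made for illustration.
public class PeonPodLauncherSketch
{
  public Pod launchPeonPod(String taskId, List<String> peonCommand)
  {
    // A real implementation would sanitize the task id before using it in pod names/labels.
    Pod pod = new PodBuilder()
        .withNewMetadata()
          .withName("peon-" + taskId)
          .addToLabels("druid.ingest.task.id", taskId) // label key taken from the diff below
        .endMetadata()
        .withNewSpec()
          .withRestartPolicy("Never")
          .addNewContainer()
            .withName("peon")
            .withImage("example/druid:latest")         // hypothetical image
            .withCommand(peonCommand)                  // the same "internal peon ..." command a ForkingTaskRunner would build
          .endContainer()
        .endSpec()
        .build();

    try (KubernetesClient client = new DefaultKubernetesClient()) {
      return client.pods().inNamespace("default").create(pod);
    }
  }
}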

@nishantmonu51 (Member)


I think it's possible, and a valid use case, to use ForkingTaskRunner/RemoteTaskRunner/K8sForkingTaskRunner directly on the Overlord; it's just a configuration change on the Overlord, unless the K8sForkingTaskRunner depends on something specific from the MiddleManager.

@nishantmonu51 (Member)

@zhangyue19921010: Can you add some user docs on how to use the new runner and the configurations required to get started?

@zhangyue19921010 (Contributor, Author)


Sure, will add docs ASAP.

private static final String DRUID_PEON_JAVA_OPTS = "druid.peon.javaOpts";
private static final String DRUID_PEON_POD_MEMORY = "druid.peon.pod.memory";
private static final String DRUID_PEON_POD_CPU = "druid.peon.pod.cpu";
private static final String LABEL_KEY = "druid.ingest.task.id";

Review comment from a Member:

The task runner seems to hardcode most of the parameters/configs it uses. Is it possible to use something like Helm charts and templates? The goal would be for a DevOps person to maintain those templates, while on the Druid side the task runner would use those templates to create a Job.

":", ""),
".", ""),
"-", ""));
String label_value;

Review comment from a Member:

Nit: refactor and extract this into a sanitizeName method.
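A minimal sketch of what the suggested sanitizeName extraction might look like; the replacement rules simply mirror the chained calls visible in the fragment above and may not match the PR exactly.

public final class LabelNames
{
  private LabelNames() {}

  // Strip the characters that the chained replacements in the fragment above remove;
  // this is illustrative and may not match the PR's exact sanitization rules.
  static String sanitizeName(String name)
  {
    return name
        .replace(":", "")
        .replace(".", "")
        .replace("-", "");
  }
}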

command.add(StringUtils.format("-Ddruid.tlsPort=%d", tlsChildPort));

// Let tasks know where they are running on.
// This information is used in native parallel indexing with shuffle.

Review comment from a Member:

Curious: how does shuffle work with the K8s task runner? IIRC the shuffle implementation relies on the MM being available to serve data processed by a task even after the task finishes. In this case, since the task pod is deleted after the task finishes, it won't work with shuffle enabled. This expected behavior needs to be verified and documented.


Reply from a Contributor:

Yes, this is exactly the question I have. In a K8s world, we would prefer not to have PVCs for the task or keep any state. Do you know if the deep-storage shuffle implementation could work here? I don't know enough about what has been done recently to have an opinion. I would think the support would need to be there, or native batch wouldn't work, which would be concerning.


while (!phase.equalsIgnoreCase("Failed") && !phase.equalsIgnoreCase("Succeeded")) {
LOGGER.info("Still waiting for peon pod [%s/%s] to finish; current phase is [%s]", namespace, name, phase);
Thread.sleep(3 * 1000);

Review comment from a Member:

Is there a better alternative here? Does K8s support listening/watching for pod lifecycle events?
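For reference, if the fabric8 client is in use (an assumption here), it can block until a pod reaches a terminal phase instead of sleep-polling; a hedged sketch:

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.concurrent.TimeUnit;

public final class PeonPodWaiter
{
  private PeonPodWaiter() {}

  // Sketch only: block until the pod reaches a terminal phase using the client's
  // built-in condition waiting rather than a sleep/poll loop.
  public static Pod waitForTerminalPhase(KubernetesClient client, String namespace, String name)
  {
    return client.pods()
        .inNamespace(namespace)
        .withName(name)
        .waitUntilCondition(
            pod -> pod != null
                && pod.getStatus() != null
                && ("Succeeded".equals(pod.getStatus().getPhase())
                    || "Failed".equals(pod.getStatus().getPhase())),
            6, TimeUnit.HOURS // illustrative timeout
        );
  }
}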

@harinirajendran (Contributor)

@zhangyue19921010 are there any plans to merge this PR to master soon?

@marcospassos

Same issue here. Looking forward to this PR getting merged.

@churromorales (Contributor)

This is awesome, but I see that the parallel index task now uses shuffle, which relies on the MiddleManager for more than just launching and managing tasks.

I don't know enough about the new feature, but I saw there was a deep-storage implementation for shuffle? Will this work without the MM? If so, I am happy to take a stab at using this PR as a base and trying to remove the MM entirely in a K8s world.

@abhishekagarwal87 (Contributor)

@zhangyue19921010 are you still working on this PR? Happy to help with the reviewing if needed.

@churromorales (Contributor) commented Aug 12, 2022


@abhishekagarwal87
I am happy to pick this up. I have reviewed the code and there are a few things missing.

  1. Let's make this work with K8s Jobs, not Pods, and let K8s manage lifecycles rather than the MiddleManager. Also simplify the configuration; from a user perspective I am thinking of a single configuration option to enable this feature, e.g. druid.peon.k8s.tasks=true (a rough Job sketch follows below).
  2. Make the task itself push reports.json; right now it seems to be ignored.
  3. There is no checkpointing for K8s tasks. We need the tasks themselves to be able to push checkpoints and recover from them.

Just to note: making this patch work the same way Druid does today requires core changes; it can no longer be isolated to just an extension. If that seems like a good plan, I am happy to finish up this work and try to get it into Druid. Let me know your thoughts.
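For what it's worth, a rough sketch of what launching the peon as a K8s Job could look like, assuming the fabric8 client (5.x-style API); the image, namespace, TTL, and any wiring to a druid.peon.k8s.tasks flag are illustrative rather than a final design:

import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.List;

// Sketch only: a Job gives K8s ownership of retries and cleanup instead of the MiddleManager.
public class PeonJobLauncherSketch
{
  public Job launchPeonJob(String taskId, List<String> peonCommand)
  {
    Job job = new JobBuilder()
        .withNewMetadata()
          .withName("peon-" + taskId)                 // a real implementation would sanitize the task id
        .endMetadata()
        .withNewSpec()
          .withBackoffLimit(0)                        // do not let K8s silently retry a failed task
          .withTtlSecondsAfterFinished(3600)          // let K8s garbage-collect finished Jobs
          .withNewTemplate()
            .withNewSpec()
              .withRestartPolicy("Never")
              .addNewContainer()
                .withName("peon")
                .withImage("example/druid:latest")    // hypothetical image
                .withCommand(peonCommand)
              .endContainer()
            .endSpec()
          .endTemplate()
        .endSpec()
        .build();

    try (KubernetesClient client = new DefaultKubernetesClient()) {
      return client.batch().v1().jobs().inNamespace("default").create(job);
    }
  }
}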

@abhishekagarwal87 (Contributor)

Thank you for taking this up @churromorales. It would be nice to keep this as an extension. You can certainly add extension points in the core if you need them, or some code can go into the core itself. In any case, that's not a blocker for starting the implementation.

@churromorales (Contributor) commented Sep 29, 2022


@abhishekagarwal87 sorry it took a while: #13156
I have tested this on our Druid clusters for various tasks: ingestion, parallel indexing, compaction, etc. I used the operator to deploy and removed the MiddleManager from the deployment spec altogether. The code is totally different from this patch and should be feature-complete relative to the way Druid works today with the current task runner implementations.

Let me know what you think.

@abhishekagarwal87 (Contributor)

@churromorales - This is really great. I will take a look at your PR very soon.

@abhishekagarwal87 (Contributor)

Closing this PR since the feature is merged in another PR.
