A new Druid MiddleManager Resource Scheduling Model Based on K8s (MOK) #10910
Conversation
All the CI jobs pass now, except https://travis-ci.com/github/apache/druid/jobs/484742979, which failed due to a network issue. A retry will probably succeed.
Will add documentation for this feature soon to explain how to enable it.
Modified the current K8s-related IT job (76); now this job is:
Also, CI passes now, which means this PR is ready for review! Users can also run these commands in their local environment to try this new feature.
We'd love to be able to run peons as k8s pods. It does kind of raise the question of why we would want middlemanagers at all rather than just letting the overlord create k8s pods... But if it's a lot easier to build this way, then that's reasonable.
Hi @glasser
That's too bad, since one of the original goals of the TaskRunner interface is that it could be used to allow the Overlord to launch tasks directly as YARN applications. (At the time, YARN was more popular than K8S 🙂) I guess it didn't live up to this goal, if you found it easier to have the MMs launch K8S pods. Anyway, I'm not a K8S expert, but this seems like a very interesting idea.
Hi @gianm, thanks for your attention. Maybe I didn't make my point clear. Because the workflow is the same as the one mentioned above, we don't need to consider potential API, task-lifecycle, or other changes. That's why I believe it's easier to have the MMs launch K8S pods :)
I think it's possible and a valid use case to use ForkingTaskRunner/RemoteTaskRunner/K8sForkingTaskRunner directly on the Overlord; it's just a configuration change on the Overlord, unless the K8sForkingTaskRunner depends on something specific from the MiddleManager.
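For illustration only, the configuration change mentioned above might look roughly like this in the Overlord's runtime.properties. The runner type value "k8sForking" is a hypothetical binding name, not something defined by this PR; druid.indexer.runner.type and druid.extensions.loadList are standard Druid properties.

```properties
# Hypothetical sketch: run the K8s forking runner directly on the Overlord.
# "k8sForking" is a placeholder binding name, not defined by this PR.
druid.extensions.loadList=["druid-kubernetes-middlemanager-extensions"]
druid.indexer.runner.type=k8sForking
```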
@zhangyue19921010: Can you add some user docs on how to use the new runner and the configurations required to get started?
Sure, will add docs ASAP.
private static final String DRUID_PEON_JAVA_OPTS = "druid.peon.javaOpts";
private static final String DRUID_PEON_POD_MEMORY = "druid.peon.pod.memory";
private static final String DRUID_PEON_POD_CPU = "druid.peon.pod.cpu";
private static final String LABEL_KEY = "druid.ingest.task.id";
The task runner seems to be hardcoding most of the parameters/configs it uses. Is it possible to use something like Helm charts and templates? The goal is for the devOps person to maintain those templates, and from the Druid side the task runner would make use of those templates to create a Job.
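A minimal sketch of what that suggestion could look like, assuming a fabric8 Kubernetes client is on the classpath (the PR's actual client library, the PeonPodTemplateLoader class, and the template path are illustrative assumptions, not from the PR). The runner would only stamp per-task fields onto a devOps-maintained template instead of hardcoding resources:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.utils.Serialization;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PeonPodTemplateLoader
{
  /**
   * Reads a devOps-maintained pod template from disk and sets only the
   * per-task label; image, resources, node selectors, etc. stay in the template.
   * Assumes the template already declares a metadata.labels map.
   */
  public static Pod loadTemplate(String templatePath, String taskId) throws IOException
  {
    try (InputStream in = new FileInputStream(templatePath)) {
      Pod pod = Serialization.unmarshal(in, Pod.class);
      pod.getMetadata().getLabels().put("druid.ingest.task.id", taskId);
      return pod;
    }
  }
}
```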
":", ""), | ||
".", ""), | ||
"-", "")); | ||
String label_value; |
nit: refactor and extract into a sanitizeName method
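One possible shape for that extraction, shown as a sketch only. It mirrors the chained replace calls visible in the diff and assumes Druid's StringUtils (org.apache.druid.java.util.common.StringUtils, already used nearby for StringUtils.format); the wrapper class name is hypothetical:

```java
import org.apache.druid.java.util.common.StringUtils;

public class K8sNames
{
  private K8sNames() {}

  /**
   * Strips ':', '.' and '-' from a task id before it is used in a pod
   * name/label value, mirroring the inlined replace chain in the diff.
   */
  public static String sanitizeName(String taskId)
  {
    return StringUtils.replace(
        StringUtils.replace(
            StringUtils.replace(taskId, ":", ""),
            ".", ""),
        "-", "");
  }
}
```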
command.add(StringUtils.format("-Ddruid.tlsPort=%d", tlsChildPort));

// Let tasks know where they are running.
// This information is used in native parallel indexing with shuffle.
Curious: how does shuffle work with K8sTaskRunner? IIRC the shuffle implementation relies on the MM being available to serve data processed by a task even after the task finishes. In this case, since the task pod is deleted after the task finishes, it won't work with shuffle enabled. This expected behavior needs to be verified and documented.
Yes, this is the exact question I have. In a k8s world, we would prefer not to have PVCs for the task or keep any state. Do you know if the deep storage shuffle implementation could work here? I don't know enough about what has been done recently to have an opinion. I would think the support would need to be there, or native batch wouldn't work, which would be concerning.
while (!phase.equalsIgnoreCase("Failed") && !phase.equalsIgnoreCase("Succeeded")) {
  LOGGER.info("Still waiting for peon pod [%s/%s] to finish, current status is [%s]", namespace, name, phase);
  Thread.sleep(3 * 1000);
Is there a better alternative here? Does k8s support listening/watching for pod lifecycle events?
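Yes in principle: the Kubernetes API supports watches on pod lifecycle events. As a sketch only, assuming the runner holds a fabric8 KubernetesClient (whether this PR uses fabric8 is an assumption), the sleep/poll loop above could be replaced with a condition wait, which is watch-based under the hood; the class and method names here are illustrative:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.concurrent.TimeUnit;

public class PeonPodWaiter
{
  /** Blocks until the peon pod reaches a terminal phase, or the timeout expires. */
  public static Pod waitForCompletion(KubernetesClient client, String namespace, String name, long timeoutMinutes)
  {
    return client.pods()
                 .inNamespace(namespace)
                 .withName(name)
                 .waitUntilCondition(
                     pod -> pod != null
                            && pod.getStatus() != null
                            && ("Failed".equalsIgnoreCase(pod.getStatus().getPhase())
                                || "Succeeded".equalsIgnoreCase(pod.getStatus().getPhase())),
                     timeoutMinutes,
                     TimeUnit.MINUTES
                 );
  }
}
```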
@zhangyue19921010 are there any plans to merge this PR to master soon?
Same issue here. Looking forward to this PR getting merged.
This is awesome, but I see that the parallel index task now uses shuffle, which relies on the MiddleManager for something more than just launching and managing tasks. I don't know enough about the new feature, but I saw there was a deep storage implementation for shuffle? Will this work without the MM? If so, I am happy to take a stab at using this PR as a base and trying to totally remove the MM in a k8s world.
@zhangyue19921010 are you still working on this PR? Happy to help with the reviewing if needed.
@abhishekagarwal87
Just to note, this patch to make it work the same as Druid does today requires core changes. It can no longer be isolated to just an extension. If that seems like a good plan, I am happy to finish up this work and try to get it into Druid. Let me know your thoughts.
Thank you for taking this up @churromorales. It will be nice to keep this as an extension. You can certainly add extension points in the core if you need them. Or some code can go into the core itself. In any case, that's not a blocker for starting the implementation.
@abhishekagarwal87 sorry it took a while: #13156. Let me know what you think.
@churromorales - This is really great. I will take a look at your PR very soon.
Closing this PR since the feature is merged in another PR. |
Fixes #10824.
Description
Please read #10824 for details.
This PR adds a new extension named druid-kubernetes-middlemanager-extensions in extensions-contrib, which means there is no harm to the Druid core. It has already been tested on a DEV Druid cluster running on K8s, covering index_kafka, index, index_parallel, and compact tasks. Now we can use a single 2-core, 2Gi-memory MiddleManager pod to control dozens or even hundreds of peon pods, and there is no need to let the MiddleManager take up a lot of resources in advance.
Also, different kinds of tasks can now use different configs, including CPU and memory resources.
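For illustration, a MiddleManager runtime.properties wiring this up might look like the following, using the property keys visible in the diff above; the values are placeholders, not defaults documented by this PR:

```properties
# Load the new extension on the MiddleManager (extension name from this PR).
druid.extensions.loadList=["druid-kubernetes-middlemanager-extensions"]

# Per-peon pod sizing, using the keys defined in the diff above.
# The values below are illustrative placeholders only.
druid.peon.javaOpts=-Xms1g -Xmx1g
druid.peon.pod.cpu=1
druid.peon.pod.memory=2G
```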
As for autoscaling: maybe there is no need for a MiddleManager autoscaler in this scenario, because the resources occupied by the MiddleManager are small enough. Combined with #10524, Druid has the (tested) ability to create peon pods and auto-scale the pod numbers!
This PR has: