Motivation
A user may have access to multiple clusters, and the clusters can vary significantly in availability, cost, and congestion. For example, an expensive, powerful cluster with V100 GPUs may always be busy, while a cheap K80 cluster sits mostly idle. It is reasonable to balance workloads (jobs) onto the most appropriate cluster to reduce total waiting time or cost. Since different customers have diverse needs, the policies that direct the load balancing must be highly customizable. (Note: only waiting jobs can be balanced; running jobs will not be touched.)
Scenario A: The user has one on-prem cluster (limited resources) and one on-cloud cluster (scalable). If the on-prem cluster is too congested (too many long-waiting jobs), some of the long-waiting jobs may be balanced to the on-cloud cluster to reduce the total waiting time.
Scenario A+ (generalization of A): The user has access to multiple clusters. If some clusters are too congested (too many long-waiting jobs), some of the long-waiting jobs can be balanced to other (free) clusters to reduce the total waiting time.
Proposal - multi-cluster load balancing service
Design goal: provide a mechanism (service) that empowers users to take full advantage of multiple clusters automatically, thus meeting users' custom requirements (e.g. reduce completion time, reduce waiting time, reduce cost).
The service provides the following functions to users:
monitor the congestion of multiple clusters, and filter the balanceable jobs per the job-selection policies (e.g. jobs that have been waiting longer than a given time in a cluster)
for each selected job, find the most appropriate target cluster according to the cluster-selection policies (e.g. available and most powerful, or cheapest) and clone the job from the current cluster to the target cluster
after cloning the job to the target cluster, either stop the original job immediately or let the old/new jobs compete (until one of them starts), according to the post-action policies (a minimal control-flow sketch follows this list)
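To make the flow above concrete, here is a minimal sketch of the balancing loop in Python. All names here (Job, Cluster, the three policy callables) are illustrative assumptions for this sketch, not an existing API; a real service would query each cluster's REST interface instead of working with in-memory objects.

```python
"""Minimal sketch of the balancing loop: select jobs, pick a target, clone, post-act."""
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    state: str            # e.g. "WAITING", "RUNNING"
    submit_time: float    # epoch seconds
    from_template: bool   # balanceable jobs are defined by a template


@dataclass
class Cluster:
    name: str
    free_gpus: int
    jobs: List[Job] = field(default_factory=list)

    def waiting_jobs(self) -> List[Job]:
        return [j for j in self.jobs if j.state == "WAITING"]

    def clone_job(self, job: Job) -> Job:
        clone = Job(job.name + "-clone", "WAITING", time.time(), job.from_template)
        self.jobs.append(clone)
        return clone

    def stop_job(self, job: Job) -> None:
        job.state = "STOPPED"


# --- customizable policies ---------------------------------------------------

def job_selection(job: Job, now: float, threshold_sec: float = 1800) -> bool:
    """Job-selection policy: balanceable jobs waiting longer than the threshold."""
    return job.from_template and (now - job.submit_time) > threshold_sec


def cluster_selection(clusters: List[Cluster], source: Cluster) -> Optional[Cluster]:
    """Cluster-selection policy: pick the target with the most free GPUs
    (an alternative policy could pick the cheapest cluster)."""
    candidates = [c for c in clusters if c is not source and c.free_gpus > 0]
    return max(candidates, key=lambda c: c.free_gpus, default=None)


def post_action(source: Cluster, original: Job, clone: Job) -> None:
    """Post-action policy: stop the original immediately
    (an alternative policy lets both copies compete until one starts)."""
    source.stop_job(original)


def balance_once(clusters: List[Cluster]) -> None:
    """One pass over all clusters: select, clone to the target, apply the post action."""
    now = time.time()
    for source in clusters:
        for job in source.waiting_jobs():
            if not job_selection(job, now):
                continue
            target = cluster_selection(clusters, source)
            if target is None:
                continue
            clone = target.clone_job(job)
            post_action(source, job, clone)
```

In this sketch each policy is a plain callable, so swapping in a custom rule (per user/admin) only means replacing one function.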
Notes:
All the policies mentioned above would be customizable by users (admins); there may also be other policies, such as the following (an illustrative configuration sketch follows these notes):
historical policies to set up historical (or stateful) constraints for a job (e.g. the maximum number of migrations for a job, or the maximum number of clones allowed at the same time)
policies are defined when the service starts; dynamic changes are not supported at first, and per-job policy settings could be an option for further discussion
To support the above policies, the scheduling service may communicate with each cluster for:
cluster status, such as the available SKU types and the number of SKUs in the VC, ...
job status, such as the job state, submission time, job tags, ...
(future) job execution statistics and estimates for advanced scheduling, e.g. estimated data throughput, estimated job execution time, ...
A job is balanceable if it is in the WAITING state and is defined by a template (and cluster-specific context). The format of the template and context will be covered in a separate issue.
The balancing attempts (history) will be tracked in a database or by job tagging (one possible tag shape is shown in the sketch below).
This service is backend only. A web portal to view and manage multiple clusters, nodes, and jobs will be covered in a separate issue.
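As a companion to the notes above, the sketch below shows one possible shape for the policy settings loaded at service start, and the kind of job tags that could trace balancing history. The field names and tag keys are hypothetical placeholders, not a format defined by this proposal (the template/context format itself is deferred to a separate issue).

```python
# Purely illustrative: one possible shape for the policy settings loaded when
# the service starts (no dynamic changes at first), plus the kind of job tags
# that could trace balancing history. Field names and tag keys are hypothetical.
import json
import time

POLICY_CONFIG = {
    "job_selection": {
        "min_waiting_minutes": 30,     # only long-waiting jobs are considered
        "states": ["WAITING"],         # running jobs are never touched
    },
    "cluster_selection": {
        "strategy": "most_free_gpus",  # or "cheapest", ...
    },
    "post_action": "stop_original",    # or "compete_until_one_starts"
    "historical": {
        "max_migrations_per_job": 2,   # stateful constraints
        "max_concurrent_clones": 1,
    },
}


def balancing_tags(source_cluster: str, target_cluster: str, attempt: int) -> dict:
    """Tags attached to a cloned job to trace balancing attempts
    (the alternative mentioned above is to record them in a database)."""
    return {
        "balanced-from": source_cluster,
        "balanced-to": target_cluster,
        "balance-attempt": str(attempt),
        "balanced-at": str(int(time.time())),
    }


if __name__ == "__main__":
    print(json.dumps(POLICY_CONFIG, indent=2))
    print(balancing_tags("onprem-k80", "cloud-v100", attempt=1))
```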