Motivation
A user may have access to multiple clusters, and the clusters can vary significantly in availability, cost, and congestion. For example, an expensive, powerful cluster with V100 GPUs may always be busy, while a cheap K80 cluster sits mostly idle. It is reasonable to balance workloads (jobs) onto the most appropriate cluster to reduce total waiting time or cost. Since different customers have diverse needs, the policies that direct the load balancing must be highly customizable. (Note: only waiting jobs can be balanced; running jobs will not be touched.)
Scenario A: The user has one on-prem cluster (limited resources) and one on-cloud cluster (scalable). If the on-prem cluster is too congested (too many long-waiting jobs), some of the long-waiting jobs may be balanced to the on-cloud cluster to reduce the total waiting time.
Scenario A+ (generalization of A): The user has access to multiple clusters. If some clusters are too congested (too many long-waiting jobs), some of the long-waiting jobs can be balanced to other (free) clusters to reduce the total waiting time.
Proposal - multi-cluster load balancing service
Design goal: provide a mechanism (service) that empowers users to take full advantage of multiple clusters automatically, thus meeting users' custom requirements (e.g. reduce completion time, reduce waiting time, reduce cost).
The service provides the following functions to users:
monitor the congestion of multiple clusters, and filter the balanceable jobs per the job-selection policies (e.g. jobs that have been waiting longer than a given time in a cluster)
for each selected job, find the most appropriate target cluster according to the cluster-selection policies (e.g. available and most powerful, or cheapest) and clone the job from the current cluster to the target cluster
after cloning the job to the target cluster, either stop the original job immediately or let the old/new jobs compete (until one of them starts), according to the post-action policies (a minimal control-flow sketch follows this list)
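To make the flow above concrete, here is a minimal sketch of the balancing loop in Python. All names here (Job, Cluster, the three policy callables) are illustrative assumptions for this sketch, not an existing API; a real service would query each cluster's REST interface instead of working with in-memory objects.

```python
"""Minimal sketch of the balancing loop: select jobs, pick a target, clone, post-act."""
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    state: str            # e.g. "WAITING", "RUNNING"
    submit_time: float    # epoch seconds
    from_template: bool   # balanceable jobs are defined by a template


@dataclass
class Cluster:
    name: str
    free_gpus: int
    jobs: List[Job] = field(default_factory=list)

    def waiting_jobs(self) -> List[Job]:
        return [j for j in self.jobs if j.state == "WAITING"]

    def clone_job(self, job: Job) -> Job:
        clone = Job(job.name + "-clone", "WAITING", time.time(), job.from_template)
        self.jobs.append(clone)
        return clone

    def stop_job(self, job: Job) -> None:
        job.state = "STOPPED"


# --- customizable policies ---------------------------------------------------

def job_selection(job: Job, now: float, threshold_sec: float = 1800) -> bool:
    """Job-selection policy: balanceable jobs waiting longer than the threshold."""
    return job.from_template and (now - job.submit_time) > threshold_sec


def cluster_selection(clusters: List[Cluster], source: Cluster) -> Optional[Cluster]:
    """Cluster-selection policy: pick the target with the most free GPUs
    (an alternative policy could pick the cheapest cluster)."""
    candidates = [c for c in clusters if c is not source and c.free_gpus > 0]
    return max(candidates, key=lambda c: c.free_gpus, default=None)


def post_action(source: Cluster, original: Job, clone: Job) -> None:
    """Post-action policy: stop the original immediately
    (an alternative policy lets both copies compete until one starts)."""
    source.stop_job(original)


def balance_once(clusters: List[Cluster]) -> None:
    """One pass over all clusters: select, clone to the target, apply the post action."""
    now = time.time()
    for source in clusters:
        for job in source.waiting_jobs():
            if not job_selection(job, now):
                continue
            target = cluster_selection(clusters, source)
            if target is None:
                continue
            clone = target.clone_job(job)
            post_action(source, job, clone)
```

In this sketch each policy is a plain callable, so swapping in a custom rule (per user/admin) only means replacing one function.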
Notes:
All the policies mentioned above would be customizable by users (admins); there may also be other policies, such as the following (an illustrative configuration sketch follows these notes):
historical policies to set up historical (or stateful) constraints for a job (e.g. the maximum number of migrations for a job, or the maximum number of clones allowed at the same time)
policies are defined when the service starts; dynamic changes are not supported at first, and per-job policy settings could be an option for further discussion
To support the above policies, the scheduling service may communicate with each cluster for:
cluster status, such as the available SKU types and the number of SKUs in the VC, ...
job status, such as the job state, submission time, job tags, ...
(future) job execution statistics and estimates for advanced scheduling, e.g. estimated data throughput, estimated job execution time, ...
A job is balanceable if it is in the WAITING state and is defined by a template (and cluster-specific context). The format of the template and context will be covered in a separate issue.
The balancing attempts (history) will be tracked in a database or by job tagging (one possible tag shape is shown in the sketch below).
This service is backend only. A web portal to view and manage multiple clusters, nodes, and jobs will be covered in a separate issue.
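As a companion to the notes above, the sketch below shows one possible shape for the policy settings loaded at service start, and the kind of job tags that could trace balancing history. The field names and tag keys are hypothetical placeholders, not a format defined by this proposal (the template/context format itself is deferred to a separate issue).

```python
# Purely illustrative: one possible shape for the policy settings loaded when
# the service starts (no dynamic changes at first), plus the kind of job tags
# that could trace balancing history. Field names and tag keys are hypothetical.
import json
import time

POLICY_CONFIG = {
    "job_selection": {
        "min_waiting_minutes": 30,     # only long-waiting jobs are considered
        "states": ["WAITING"],         # running jobs are never touched
    },
    "cluster_selection": {
        "strategy": "most_free_gpus",  # or "cheapest", ...
    },
    "post_action": "stop_original",    # or "compete_until_one_starts"
    "historical": {
        "max_migrations_per_job": 2,   # stateful constraints
        "max_concurrent_clones": 1,
    },
}


def balancing_tags(source_cluster: str, target_cluster: str, attempt: int) -> dict:
    """Tags attached to a cloned job to trace balancing attempts
    (the alternative mentioned above is to record them in a database)."""
    return {
        "balanced-from": source_cluster,
        "balanced-to": target_cluster,
        "balance-attempt": str(attempt),
        "balanced-at": str(int(time.time())),
    }


if __name__ == "__main__":
    print(json.dumps(POLICY_CONFIG, indent=2))
    print(balancing_tags("onprem-k80", "cloud-v100", attempt=1))
```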