From 5839bd349bc34ee73a2bec9f079abe08042f8847 Mon Sep 17 00:00:00 2001
From: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
Date: Wed, 24 Aug 2022 17:15:50 -0400
Subject: [PATCH] Enhance and update documentation (#351)

Change-Id: I25de039eea653e95a712f9d8450f14e77f16452f
---
 CHANGELOG/CHANGELOG-0.2.md              |  21 ++++-
 README.md                               |  34 +++++--
 docs/concepts/README.md                 |  11 +--
 docs/concepts/cluster_queue.md          | 113 +++++++++++++++++++-----
 docs/concepts/local_queue.md            |  17 ++++
 docs/concepts/queue.md                  |   9 --
 docs/concepts/workload.md               |  13 ++-
 docs/setup/install.md                   |  14 +++
 docs/tasks/administer_cluster_quotas.md |   8 +-
 docs/tasks/run_jobs.md                  |   2 +
 10 files changed, 188 insertions(+), 54 deletions(-)
 create mode 100644 docs/concepts/local_queue.md
 delete mode 100644 docs/concepts/queue.md

diff --git a/CHANGELOG/CHANGELOG-0.2.md b/CHANGELOG/CHANGELOG-0.2.md
index 5634b0d8e6..ad7ad2b1f3 100644
--- a/CHANGELOG/CHANGELOG-0.2.md
+++ b/CHANGELOG/CHANGELOG-0.2.md
@@ -2,8 +2,25 @@
 
 Changes since `v0.1.0`:
 
-- Fixed bug in a BestEffortFIFO ClusterQueue where a workload might not be
-  retried after a transient error.
+### Features
 - Bumped the API version from v1alpha1 to v1alpha2. v1alpha1 is no longer supported and Queue is now named LocalQueue.
+- Add webhooks to validate and add defaults to all kueue APIs.
+- Support [codependent resources](/docs/concepts/cluster_queue.md#codepedent-resources)
+  by assigning the same flavor to codependent resources in a pod set.
+- Support [pod overhead](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/)
+  in Workload pod sets.
+- Default requests to limits if requests are not set in a Workload pod set, to
+  match internal defaulting for k8s Pods.
 - Added [prometheus metrics](/docs/reference/metrics.md) to monitor health of
   the system and the status of ClusterQueues.
+
+### Bug fixes
+
+- Prevent Workloads that don't match the ClusterQueue's namespaceSelector from
+  blocking other Workloads in a StrictFIFO ClusterQueue.
+- Fixed number of pending workloads in a BestEffortFIFO ClusterQueue.
+- Fixed bug in a BestEffortFIFO ClusterQueue where a workload might not be
+  retried after a transient error.
+- Fixed requeuing an out-of-date workload when failed to admit it.
+- Fixed bug in a BestEffortFIFO ClusterQueue where unadmissible workloads
+  were not removed from the ClusterQueue when removing the corresponding Queue.
diff --git a/README.md b/README.md
index 51f966e520..db60a047df 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,20 @@ Kueue is a set of APIs and controller for [job](docs/concepts/workload.md)
 a job should be [admitted](docs/concepts#admission) to start (as in pods can be
 created) and when it should stop (as in active pods should be deleted).
 
+## Why use Kueue
+
+Kueue is a lean controller that you can install on top of a vanilla Kubernetes
+cluster without replacing any components. It is compatible with cloud
+environments where:
+- Nodes and other compute resources can be scaled up and down.
+- Compute resources are heterogeneous (in architecture, availability, price, etc.).
+
+Kueue APIs allow you to express:
+- Quotas and policies for fair sharing among tenants.
+- Resource fungibility: if a [resource flavor](docs/concepts/cluster_queue.md#resourceflavor-object)
+  is fully utilized, run the [job](docs/concepts/workload.md) using a different
+  flavor.
+
 The main design principle for Kueue is to avoid duplicating mature functionality
 in [Kubernetes components](https://kubernetes.io/docs/concepts/overview/components/) and well-established
 third-party controllers. Autoscaling, pod-to-node scheduling and
@@ -12,14 +26,6 @@ job lifecycle management are the responsibility of cluster-autoscaler,
 kube-scheduler and kube-controller-manager, respectively.
 Advanced admission control can be delegated to controllers such as
 [gatekeeper](https://github.com/open-policy-agent/gatekeeper).
-
-Learn more by reading the design docs:
-- [bit.ly/kueue-apis](https://bit.ly/kueue-apis) (please join the [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch)
-to get access) discusses the API proposal and a high-level description of how it
-operates.
-- [bit.ly/kueue-controller-design](https://bit.ly/kueue-controller-design)
-presents the detailed design of the controller.
-
 ## Installation
 
 **Requires Kubernetes 1.22 or newer**.
@@ -52,6 +58,18 @@ Learn more about:
 - Kueue [concepts](docs/concepts).
 - Common and advanced [tasks](docs/tasks).
 
+## Architecture
+
+
+
+Learn more about the architecture of Kueue in the design docs:
+
+- [bit.ly/kueue-apis](https://bit.ly/kueue-apis) (please join the [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch)
+to get access) discusses the API proposal and a high-level description of how it
+operates.
+- [bit.ly/kueue-controller-design](https://bit.ly/kueue-controller-design)
+presents the detailed design of the controller.
+
 ## Community, discussion, contribution, and support
 
 Learn how to engage with the Kubernetes community on the [community page](http://kubernetes.io/community/).
diff --git a/docs/concepts/README.md b/docs/concepts/README.md
index 5526ce770e..0a2c9bcbff 100644
--- a/docs/concepts/README.md
+++ b/docs/concepts/README.md
@@ -10,7 +10,7 @@ abstractions that Kueue uses to represent your cluster and workloads.
 A cluster-scoped resource that governs a pool of resources, defining usage
 limits and fair sharing rules.
 
-### [Queue](queue.md)
+### [Local Queue](local_queue.md)
 
 A namespaced resource that groups closely related workloads belonging to a
 single tenant.
@@ -30,11 +30,12 @@ models, etc.
 
 ### Admission
 
-The process of admitting a workload to start (pods to be created). A workload
+The process of admitting a Workload to start (pods to be created). A Workload
 is admitted by a ClusterQueue according to the available resources and gets
-resource flavors assigned for each requested resource. Sometimes referred to
-as _workload scheduling_ or _job scheduling_ (not to be confused with
-[pod scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/)).
+resource flavors assigned for each requested resource.
+
+Sometimes referred to as _workload scheduling_ or _job scheduling_
+(not to be confused with [pod scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/)).
 
 ### [Cohort](cluster_queue.md#cohort)
 
diff --git a/docs/concepts/cluster_queue.md b/docs/concepts/cluster_queue.md
index edd0b22c34..3c612f9cee 100644
--- a/docs/concepts/cluster_queue.md
+++ b/docs/concepts/cluster_queue.md
@@ -1,8 +1,9 @@
 # Cluster Queue
 
-A `ClusterQueue` is a cluster-scoped object that governs a pool of resources
+A ClusterQueue is a cluster-scoped object that governs a pool of resources
 such as CPU, memory and hardware accelerators. A `ClusterQueue` defines:
-- The resource _flavors_ that it manages, with usage limits and order of consumption.
+- The [resource _flavors_](#resourceflavor-object) that it manages, with usage
+  limits and order of consumption.
 - Fair sharing rules across the tenants of the cluster.
 
 Only [cluster administrators](/docs/tasks#batch-administrator) should create `ClusterQueue` objects.
@@ -35,6 +36,74 @@ This ClusterQueue admits [workloads](workload.md) if and only if:
 
 You can specify the quota as a [quantity](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/quantity/).
 
+## Resources
+
+In a ClusterQueue, you can define quotas for multiple [compute resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-types)
+(cpu, memory, GPUs, etc.).
+
+For each resource, you can define quotas for multiple _flavors_. A
+flavor represents different variations of a resource. The variations can be
+defined in a [ResourceFlavor object](#resourceflavor-object).
+
+In a process called [admission](.#admission), Kueue assigns
+[Workload pod sets](workload.md#pod-sets) a flavor for each resource it requests.
+Kueue assigns the first flavor in the ClusterQueue's `.spec.resources[*].flavors`
+list that has enough unused `min` quota in the ClusterQueue or the
+ClusterQueue's [cohort](#cohort).
+
+### Codepedent resources
+
+It is possible that multiple resources are tied to the same flavors. This is
+typical for `cpu` and `memory`, where the flavors are generally tied to a
+machine family or availability guarantees.
+
+If this is the case, the resources in the ClusterQueue must list the same
+flavors in the same order. When two or more resources match their flavors,
+they are said to be codependent. During admission, for each pod set in a
+Workload, Kueue assigns the same flavor to the codependent resources that the
+pod set requests.
+
+An example of a ClusterQueue with codependent resources looks like the following:
+
+```yaml
+apiVersion: kueue.x-k8s.io/v1alpha1
+kind: ClusterQueue
+metadata:
+  name: cluster-total
+spec:
+  namespaceSelector: {}
+  resources:
+  - name: "cpu"
+    flavors:
+    - name: spot
+      quota:
+        min: 18
+    - name: on_demand
+      quota:
+        min: 9
+  - name: "memory"
+    flavors:
+    - name: spot
+      quota:
+        min: 72Gi
+    - name: on_demand
+      quota:
+        min: 36Gi
+  - name: "gpu"
+    flavors:
+    - name: vendor1
+      quota:
+        min: 10
+    - name: vendor2
+      quota:
+        min: 10
+```
+
+In the example above, `cpu` and `memory` are codependent resources, while `gpu`
+is independent.
+
+If two resources are not codependent, they must not have any flavors in common.
+
 ## Namespace selector
 
 You can limit which namespaces can have workloads admitted in the ClusterQueue
@@ -81,7 +150,7 @@ Resources in a cluster are typically not homogeneous. Resources could differ in:
 - architecture (ex: x86 vs ARM CPUs)
 - brands and models (ex: Radeon 7000 vs Nvidia A100 vs T4 GPUs)
 
-A `ResourceFlavor` is an object that represents these variations and allows you
+A ResourceFlavor is an object that represents these variations and allows you
 to associate them with node labels and taints.
 
 **Note**: If your cluster is homogeneous, you can use an [empty ResourceFlavor](#empty-resourceflavor)
@@ -102,13 +171,8 @@ taints:
   value: "true"
 ```
 
-You can use the `.metadata.name` to reference a flavor from a ClusterQueue in
-the `.spec.resources[*].flavors[*].name` field.
-
-For each resource of each [pod set](workload.md#pod-sets) in a Workload, Kueue
-assigns the first flavor in the `.spec.resources[*].flavors`
-list that has enough unused quota in the ClusterQueue or the ClusterQueue's
-[cohort](#cohort).
+You can use the `.metadata.name` to reference a ResourceFlavor from a
+ClusterQueue in the `.spec.resources[*].flavors[*].name` field.
 
 ### ResourceFlavor labels
 
@@ -132,9 +196,9 @@ steps:
    didn't specify them already.
 
    For example, for a [batch/v1.Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/),
-   Kueue adds the labels to `.spec.template.spec.nodeSelector`. This guarantees
-   that the workload Pods run on the nodes associated to the flavor that Kueue
-   decided that the workload should use.
+   Kueue adds the labels to the `.spec.template.spec.nodeSelector` field. This
+   guarantees that the workload Pods run on the nodes associated to the flavor
+   that Kueue decided that the workload should use.
 
 ### ResourceFlavor taints
 
@@ -143,8 +207,9 @@ with taints. Taints on the ResourceFlavor work similarly to
 [node taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
 
 For Kueue to admit a workload to use the ResourceFlavor, the PodSpecs in the
-workload should have a toleration for it. As opposed to ResourceFlavor labels,
-Kueue will not add tolerations for the flavor taints.
+workload should have a toleration for it. As opposed to the behavior for
+[ResourceFlavor labels](#resourceflavor-labels), Kueue will not add tolerations
+for the flavor taints.
 
 ### Empty ResourceFlavor
 
@@ -173,18 +238,18 @@ ClusterQueue.
 
 ### Flavors and borrowing semantics
 
-When borrowing, Kueue satisfies the following semantics:
+When borrowing, Kueue satisfies the following admission semantics:
 
-- When assigning flavors, Kueue goes through the list of flavors in
-  `.spec.resources[*].flavors`. For each flavor, Kueue attempts to
-  fit the workload using the min quota of the ClusterQueue or the unused
-  min quota of other ClusterQueues in the cohort, up to the max quota of the
-  ClusterQueue. If the workload doesn't fit, Kueue proceeds evaluating the next
+- When assigning flavors, Kueue goes through the list of flavors in the
+  ClusterQueue's `.spec.resources[*].flavors`. For each flavor, Kueue attempts
+  to fit a Workload's pod set using the `min` quota of the ClusterQueue or the
+  unused `min` quota of other ClusterQueues in the cohort, up to the `max` quota
+  of the ClusterQueue. If the workload doesn't fit, Kueue proceeds evaluating the next
   flavor in the list.
-- Borrowing happens per-flavor. A ClusterQueue can only borrow quota of flavors
-  it defines.
+- A ClusterQueue can only borrow quota of flavors it defines and it can only
+  borrow quota for one flavor.
 
-### Example
+### Borrowing example
 
 Assume you created the following two ClusterQueues:
 
diff --git a/docs/concepts/local_queue.md b/docs/concepts/local_queue.md
new file mode 100644
index 0000000000..de984334b3
--- /dev/null
+++ b/docs/concepts/local_queue.md
@@ -0,0 +1,17 @@
+# Local Queue
+
+A `LocalQueue` is a namespaced object that groups closely related workloads
+belonging to a single tenant. A `LocalQueue` points to one [`ClusterQueue`](cluster_queue.md)
+from which resources are allocated to run its workloads.
+
+Users submit jobs to a `LocalQueue`, instead of directly to a `ClusterQueue`.
+Tenants can discover which queues they can submit jobs to by listing the
+local queues in their namespace. The command looks similar to the following:
+
+```sh
+kubectl get -n my-namespace localqueues
+# Alternatively, use the alias `queue` or `queues`
+kubectl get -n my-namespace queues
+```
+
+`queue` and `queues` are aliases for `localqueue`.
diff --git a/docs/concepts/queue.md b/docs/concepts/queue.md
deleted file mode 100644
index d71ad26cfd..0000000000
--- a/docs/concepts/queue.md
+++ /dev/null
@@ -1,9 +0,0 @@
-# Queue
-
-A `Queue` is a namespaced object that groups closely related workloads
-belonging to a single tenant. A `Queue` points to one [`ClusterQueue`](cluster_queue.md)
-from which resources are allocated to run its workloads.
-
-Users submit jobs to a `Queue`, instead of directly to a `ClusterQueue`. This
-allows tenants to discover which queues they can submit jobs to by listing the
-queues in their namespace.
diff --git a/docs/concepts/workload.md b/docs/concepts/workload.md
index 08449d02dc..0b4168831e 100644
--- a/docs/concepts/workload.md
+++ b/docs/concepts/workload.md
@@ -23,6 +23,7 @@ metadata:
   name: sample-job
   namespace: default
 spec:
+  queueName: user-queue
   podSets:
   - count: 3
     name: main
@@ -36,9 +37,13 @@ spec:
             cpu: "1"
             memory: 200Mi
       restartPolicy: Never
-  queueName: user-queue
 ```
 
+## Queue name
+
+To indicate in which [LocalQueue](local_queue.md) you want your Workload to be
+enqueued, set the name of the LocalQueue in the `.spec.queueName` field.
+
 ## Pod sets
 
 A Workload might be composed of multiple Pods with different pod specs.
@@ -63,4 +68,8 @@ of the Job's pod template.
 
 As described previously, Kueue has built-in support for workloads created with
 the Job API. But any custom workload API can integrate with Kueue by
-creating a corresponding Workload object for it.
\ No newline at end of file
+creating a corresponding Workload object for it.
+
+## What's next
+
+- Learn how to [run jobs](/docs/tasks/run_jobs.md).
\ No newline at end of file
diff --git a/docs/setup/install.md b/docs/setup/install.md
index 4577d9d195..75c48de8ac 100644
--- a/docs/setup/install.md
+++ b/docs/setup/install.md
@@ -39,6 +39,20 @@ to scrape metrics from kueue components, run the following command:
 kubectl apply -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/prometheus.yaml
 ```
 
+### Uninstall
+
+To uninstall a released version of Kueue from your cluster, run the following command:
+
+```shell
+VERSION=v0.1.1
+kubectl delete -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
+```
+
+### Upgrading from 0.1 to 0.2
+
+Upgrading from `0.1.x` to `0.2.y` is not supported due to breaking API changes.
+To install Kueue `0.2.y`, [uninstall](#uninstall) the older version first.
+
 ## Install a custom-configured released version
 
 To install a custom-configured released version of Kueue in your cluster, execute the following steps:
diff --git a/docs/tasks/administer_cluster_quotas.md b/docs/tasks/administer_cluster_quotas.md
index afdfef433c..43a8225b20 100644
--- a/docs/tasks/administer_cluster_quotas.md
+++ b/docs/tasks/administer_cluster_quotas.md
@@ -93,7 +93,7 @@ kubectl apply -f default-flavor.yaml
 The `.metadata.name` matches the `.spec.resources[*].flavors[0].resourceFlavor`
 field in the ClusterQueue.
 
-### 3. Create [Queues](/docs/concepts/queue.md)
+### 3. Create [LocalQueues](/docs/concepts/local_queue.md)
 
 Users cannot directly send [workloads](/docs/concepts/workload.md) to
 ClusterQueues. Instead, users need to send their workloads to a Queue in their
@@ -101,12 +101,12 @@ namespace. Thus, for the queuing system to be complete, you need to create a
 Queue in each namespace that needs access to the ClusterQueue.
 
 
-Write the manifest for the Queue. It should look similar to the following:
+Write the manifest for the LocalQueue. It should look similar to the following:
 
 ```yaml
 # default-user-queue.yaml
 apiVersion: kueue.x-k8s.io/v1alpha1
-kind: Queue
+kind: LocalQueue
 metadata:
   namespace: default
   name: user-queue
@@ -114,7 +114,7 @@ spec:
   clusterQueue: cluster-total
 ```
 
-To create the Queue, run the following command:
+To create the LocalQueue, run the following command:
 
 ```shell
 kubectl apply -f default-user-queue.yaml
diff --git a/docs/tasks/run_jobs.md b/docs/tasks/run_jobs.md
index e846e3c822..b8153b833d 100644
--- a/docs/tasks/run_jobs.md
+++ b/docs/tasks/run_jobs.md
@@ -18,6 +18,8 @@ Make sure the following conditions are met:
 Run the following command to list the Queues available in your namespace.
 
 ```shell
+kubectl -n default get localqueues
+# Or use the 'queues' alias.
 kubectl -n default get queues
 ```