chore: Add README.md with basic information about Prover Autoscaler (#3241)

Co-authored-by: EmilLuta <[email protected]>

# Prover Autoscaler

Prover Autoscaler automatically scales Prover-related Kubernetes Deployments according to the load, preferring clusters
with a higher chance of getting Nodes to run on. If a cluster runs out of resources, it moves the load to the next one.

## Design

Prover Autoscaler consists of a central Scaler and an Agent running in each cluster.

### Agent

Agents use the Kubernetes API to watch the status of Deployments, Pods, and out-of-resources Events, and perform
scaling on request from the Scaler. They watch only the namespaces specified in the config. An Agent listens on 2
ports: `prometheus_port` to export metrics (path is `/metrics`), and `http_port` with 3 paths: `/healthz`, `/cluster`
to get the cluster status, and `/scale` to scale Deployments up or down.

### Scaler

The Scaler collects cluster statuses from the Agents and job queue lengths from prover-job-monitor, calculates the
needed number of replicas, and sends scale requests to the Agents.

Request flow diagram:

```mermaid
sequenceDiagram
    participant prover-job-monitor
    participant Scaler
    box cluster1
    participant Agent1
    participant K8s API1
    end
    box cluster2
    participant Agent2
    participant K8s API2
    end
    loop Watch
        Agent1->>K8s API1: Watch namespaces
    end
    loop Watch
        Agent2->>K8s API2: Watch namespaces
    end
    loop Recalculate
        Scaler->>prover-job-monitor: /report
        Scaler->>Agent1: /cluster
        Scaler->>Agent2: /cluster
        Scaler->>Agent1: /scale
    end
```

Scaler supports 2 types of scaling algorithms: GPU and Simple. The GPU algorithm is usually used for the prover itself;
all other Deployments use the Simple algorithm.

The Simple algorithm tries to scale the Deployment up to `queue / speed` replicas (rounded up) in the best cluster. If
there is not enough capacity, it continues in the next best cluster, and so on. On each run it selects the "best
cluster" using priority, the number of capacity issues, and cluster size. Capacity is limited by config (`max_provers`
or `max_replicas`) and also by the availability of machines in the cluster. Autoscaler detects that a cluster is
running out of a particular machine type by watching for `FailedScaleUp` events and by checking whether a Pod has been
stuck in Pending for longer than `long_pending_duration`. If insufficient capacity is detected, Pods that are not
running are moved to another cluster.

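As a rough illustration, the cluster-by-cluster assignment can be sketched as follows. This is a simplified model, not
the actual implementation: the function name `assign_replicas` and its parameters are assumptions, and the real Scaler
also tracks capacity issues, priorities, and relocation of pending Pods.

```python
import math

def assign_replicas(queue, speed, clusters, max_replicas):
    """Distribute ceil(queue / speed) replicas across clusters, best cluster first.

    Simplified sketch: `clusters` is assumed to be sorted best-first,
    and `max_replicas` maps cluster name -> capacity cap.
    """
    wanted = math.ceil(queue / speed)
    plan = {}
    for cluster in clusters:
        if wanted == 0:
            break
        # Take as many replicas as this cluster's cap allows, spill the rest over.
        take = min(wanted, max_replicas.get(cluster, 0))
        if take:
            plan[cluster] = take
            wanted -= take
    return plan

# 95 queued jobs at speed 10 -> 10 replicas: cluster1 fills its cap of 6, cluster2 takes the rest.
print(assign_replicas(95, 10, ["cluster1", "cluster2"], {"cluster1": 6, "cluster2": 20}))
# -> {'cluster1': 6, 'cluster2': 4}
```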
The GPU algorithm works similarly to the Simple one, but it also recognises different GPU types and distributes the
load across L4 GPUs first, then T4, V100, P100, and A100, if available.

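A minimal sketch of that GPU-type preference (hypothetical names; the real algorithm also folds in cluster selection
and capacity tracking as described above):

```python
# Preference order used by the GPU algorithm: L4 first, then T4, V100, P100, A100.
GPU_PRIORITY = ["L4", "T4", "V100", "P100", "A100"]

def split_by_gpu(wanted, max_provers):
    """Fill preferred GPU types first, up to each type's cap.

    Sketch only; `max_provers` maps GPU type -> maximum number of provers.
    """
    plan = {}
    for gpu in GPU_PRIORITY:
        if wanted == 0:
            break
        take = min(wanted, max_provers.get(gpu, 0))
        if take:
            plan[gpu] = take
            wanted -= take
    return plan

# 150 provers wanted: L4 capacity (100) is exhausted first, T4 covers the remainder.
print(split_by_gpu(150, {"L4": 100, "T4": 200}))
# -> {'L4': 100, 'T4': 50}
```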
Different namespaces run different protocol versions and are completely independent. Normally only one namespace is
active; both are active only during a protocol upgrade. Each namespace has to have the correct version of the binaries
installed, see the `protocol_versions` config option.

## Dependencies

- [prover-job-monitor](.../prover_job_monitor/)
- Kubernetes API
- GCP API (optional)

## Permissions

Agents need the following Kubernetes permissions:

```yaml
- apiGroups:
    - ''
  resources:
    - pods
    - events
    - namespaces
    - nodes
  verbs:
    - get
    - watch
    - list
- apiGroups:
    - apps
  resources:
    - deployments
    - replicasets
  verbs:
    - get
    - list
    - watch
    - patch
    - update
```

## Configuration

Prover Autoscaler requires a config file provided via the `--config-path` flag; the supported format is YAML. You also
need to specify which job to run, Scaler or Agent, using the `--job=scaler` or `--job=agent` flag correspondingly.

### Common configuration

- `graceful_shutdown_timeout` is the time to wait for all tasks to finish before forcing shutdown. Default: 5s.
- `observability` section configures the `log_format` (`plain` or `json`) and per-module log levels via
  `log_directives`.

Example:

```yaml
graceful_shutdown_timeout: 5s
observability:
  log_format: plain
  log_directives: 'zksync_prover_autoscaler=debug'
```

### Agent configuration

The `agent_config` section configures Agent parameters:

- `prometheus_port` is the port to serve Prometheus metrics on (path is `/metrics`).
- `http_port` is the main port for the Scaler to connect to.
- `namespaces` is the list of namespaces to watch.
- `dry_run` if enabled, the Agent will not change the number of replicas, just report success. Default: true.

Example:

```yaml
agent_config:
  prometheus_port: 8080
  http_port: 8081
  namespaces:
    - prover-old
    - prover-new
  dry_run: true
```

### Scaler configuration

The `scaler_config` section configures Scaler parameters:

- `dry_run` if enabled, the Scaler will not send any scale requests. Default: false.
- `prometheus_port` is the port to serve Prometheus metrics on (path is `/metrics`).
- `prover_job_monitor_url` is the full URL to fetch the queue report from prover-job-monitor.
- `agents` is the list of Agents to send requests to.
- `scaler_run_interval` is the interval between re-calculations. Default: 10s.
- `protocol_versions` is a map from namespace to the protocol version it processes. It must match the binary versions
  running there!
- `cluster_priorities` is a map from cluster name to priority; lower values are used first.
- `min_provers` is a map from namespace to the minimum number of provers to run even if the queue is empty.
- `max_provers` is a map from cluster name to a map from GPU type to the maximum number of provers.
- `prover_speed` is a map from GPU type to a speed divider. Default: 500.
- `long_pending_duration` is the time after which a pending Pod is considered long-pending and will be relocated to a
  different cluster. Default: 10m.
- `scaler_targets` subsection is a list of Simple targets:
  - `queue_report_field` is the name of the corresponding queue report section. See the example for possible options.
  - `deployment` is the name of the Deployment to scale.
  - `max_replicas` is a map from cluster name to the maximum number of replicas.
  - `speed` is the divider for the corresponding queue.

Example:

```yaml
scaler_config:
  dry_run: true
  prometheus_port: 8082
  prover_job_monitor_url: http://prover-job-monitor.default.svc.cluster.local:3074/queue_report
  agents:
    - http://prover-autoscaler-agent.cluster1.com
    - http://prover-autoscaler-agent.cluster2.com
    - http://prover-autoscaler-agent.cluster3.com
  scaler_run_interval: 30s
  protocol_versions:
    prover-old: 0.24.2
    prover-new: 0.25.0
  cluster_priorities:
    cluster1: 0
    cluster2: 100
    cluster3: 200
  min_provers:
    prover-new: 0
  max_provers:
    cluster1:
      L4: 1
      T4: 200
    cluster2:
      L4: 100
      T4: 200
    cluster3:
      L4: 100
      T4: 100
  prover_speed:
    L4: 500
    T4: 400
  long_pending_duration: 10m
  scaler_targets:
    - queue_report_field: basic_witness_jobs
      deployment: witness-generator-basic-fri
      max_replicas:
        cluster1: 10
        cluster2: 20
      speed: 10
    - queue_report_field: leaf_witness_jobs
      deployment: witness-generator-leaf-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: node_witness_jobs
      deployment: witness-generator-node-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: recursion_tip_witness_jobs
      deployment: witness-generator-recursion-tip-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: scheduler_witness_jobs
      deployment: witness-generator-scheduler-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: proof_compressor_jobs
      deployment: proof-fri-gpu-compressor
      max_replicas:
        cluster1: 10
        cluster2: 10
      speed: 5
```