chore: Add README.md with basic information about Prover Autoscaler (#3241)

## Checklist

<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zkstack dev fmt` and `zkstack dev
lint`.

ref ZKD-1855

---------

Co-authored-by: EmilLuta <[email protected]>
yorik and EmilLuta authored Nov 11, 2024
1 parent 8b31a85 commit 5cea9ec
Showing 1 changed file with 234 additions and 0 deletions.
234 changes: 234 additions & 0 deletions prover/crates/bin/prover_autoscaler/README.md
@@ -0,0 +1,234 @@
# Prover Autoscaler

Prover Autoscaler automatically scales Prover-related Kubernetes Deployments according to the load, in the cluster with
the highest chance of getting Nodes to run on. If that cluster runs out of resources, it moves the load to the next one.

## Design

Prover Autoscaler consists of the main Scaler component and Agents running in each cluster.

### Agent

Agents use the Kubernetes API to watch the status of Deployments and Pods as well as out-of-resources Events, and
perform scaling on request from the Scaler. They watch only the namespaces specified in the config. An Agent listens on
2 ports: `prometheus_port` to export metrics (path is `/metrics`), and `http_port` with 3 paths: `/healthz`, `/cluster`
to get the cluster status, and `/scale` to scale Deployments up or down.

### Scaler

Scaler collects cluster statuses from Agents and job queues from prover-job-monitor, calculates the needed number of
replicas, and sends scale requests to Agents.

Requests flow diagram:

```mermaid
sequenceDiagram
    participant prover-job-monitor
    participant Scaler
    box cluster1
    participant Agent1
    participant K8s API1
    end
    box cluster2
    participant Agent2
    participant K8s API2
    end
    loop Watch
        Agent1->>K8s API1: Watch namespaces
    end
    loop Watch
        Agent2->>K8s API2: Watch namespaces
    end
    loop Recalculate
        Scaler->>prover-job-monitor: /report
        Scaler->>Agent1: /cluster
        Scaler->>Agent2: /cluster
        Scaler->>Agent1: /scale
    end
```
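
One pass of the Recalculate loop in the diagram can be sketched in Python as follows. This is an illustration only:
`recalculate_once`, the injected callables and the shape of the scale requests are hypothetical names, not the
Autoscaler's actual API, and the real payload formats are not described in this README.

```python
def recalculate_once(fetch_queue_report, fetch_cluster, send_scale, agents, plan):
    """Run one pass of the Recalculate loop (hypothetical sketch).

    fetch_queue_report() -> queue report from prover-job-monitor
    fetch_cluster(agent) -> cluster status from the Agent's /cluster path
    send_scale(agent, request) -> sends a request to the Agent's /scale path
    plan(queues, clusters) -> {agent: scale request}, the scaling decision
    """
    # 1. Ask prover-job-monitor for the current job queues.
    queues = fetch_queue_report()
    # 2. Ask every Agent for the status of its cluster.
    clusters = {agent: fetch_cluster(agent) for agent in agents}
    # 3. Decide how many replicas each cluster should run.
    requests = plan(queues, clusters)
    # 4. Send scale requests only to the Agents that need a change.
    for agent, request in requests.items():
        send_scale(agent, request)
    return requests
```

Passing the network calls in as functions keeps the sketch independent of any HTTP client and of the real request
formats.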

Scaler supports 2 types of scaling algorithms: GPU and Simple. The GPU algorithm is usually used for the prover itself,
and all other Deployments use the Simple algorithm.

The Simple algorithm tries to scale the Deployment up to `queue / speed` replicas (rounded up) in the best cluster. If
there is not enough capacity, it continues in the next best cluster, and so on. On each run it selects the "best
cluster" using priority, number of capacity issues and cluster size. Capacity is limited by config (`max_provers` or
`max_replicas`) and also by the availability of machines in the cluster. Autoscaler detects that a cluster is running
out of particular machines by watching for `FailedScaleUp` events and by checking whether a Pod has been stuck in
Pending for longer than `long_pending_duration`. If insufficient capacity is detected, Pods that are not running are
moved.
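
The replica calculation and cluster fallback described above can be sketched in Python. This is a minimal illustration
under stated assumptions, not the actual implementation: the `Cluster` type, its field names, and the exact sort order
(priority first, then fewest capacity issues, then largest cluster) are hypothetical, since the text names the three
criteria but not how they are combined.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    priority: int         # lower value is used first (see `cluster_priorities`)
    capacity_issues: int  # hypothetical count of recent out-of-capacity signals
    size: int             # hypothetical measure of cluster size
    max_replicas: int     # per-cluster cap taken from the config


def order_clusters(clusters):
    # Assumed ordering: priority, then fewest capacity issues, then largest size.
    return sorted(clusters, key=lambda c: (c.priority, c.capacity_issues, -c.size))


def plan_simple(queue, speed, clusters):
    """Return {cluster name: replicas} for one Deployment."""
    needed = math.ceil(queue / speed)  # `queue / speed`, rounded up
    plan = {}
    for cluster in order_clusters(clusters):
        if needed <= 0:
            break
        # Fill the best cluster up to its cap, then spill into the next one.
        replicas = min(needed, cluster.max_replicas)
        if replicas > 0:
            plan[cluster.name] = replicas
            needed -= replicas
    return plan
```

For example, with a queue of 125 jobs and `speed: 10`, 13 replicas are needed; if `cluster1` (priority 0) is capped at
10 and `cluster2` (priority 100) at 20, the sketch places 10 replicas in `cluster1` and 3 in `cluster2`.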

The GPU algorithm works similarly to the Simple one, but it also recognises different GPU types and distributes load
across L4 GPUs first, then T4, V100, P100 and A100, if available.
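
The GPU preference order can be illustrated with a short Python sketch for a single cluster. The function name
`plan_gpu` and the way the remaining queue is reduced after each GPU type are assumptions made for illustration; only
the GPU order, the `max_provers` cap and the `prover_speed` divider (default 500) come from this README.

```python
import math

# Preference order stated in the text above.
GPU_ORDER = ["L4", "T4", "V100", "P100", "A100"]


def plan_gpu(queue, prover_speed, max_provers, default_speed=500):
    """Return {GPU type: provers} for one cluster (hypothetical sketch).

    prover_speed: {GPU type: speed divider}
    max_provers:  {GPU type: maximum number of provers in this cluster}
    """
    plan = {}
    remaining = queue
    for gpu in GPU_ORDER:
        if remaining <= 0:
            break
        limit = max_provers.get(gpu, 0)
        if limit <= 0:
            continue  # this GPU type is not available in the cluster
        speed = prover_speed.get(gpu, default_speed)
        provers = min(math.ceil(remaining / speed), limit)
        plan[gpu] = provers
        # Assumption: each prover absorbs `speed` jobs from the queue.
        remaining -= provers * speed
    return plan
```

With a queue of 1000 jobs, `prover_speed` of `L4: 500, T4: 400` and `max_provers` of `L4: 1, T4: 200`, the sketch
assigns 1 prover on L4 and 2 on T4.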

Different namespaces run different protocol versions and are completely independent. Normally only one namespace is
active; both are active only during a protocol upgrade. Each namespace must have the correct version of the binaries
installed, see the `protocol_versions` config option.

## Dependencies

- [prover-job-monitor](.../prover_job_monitor/)
- Kubernetes API
- GCP API (optional)

## Permissions

Agents need the following Kubernetes permissions:

```yaml
- apiGroups:
    - ''
  resources:
    - pods
    - events
    - namespaces
    - nodes
  verbs:
    - get
    - watch
    - list
- apiGroups:
    - apps
  resources:
    - deployments
    - replicasets
  verbs:
    - get
    - list
    - watch
    - patch
    - update
```

## Configuration

Prover Autoscaler requires a YAML config file, provided via the `--config-path` flag. You also need to specify which
job to run, Scaler or Agent, using the `--job=scaler` or `--job=agent` flag respectively.

### Common configuration

- `graceful_shutdown_timeout` is the time to wait for all tasks to finish before a forced shutdown. Default: 5s.
- `observability` section configures the `log_format` (`plain` or `json`) and per-module log levels with
  `log_directives`.

Example:

```yaml
graceful_shutdown_timeout: 5s
observability:
  log_format: plain
  log_directives: 'zksync_prover_autoscaler=debug'
```

### Agent configuration

`agent_config` section configures Agent parameters:

- `prometheus_port` is the port on which Prometheus metrics are served (path is `/metrics`).
- `http_port` is the main port for the Scaler to connect to.
- `namespaces` is the list of namespaces to watch.
- `dry_run` if enabled, Agent will not change the number of replicas, it will just report success. Default: true.

Example:

```yaml
agent_config:
  prometheus_port: 8080
  http_port: 8081
  namespaces:
    - prover-old
    - prover-new
  dry_run: true
```

### Scaler configuration

`scaler_config` section configures Scaler parameters:

- `dry_run` if enabled, Scaler will not send any scale requests. Default: false.
- `prometheus_port` is the port on which Prometheus metrics are served (path is `/metrics`).
- `prover_job_monitor_url` is the full URL to get the queue report from prover-job-monitor.
- `agents` is the list of Agents to send requests to.
- `scaler_run_interval` is the interval between recalculations. Default: 10s.
- `protocol_versions` is a map of namespace to the protocol version it processes. It must correspond to the binary
  versions running there!
- `cluster_priorities` is a map of cluster name to priority; the lowest is used first.
- `min_provers` is a map of namespace to the minimum number of provers to run even if the queue is empty.
- `max_provers` is a map of cluster name to a map of GPU type to the maximum number of provers.
- `prover_speed` is a map of GPU type to speed divider. Default: 500.
- `long_pending_duration` is the time after which a pending Pod is considered long pending and will be relocated to a
  different cluster. Default: 10m.
- `scaler_targets` subsection is a list of Simple targets:
  - `queue_report_field` is the name of the corresponding queue report section. See the example for possible options.
  - `deployment` is the name of the Deployment to scale.
  - `max_replicas` is a map of cluster name to the maximum number of replicas.
  - `speed` is a divider for the corresponding queue.

Example:

```yaml
scaler_config:
  dry_run: true
  prometheus_port: 8082
  prover_job_monitor_url: http://prover-job-monitor.default.svc.cluster.local:3074/queue_report
  agents:
    - http://prover-autoscaler-agent.cluster1.com
    - http://prover-autoscaler-agent.cluster2.com
    - http://prover-autoscaler-agent.cluster3.com
  scaler_run_interval: 30s
  protocol_versions:
    prover-old: 0.24.2
    prover-new: 0.25.0
  cluster_priorities:
    cluster1: 0
    cluster2: 100
    cluster3: 200
  min_provers:
    prover-new: 0
  max_provers:
    cluster1:
      L4: 1
      T4: 200
    cluster2:
      L4: 100
      T4: 200
    cluster3:
      L4: 100
      T4: 100
  prover_speed:
    L4: 500
    T4: 400
  long_pending_duration: 10m
  scaler_targets:
    - queue_report_field: basic_witness_jobs
      deployment: witness-generator-basic-fri
      max_replicas:
        cluster1: 10
        cluster2: 20
      speed: 10
    - queue_report_field: leaf_witness_jobs
      deployment: witness-generator-leaf-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: node_witness_jobs
      deployment: witness-generator-node-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: recursion_tip_witness_jobs
      deployment: witness-generator-recursion-tip-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: scheduler_witness_jobs
      deployment: witness-generator-scheduler-fri
      max_replicas:
        cluster1: 10
      speed: 10
    - queue_report_field: proof_compressor_jobs
      deployment: proof-fri-gpu-compressor
      max_replicas:
        cluster1: 10
        cluster2: 10
      speed: 5
```
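
The `long_pending_duration` check described above can be sketched in Python. The helper names `parse_duration` and
`is_long_pending` are hypothetical, and the parser only handles the `s`, `m` and `h` suffixes seen in the examples; the
duration syntax actually accepted by the config may be richer.

```python
from datetime import datetime, timedelta, timezone

# Assumed subset of duration suffixes, based on values like `5s` and `10m`.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text):
    """Parse strings such as '30s', '10m' or '2h' into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[unit]: int(value)})


def is_long_pending(pending_since, now, long_pending_duration="10m"):
    """True if a Pod has been Pending for longer than the threshold."""
    return now - pending_since > parse_duration(long_pending_duration)


# Usage: a Pod pending for 11 minutes exceeds the default 10m threshold.
now = datetime(2024, 11, 11, 12, 0, tzinfo=timezone.utc)
print(is_long_pending(now - timedelta(minutes=11), now))
```

A Pod flagged this way would count as a capacity issue for its cluster, which is what lets the Scaler move it elsewhere
on a later run.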
