Docs: how to run services reliably and update service autorestart to service lifecycle. #541

Merged
4 changes: 2 additions & 2 deletions docs/how-to/index.md
@@ -14,15 +14,15 @@ Installation follows a similar pattern on all architectures. You can choose to i
Install Pebble <install-pebble>
```


## Service orchestration

As your needs grow, you may want to orchestrate multiple services.
As your needs grow, you may want to use advanced Pebble features to run services reliably and orchestrate multiple services.

```{toctree}
:titlesonly:
:maxdepth: 1

Run services reliably <run-services-reliably>
Manage service dependencies <service-dependencies>
```

74 changes: 74 additions & 0 deletions docs/how-to/run-services-reliably.md
@@ -0,0 +1,74 @@
# How to run services reliably

Microservice architectures offer flexibility, but they can introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. Health checks help address these issues by monitoring resource usage, checking the availability of dependencies, catching problems with new deployments, and preventing downtime by letting you redirect traffic away from failing services.

To help you manage services more reliably, Pebble provides a comprehensive health check feature.

## Use health checks of the HTTP type

A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval.

The health check is considered successful if the request returns an HTTP 200 response. After a certain number of consecutive failures, the check is considered "down" (or unhealthy).

### Configure HTTP-type health checks

For example, we can configure a health check of HTTP type named `svc1-up` that checks the endpoint `http://127.0.0.1:5000/health`:

```yaml
checks:
  svc1-up:
    override: replace
    period: 10s
    timeout: 3s
    threshold: 3
    http:
      url: http://127.0.0.1:5000/health
```

The configuration above contains three key options that we can tweak for each health check:

- `period`: How often to run the check (defaults to 10 seconds).
- `timeout`: If the check doesn't respond within this time (defaults to 3 seconds), it's considered an error.
- `threshold`: How many consecutive errors (defaults to 3) it takes before the check is considered "down".

Given the default values, a minimal check looks like the following:

```yaml
checks:
  svc1-up:
    override: replace
    http:
      url: http://127.0.0.1:5000/health
```

Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens a connection to the given TCP port, and `exec`, which runs a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification).
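As a rough sketch (the check names, port, and command below are hypothetical), `tcp` and `exec` checks are configured in a similar way:

```yaml
checks:
  # Hypothetical TCP check: passes if a connection to the port can be opened.
  svc1-port:
    override: replace
    tcp:
      port: 5000

  # Hypothetical exec check: passes if the command exits with code 0.
  svc1-cli:
    override: replace
    exec:
      command: /usr/local/bin/healthcheck.sh
```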

### Restart the service when the health check fails

To automatically restart services when a health check fails, use `on-check-failure` in the service configuration.

To restart `svc1` when the health check named `svc1-up` fails, use the following configuration:

```yaml
services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      svc1-up: restart
```
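
Putting the two pieces together, here's a sketch of a single layer that combines the examples above — the name under `on-check-failure` must match a check defined in the `checks` section:

```yaml
checks:
  svc1-up:
    override: replace
    http:
      url: http://127.0.0.1:5000/health

services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      # Restart svc1 whenever the svc1-up check goes down.
      svc1-up: restart
```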

## Limitations of health checks

Although health checks are useful, they are not a complete solution for reliability:

- Health checks can detect issues such as a failed database connection due to network issues, but they can't fix the network issue itself.
- Health checks also can't replace testing and monitoring.
- Health checks shouldn't be used for scheduling tasks like backups.

## See more

- [Health checks](../reference/health-checks)
- [Layer specification](../reference/layer-specification)
- [Service lifecycle](../reference/service-lifecycle)
4 changes: 2 additions & 2 deletions docs/reference/cli-commands.md
@@ -950,7 +950,7 @@ The "Current" column shows the current status of the service, and can be one of

* `active`: starting or running
* `inactive`: not yet started, being stopped, or stopped
* `backoff`: in a [backoff-restart loop](service-auto-restart.md)
* `backoff`: in a [backoff-restart loop](service-lifecycle.md)
* `error`: in an error state


@@ -996,7 +996,7 @@ any other services it depends on, in the correct order.
### How it works

- If the command is still running at the end of the 1 second window, the start is considered successful.
- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [](service-auto-restart.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error.
- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [Service lifecycle](service-lifecycle.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error.

### Examples

4 changes: 2 additions & 2 deletions docs/reference/index.md
@@ -20,7 +20,7 @@ Layers <layers>
Layer specification <layer-specification>
Log forwarding <log-forwarding>
Notices <notices>
Service auto-restart <service-auto-restart>
Service lifecycle <service-lifecycle>
```


@@ -46,7 +46,7 @@ The `pebble` command has several subcommands.

Pebble provides two ways to automatically restart services when they fail. Auto-restart is based on exit codes from services. Health checks are a more sophisticated way to test and report the availability of services.

* [Service auto-restart](service-auto-restart)
* [Service lifecycle](service-lifecycle)
* [Health checks](health-checks)


15 changes: 0 additions & 15 deletions docs/reference/service-auto-restart.md

This file was deleted.

58 changes: 58 additions & 0 deletions docs/reference/service-lifecycle.md
@@ -0,0 +1,58 @@
# Service lifecycle

Pebble manages the lifecycle of a service, including starting, stopping, and restarting it. Pebble also handles health checks, failures, and auto-restart with backoff. This is all achieved using a state machine with the following states:

- initial: The service's initial state.
- starting: The service is in the process of starting.
- running: The `okayDelay` (see below) period has passed, and the service runs normally.
- terminating: The service is being gracefully terminated.
- killing: The service is being forcibly killed.
- stopped: The service has stopped.
- backoff: The service exited and is waiting out the backoff delay before the next start attempt (applies when the service is configured to restart on exit).
- exited: The service has exited (and won't be automatically restarted).

## Service start

A service begins in an "initial" state. Pebble tries to start the service's underlying process and transitions the service to the "starting" state.

## Start confirmation

Pebble waits for a short period (`okayDelay`, defaults to one second) after starting the service. If the service runs without exiting after the `okayDelay` period, it's considered successfully started, and the service's state is transitioned into "running".

Whether the service is in the "starting" or "running" state, its status is shown as "active" when you query the service. See the [`pebble services`](#reference_pebble_services_command) command for details.

## Start failure

If the service exits quickly, an error, along with the service's last logs, is added to the task (see [Changes and tasks](/reference/changes-and-tasks.md)). This also ensures the logs remain accessible.

## Abort start

If the user interrupts the start process (e.g., with a SIGKILL), the service transitions to stopped, and a SIGKILL signal is sent to the underlying process.

## Auto-restart

By default, Pebble's service manager automatically restarts services that exit unexpectedly, whether the service is in the "starting" state (the `okayDelay` period has not yet passed) or in the "running" state (`okayDelay` has passed).

Pebble considers a service to have exited unexpectedly if the exit code is non-zero.

You can fine-tune the auto-restart behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are:

* `restart`: restart the service and enter a restart-backoff loop (the default behaviour).
* `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise).
* `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`).
* `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`).
* `ignore`: ignore the service exiting and do nothing further.
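
For example, a service section might combine these fields as follows (a sketch — the service name and command are hypothetical):

```yaml
services:
  svc1:
    override: replace
    command: /usr/local/bin/svc1
    startup: enabled
    # Restart (with backoff) if svc1 exits with a non-zero exit code.
    on-failure: restart
    # Shut down and exit the Pebble daemon if svc1 exits cleanly.
    on-success: shutdown
```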

## Backoff

Pebble implements a backoff mechanism that increases the delay before restarting the service after each failed attempt. This prevents a failing service from consuming excessive resources.

The `backoff-delay` defaults to half a second, the `backoff-factor` defaults to 2.0 (doubling), and the increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. All three settings can be customized; read more in [Layer specification](../reference/layer-specification).

With the default settings, in `restart` mode, the first time a service exits, Pebble waits half a second before restarting it. If the service exits again, Pebble multiplies the current delay by `backoff-factor`, giving a 1-second delay; the next delay is 2 seconds, then 4 seconds, and so on, capped at 30 seconds.

The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`.
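
As a sketch (the service name and command are hypothetical; the field names come from the layer specification), customized backoff settings look like this:

```yaml
services:
  svc1:
    override: replace
    command: /usr/local/bin/svc1
    startup: enabled
    on-failure: restart
    # Wait 1 second before the first restart, double the delay after each
    # subsequent exit, and cap the delay (and the reset window) at 60 seconds.
    backoff-delay: 1s
    backoff-factor: 2.0
    backoff-limit: 60s
```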

## Auto-restart on health check failures

Pebble can be configured to automatically restart services based on health checks. To do so, use `on-check-failure` in the service configuration. Read more in [Health checks](health-checks).