Docs: how to run services reliably and update service autorestart to service lifecycle. #541

Merged
5 changes: 2 additions & 3 deletions docs/how-to/index.md
@@ -14,19 +14,18 @@ Installation follows a similar pattern on all architectures. You can choose to i
Install Pebble <install-pebble>
```


## Service orchestration

As your needs grow, you may want to orchestrate multiple services.
As your needs grow, you may want to use advanced Pebble features to run services reliably and orchestrate multiple services.

```{toctree}
:titlesonly:
:maxdepth: 1

Run services reliably <run-services-reliably>
Manage service dependencies <service-dependencies>
```


## Identities

Use named "identities" to allow additional users to access the API.
96 changes: 96 additions & 0 deletions docs/how-to/run-services-reliably.md
@@ -0,0 +1,96 @@
# How to run services reliably

In this guide, we look at common service reliability challenges and how to mitigate them with one of Pebble's more advanced features: [health checks](../reference/health-checks).

## Service reliability in the modern microservice world

With the rise of microservice architectures, reliability matters more than ever. First, let's explore some common causes of unreliability in microservice architectures:

- Network Issues: Microservices rely heavily on network communications. Intermittent network failures, latency spikes, and connection drops can disrupt service interactions and lead to failures.
- Resource Exhaustion: A single microservice consuming excessive resources (CPU, memory, disk I/O, and so on) can impact not only its performance and availability but also potentially affect other services depending on it.
- Dependency Failures: Microservices often depend on other components, like a database or other microservices. If a critical dependency becomes unavailable, the dependent service might also fail.
- Cascading Failures: A failure in one service can trigger failures in other dependent services, creating a cascading effect that can quickly bring down a large part of the system.
- Deployment Issues: Frequent deployments can benefit microservices if managed properly, but they can also introduce instability if not. Errors during deployment, incorrect configurations, or incompatible versions can all cause reliability issues.
- Testing and Monitoring Gaps: Insufficient testing and monitoring can make it difficult to identify issues proactively, leading to unexpected failures and longer MTTR (mean time to repair).

## Health checks

To mitigate the reliability issues mentioned above, we need specific tooling, and health checks are one such tool: a key mechanism in the DevOps culture for monitoring services and detecting potential problems in microservice architectures, especially in containerized environments.

By running health checks periodically, we can mitigate some of the reliability issues listed above:

### Detect resource exhaustion

Health checks can monitor resource usage (CPU, memory, disk space) within a microservice. If resource consumption exceeds predefined thresholds, the health check can signal an unhealthy state, allowing for remediation such as scaling the service up or out, restarting it, or issuing alerts.

### Identify dependent service failures

Health checks can verify the availability of critical dependencies. A service's health check can include checks to ensure it can connect to its database, message queues, or other required services.

### Catch deployment issues

Health checks can be incorporated into the deployment process. After a new version of a service is deployed, the deployment pipeline can monitor its health status. If the health check fails, the deployment can be rolled back to the previous state, preventing a faulty version from affecting end users.

### Mitigate cascading failures

By quickly identifying unhealthy services, health checks can help prevent cascading failures. For example, load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover.

### More on health checks

Note that health checks are no silver bullet: they can't solve all the reliability challenges posed by microservice architectures. For example, while health checks can detect the consequences of network issues (such as the inability to connect to a dependency), they can't fix the underlying network problem itself. And while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring.

Also note that although health checks run on a schedule, they should not be used to run scheduled jobs such as periodic backups.

In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible.

## Using HTTP-type health checks

A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval.

The health check is considered successful if the request returns an HTTP 200 response. After a certain number of failures in a row (the threshold), the check is considered "down" (or unhealthy).

### Configuring HTTP-type health checks

Let's say we have a service `svc1` with a health check endpoint at `http://127.0.0.1:5000/health`. To configure a health check of HTTP type named `svc1-up` that accesses the health check endpoint at a 30-second interval with a timeout of 1 second and considers the check down if we get 3 failures in a row, we can use the following configuration:

```yaml
checks:
  svc1-up:
    override: replace
    period: 30s
    timeout: 1s
    threshold: 3
    http:
      url: http://127.0.0.1:5000/health
```

The configuration above contains three key options that you can tweak for each health check:

- `period`: How often to run the check (defaults to 10 seconds).
- `timeout`: If the check doesn't respond within this time (defaults to 3 seconds), it is considered an error.
- `threshold`: The number of consecutive errors (defaults to 3) after which the check is considered "down".

Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification).
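To give a rough idea of what the other two types look like, here is a minimal sketch; the check names, port, and file path below are illustrative and not part of this guide's example:

```yaml
checks:
  # A TCP-type check: considered up if a connection to the port succeeds.
  svc1-port:
    override: replace
    tcp:
      port: 5000

  # An exec-type check: considered up if the command exits with code 0.
  svc1-ready-file:
    override: replace
    period: 1m
    exec:
      command: /bin/sh -c 'test -f /run/svc1/ready'
```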

### Restarting the service when the health check fails

To automatically restart services when a health check fails, use `on-check-failure` in the service configuration.

To restart `svc1` when the health check named `svc1-up` fails, use the following configuration:

```yaml
services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      svc1-up: restart
```
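
Putting the check and the service together, the complete layer for this example might look roughly like the following; the summary and description fields are placeholders:

```yaml
summary: Sample layer for svc1
description: Runs svc1 and restarts it when its health check goes down.

services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      svc1-up: restart

checks:
  svc1-up:
    override: replace
    period: 30s
    timeout: 1s
    threshold: 3
    http:
      url: http://127.0.0.1:5000/health
```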

## See more

- [Health checks](../reference/health-checks)
- [Layer specification](../reference/layer-specification)
- [Service lifecycle](../reference/service-lifecycle)
- [How to manage service dependencies](service-dependencies)
4 changes: 2 additions & 2 deletions docs/reference/cli-commands.md
@@ -946,7 +946,7 @@ The "Current" column shows the current status of the service, and can be one of

* `active`: starting or running
* `inactive`: not yet started, being stopped, or stopped
* `backoff`: in a [backoff-restart loop](service-auto-restart.md)
* `backoff`: in a [backoff-restart loop](service-lifecycle.md)
* `error`: in an error state


@@ -992,7 +992,7 @@ any other services it depends on, in the correct order.
### How it works

- If the command is still running at the end of the 1 second window, the start is considered successful.
- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [](service-auto-restart.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error.
- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [Service lifecycle](service-lifecycle.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error.

### Examples

4 changes: 2 additions & 2 deletions docs/reference/index.md
@@ -20,7 +20,7 @@ Layer specification <layer-specification>
Log forwarding <log-forwarding>
Notices <notices>
Pebble in containers <pebble-in-containers>
Service auto-restart <service-auto-restart>
Service lifecycle <service-lifecycle>
```


@@ -53,7 +53,7 @@ When the Pebble daemon is running inside a remote system (for example, a separat

Pebble provides two ways to automatically restart services when they fail. Auto-restart is based on exit codes from services. Health checks are a more sophisticated way to test and report the availability of services.

* [Service auto-restart](service-auto-restart)
* [Service lifecycle](service-lifecycle)
* [Health checks](health-checks)


15 changes: 0 additions & 15 deletions docs/reference/service-auto-restart.md

This file was deleted.

56 changes: 56 additions & 0 deletions docs/reference/service-lifecycle.md
@@ -0,0 +1,56 @@
# Service lifecycle

Pebble manages the lifecycle of a service: starting, stopping, and restarting it, handling health checks and failures, and automatically restarting it with a backoff strategy. This is implemented as a state machine with the following states:

- initial: The service's initial state.
- starting: The service is in the process of starting.
- running: The `okayDelay` (see below) period has passed, and the service runs normally.
- terminating: The service is being gracefully terminated.
- killing: The service is being forcibly killed.
- stopped: The service has stopped.
- backoff: The service exited and is waiting out a backoff delay before the next start attempt (applies when the service is configured to restart on exit).
- exited: The service has exited (and won't be automatically restarted).

## Service start

A service begins in the "initial" state. Pebble tries to start the service's underlying process and transitions the service to the "starting" state.

## Start confirmation

Pebble waits for a short period (`okayDelay`, defaults to one second) after starting the service. If the service is still running once the `okayDelay` period has passed, it's considered successfully started, and its state transitions to "running".

Whether the service is in the "starting" or "running" state, its status is shown as "active" when you list services. For details, see the [`pebble services`](#reference_pebble_services_command) command.

## Start failure

If the service exits quickly after being started, the start is treated as a failure. The error, along with the most recent service logs, is added to the task (see [Changes and tasks](/reference/changes-and-tasks.md)), which keeps those logs accessible.

## Abort start

If the user interrupts the start process (for example, with a SIGKILL), the service transitions to the "stopped" state, and a SIGKILL signal is sent to the underlying process.

## Auto-restart

By default, Pebble's service manager automatically restarts services that exit unexpectedly, whether the service is in the "starting" state (the `okayDelay` period has not yet passed) or the "running" state (the `okayDelay` period has passed).

This is done whether the exit code is zero or non-zero, but you can fine-tune the behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are:

- `restart`: restart the service and enter a restart-backoff loop (the default behaviour).
- `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise).
- `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`).
- `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`).
- `ignore`: ignore the service exiting and do nothing further.
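
For illustration only, a hypothetical one-shot service that should be left alone when it exits cleanly but should bring the Pebble daemon down if it fails could be configured like this (the service name and command are made up):

```yaml
services:
  batch-job:
    override: replace
    command: /usr/local/bin/batch-job   # hypothetical command
    on-success: ignore     # a clean exit is expected; do nothing further
    on-failure: shutdown   # a failure shuts down the Pebble daemon (exit code 10)
```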

## Backoff

Pebble implements a backoff mechanism that increases the delay before restarting the service after each failed attempt. This prevents a failing service from consuming excessive resources.

The `backoff-delay` defaults to half a second, the `backoff-factor` defaults to 2.0 (doubling), and the increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. All three settings can be customized; read more in [Layer specification](../reference/layer-specification).

For example, with default settings for the above configuration, in `restart` mode, the first time a service exits, Pebble waits for half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which results in a 1-second delay. The next delay will be 2 seconds, then 4 seconds, and so on, capped at 30 seconds.

The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`.
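
As a sketch, these settings can be tuned per service in a configuration layer; the service name, command, and values below are chosen only for illustration:

```yaml
services:
  svc1:
    override: replace
    command: /usr/local/bin/svc1
    backoff-delay: 1s     # wait 1 second before the first restart
    backoff-factor: 1.5   # then 1.5s, 2.25s, and so on
    backoff-limit: 1m     # cap the delay (and the backoff reset time) at one minute
```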

## Auto-restart on health check failures

Pebble can be configured to automatically restart services based on health checks. To do so, use `on-check-failure` in the service configuration. Read more in [Health checks](health-checks).
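
A minimal sketch, assuming a service `svc1` and a check named `svc1-up` are defined in a lower layer:

```yaml
services:
  svc1:
    override: merge
    on-check-failure:
      svc1-up: restart   # restart svc1 when the svc1-up check reaches its failure threshold
```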