Docs: how to run services reliably and update service autorestart to service lifecycle. #541

Merged
5 changes: 2 additions & 3 deletions docs/how-to/index.md
@@ -14,19 +14,18 @@ Installation follows a similar pattern on all architectures. You can choose to i
Install Pebble <install-pebble>
```


## Service orchestration

As your needs grow, you may want to orchestrate multiple services.
As your needs grow, you may want to use advanced Pebble features to run services reliably and orchestrate multiple services.

```{toctree}
:titlesonly:
:maxdepth: 1

Run services reliably <run-services-reliably>
Manage service dependencies <service-dependencies>
```


## Identities

Use named "identities" to allow additional users to access the API.
96 changes: 96 additions & 0 deletions docs/how-to/run-services-reliably.md
@@ -0,0 +1,96 @@
# How to run services reliably

In this guide, we look at common service reliability challenges and how to mitigate them with one of Pebble's more advanced features: [health checks](../reference/health-checks).

## Service reliability in the modern microservice world

With the rise of microservice architectures, reliability matters more than ever. First, let's explore some common causes of unreliability in microservice architectures:

- Network Issues: Microservices rely heavily on network communications. Intermittent network failures, latency spikes, and connection drops can disrupt service interactions and lead to failures.
- Resource Exhaustion: A single microservice consuming excessive resources (CPU, memory, disk I/O, and so on) can impact not only its performance and availability but also potentially affect other services depending on it.
- Dependency Failures: Microservices often depend on other components, like a database or other microservices. If a critical dependency becomes unavailable, the dependent service might also fail.
- Cascading Failures: A failure in one service can trigger failures in other dependent services, creating a cascading effect that can quickly bring down a large part of the system.
- Deployment Issues: Frequent deployments can benefit microservices if managed properly, but they can also introduce instability if not. Errors during deployment, incorrect configurations, or incompatible versions can all cause reliability issues.
- Testing and Monitoring Gaps: Insufficient testing and monitoring can make it difficult to identify issues proactively, leading to unexpected failures and longer MTTR (mean time to repair).

## Health checks

To mitigate the reliability issues mentioned above, we need specific tooling, and health checks are one such tool: a key mechanism in the DevOps culture for monitoring services and detecting potential problems in microservice architectures, especially in containerized environments.

By running health checks periodically, we can mitigate some of the reliability issues listed above:

### Detect resource exhaustion

Health checks can monitor resource usage (CPU, memory, disk space) within a microservice. If resource consumption exceeds predefined thresholds, the health check can signal an unhealthy state, allowing for remediation such as scaling the service up or out, restarting it, or issuing alerts.

### Identify dependent service failures

Health checks can verify the availability of critical dependencies. A service's health check can include checks to ensure it can connect to its database, message queues, or other required services.

### Catch deployment issues

Health checks can be incorporated into the deployment process. After a new version of a service is deployed, the deployment pipeline can monitor its health status. If the health check fails, the deployment can be rolled back to the previous state, preventing a faulty version from affecting end users.

### Mitigate cascading failures

By quickly identifying unhealthy services, health checks can help prevent cascading failures. For example, load balancers and service discovery mechanisms can use health check information to route traffic away from failing services, giving them time to recover.

### More on health checks

Note that health checks are no silver bullet: they can't solve all the reliability challenges posed by microservice architectures. For example, while health checks can detect the consequences of network issues (such as the inability to connect to a dependency), they can't fix the underlying network problem itself. And while health checks are a valuable part of a monitoring strategy, they can't replace comprehensive testing and monitoring.

Also note that although health checks run on a schedule, they should not be used to run scheduled jobs such as periodic backups.

In summary, health checks are a powerful tool for improving the reliability of microservices by enabling early detection of problems and making automated recovery possible.

## Using HTTP-type health checks

A health check of the HTTP type issues HTTP `GET` requests to the health check URL at a user-specified interval.

The health check is considered successful if the request returns an HTTP 200 response. After a certain number of failures in a row (the threshold), the check is considered "down" (or unhealthy).

### Configuring HTTP-type health checks

Let's say we have a service `svc1` with a health check endpoint at `http://127.0.0.1:5000/health`. To configure a health check of HTTP type named `svc1-up` that accesses the health check endpoint at a 30-second interval with a timeout of 1 second and considers the check down if we get 3 failures in a row, we can use the following configuration:

```yaml
checks:
  svc1-up:
    override: replace
    period: 30s
    timeout: 1s
    threshold: 3
    http:
      url: http://127.0.0.1:5000/health
```

The configuration above contains three key options that you can tweak for each health check:

- `period`: How often to run the check (defaults to 10 seconds).
- `timeout`: If the check doesn't respond within this time (defaults to 3 seconds), it is considered an error.
- `threshold`: The number of consecutive errors (defaults to 3) after which the check is considered "down".

Besides the HTTP type, there are two more health check types in Pebble: `tcp`, which opens the given TCP port, and `exec`, which executes a user-specified command. For more information, see [Health checks](../reference/health-checks) and [Layer specification](../reference/layer-specification).
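To give a rough idea of what the other two types look like, here is a minimal sketch; the check names, port, and file path below are illustrative and not part of this guide's example:

```yaml
checks:
  # A TCP-type check: considered up if a connection to the port succeeds.
  svc1-port:
    override: replace
    tcp:
      port: 5000

  # An exec-type check: considered up if the command exits with code 0.
  svc1-ready-file:
    override: replace
    period: 1m
    exec:
      command: /bin/sh -c 'test -f /run/svc1/ready'
```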

### Restarting the service when the health check fails

To automatically restart services when a health check fails, use `on-check-failure` in the service configuration.

To restart `svc1` when the health check named `svc1-up` fails, use the following configuration:

```yaml
services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      svc1-up: restart
```
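
Putting the check and the service together, the complete layer for this example might look roughly like the following; the summary and description fields are placeholders:

```yaml
summary: Sample layer for svc1
description: Runs svc1 and restarts it when its health check goes down.

services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      svc1-up: restart

checks:
  svc1-up:
    override: replace
    period: 30s
    timeout: 1s
    threshold: 3
    http:
      url: http://127.0.0.1:5000/health
```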

## See more

- [Health checks](../reference/health-checks)
- [Layer specification](../reference/layer-specification)
- [Service lifecycle](../reference/service-lifecycle)
- [How to manage service dependencies](service-dependencies)
4 changes: 2 additions & 2 deletions docs/reference/cli-commands.md
@@ -946,7 +946,7 @@ The "Current" column shows the current status of the service, and can be one of

* `active`: starting or running
* `inactive`: not yet started, being stopped, or stopped
* `backoff`: in a [backoff-restart loop](service-auto-restart.md)
* `backoff`: in a [backoff-restart loop](service-lifecycle.md)
* `error`: in an error state


@@ -992,7 +992,7 @@ any other services it depends on, in the correct order.
### How it works

- If the command is still running at the end of the 1 second window, the start is considered successful.
- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [](service-auto-restart.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error.
- If the command exits within the 1 second window, Pebble retries the command after a configurable backoff, using the restart logic described in [Service lifecycle](service-lifecycle.md). If one of the started services exits within the 1 second window, `pebble start` prints an appropriate error message and exits with an error.

### Examples

4 changes: 2 additions & 2 deletions docs/reference/index.md
@@ -20,7 +20,7 @@ Layer specification <layer-specification>
Log forwarding <log-forwarding>
Notices <notices>
Pebble in containers <pebble-in-containers>
Service auto-restart <service-auto-restart>
Service lifecycle <service-lifecycle>
```


@@ -53,7 +53,7 @@ When the Pebble daemon is running inside a remote system (for example, a separat

Pebble provides two ways to automatically restart services when they fail. Auto-restart is based on exit codes from services. Health checks are a more sophisticated way to test and report the availability of services.

* [Service auto-restart](service-auto-restart)
* [Service lifecycle](service-lifecycle)
* [Health checks](health-checks)


15 changes: 0 additions & 15 deletions docs/reference/service-auto-restart.md

This file was deleted.

56 changes: 56 additions & 0 deletions docs/reference/service-lifecycle.md
@@ -0,0 +1,56 @@
# Service lifecycle

Pebble manages the lifecycle of a service: starting, stopping, and restarting it, handling health checks and failures, and automatically restarting it with a backoff strategy. This is implemented as a state machine with the following states:

- initial: The service's initial state.
- starting: The service is in the process of starting.
- running: The `okayDelay` (see below) period has passed, and the service runs normally.
- terminating: The service is being gracefully terminated.
- killing: The service is being forcibly killed.
- stopped: The service has stopped.
- backoff: The service exited and is waiting out a backoff delay before the next start attempt (applies when the service is configured to restart on exit).
- exited: The service has exited (and won't be automatically restarted).

## Service start

A service begins in the "initial" state. Pebble tries to start the service's underlying process and transitions the service to the "starting" state.

## Start confirmation

Pebble waits for a short period (`okayDelay`, defaults to one second) after starting the service. If the service is still running once the `okayDelay` period has passed, it's considered successfully started, and its state transitions to "running".

Whether the service is in the "starting" or "running" state, its status is shown as "active" when you list services. For details, see the [`pebble services`](#reference_pebble_services_command) command.

## Start failure

If the service exits quickly after being started, the start is treated as a failure. The error, along with the most recent service logs, is added to the task (see [Changes and tasks](/reference/changes-and-tasks.md)), which keeps those logs accessible.

## Abort start

If the user interrupts the start process (for example, with a SIGKILL), the service transitions to the "stopped" state, and a SIGKILL signal is sent to the underlying process.

## Auto-restart

By default, Pebble's service manager automatically restarts services that exit unexpectedly, whether the service is in the "starting" state (the `okayDelay` period has not yet passed) or the "running" state (the `okayDelay` period has passed).

This is done whether the exit code is zero or non-zero, but you can fine-tune the behaviour using the `on-success` and `on-failure` fields in a configuration layer. The possible values for these fields are:

- `restart`: restart the service and enter a restart-backoff loop (the default behaviour).
- `shutdown`: shut down and exit the Pebble daemon (with exit code 0 if the service exits successfully, exit code 10 otherwise).
- `success-shutdown`: shut down with exit code 0 (valid only for `on-failure`).
- `failure-shutdown`: shut down with exit code 10 (valid only for `on-success`).
- `ignore`: ignore the service exiting and do nothing further.
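
For illustration only, a hypothetical one-shot service that should be left alone when it exits cleanly but should bring the Pebble daemon down if it fails could be configured like this (the service name and command are made up):

```yaml
services:
  batch-job:
    override: replace
    command: /usr/local/bin/batch-job   # hypothetical command
    on-success: ignore     # a clean exit is expected; do nothing further
    on-failure: shutdown   # a failure shuts down the Pebble daemon (exit code 10)
```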

## Backoff

Pebble implements a backoff mechanism that increases the delay before restarting the service after each failed attempt. This prevents a failing service from consuming excessive resources.

The `backoff-delay` defaults to half a second, the `backoff-factor` defaults to 2.0 (doubling), and the increasing delay is capped at `backoff-limit`, which defaults to 30 seconds. All three settings can be customized; read more in [Layer specification](../reference/layer-specification).

For example, with default settings for the above configuration, in `restart` mode, the first time a service exits, Pebble waits for half a second. If the service exits again, Pebble calculates the next backoff delay by multiplying the current delay by `backoff-factor`, which results in a 1-second delay. The next delay will be 2 seconds, then 4 seconds, and so on, capped at 30 seconds.

The `backoff-limit` value is also used as a "backoff reset" time. If the service stays running after a restart for `backoff-limit` seconds, the backoff process is reset and the delay reverts to `backoff-delay`.
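
As a sketch, these settings can be tuned per service in a configuration layer; the service name, command, and values below are chosen only for illustration:

```yaml
services:
  svc1:
    override: replace
    command: /usr/local/bin/svc1
    backoff-delay: 1s     # wait 1 second before the first restart
    backoff-factor: 1.5   # then 1.5s, 2.25s, and so on
    backoff-limit: 1m     # cap the delay (and the backoff reset time) at one minute
```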

## Auto-restart on health check failures

Pebble can be configured to automatically restart services based on health checks. To do so, use `on-check-failure` in the service configuration. Read more in [Health checks](health-checks).
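
A minimal sketch, assuming a service `svc1` and a check named `svc1-up` are defined in a lower layer:

```yaml
services:
  svc1:
    override: merge
    on-check-failure:
      svc1-up: restart   # restart svc1 when the svc1-up check reaches its failure threshold
```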