Docs: how to run services reliably and update service autorestart to service lifecycle. #541

Merged

Conversation

IronCore864
Contributor

@IronCore864 IronCore864 commented Dec 22, 2024

According to our discussion (internal doc, sorry), this is the first piece of the next few how-to guides for Pebble.

In this PR:

  • A new how-to guide, "How to run services reliably", covers health checks in detail. It is placed after "How to install", as the second how-to doc.
  • Per David's suggestion, the guide lists some causes of unreliability in the modern microservice world and explains which of them health checks in Pebble can mitigate.
  • Per David's suggestion, a few words on "what health checks are not" are added, so that users don't misuse this feature to run cron jobs.

See more details and outline in the above discussion doc.

Preview: https://canonical-pebble--541.com.readthedocs.build/en/541/how-to/run-services-reliably/

@IronCore864 IronCore864 requested a review from dwilding December 24, 2024 01:46
@IronCore864 IronCore864 marked this pull request as ready for review December 24, 2024 01:46
@dwilding
Contributor

Thank you for doing this! Before I work on a more detailed review, I have a few ideas about the overall contents:

  1. "Health checks" section - I think this section should have more prominence. So how about making each topic a separate subsection, instead of using bullets?

    Also in this section, I think we ought to make it obvious up-front that "health checks" are a combination of Pebble's feature + what the service author chooses to implement.

    Then within each topic we can say what we recommend the service to do. For example, in the topic "Identifying Dependent Service Failures", I really like what you wrote:

    A service’s health check can include checks to ensure it can connect to its database, message queue, or other required services.

    This is nice & clear advice about how to approach health checks on the service side.

  2. "Configuring health checks in Pebble" and "Restarting service on health check failure" sections - I'd probably combine these into a single section that is fully focused on how to configure a health check of HTTP type and restart the service when the health check fails.

    Since this is a how-to doc, it's OK to omit the different options for health checks. I think it's better to give a specific scenario, then link to the reference pages for people to understand the different options.

  3. "Demo service" and "Putting it all together" sections - I feel these are extremely helpful, and the health-check-sample-service idea is neat! Since these sections are more guided, I'd consider moving them to a tutorial instead.
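
To make the service-side advice in point 1 concrete, here is a minimal sketch of a health check that verifies a service can reach its dependencies. Everything in it is an assumption for illustration: the dependency names, hosts, ports, and function names are hypothetical, and a plain TCP connection is only a rough stand-in for a real database or message queue check.

```python
import socket

# Hypothetical dependencies of the service: name -> (host, port).
DEPENDENCIES = {
    "database": ("127.0.0.1", 5432),
    "message-queue": ("127.0.0.1", 5672),
}


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_dependencies(dependencies, probe=can_connect):
    """Return the sorted names of dependencies that the probe cannot reach."""
    return sorted(
        name
        for name, (host, port) in dependencies.items()
        if not probe(host, port)
    )


def health_status(dependencies, probe=can_connect):
    """Return (HTTP status, message): 200 if all are reachable, else 503."""
    failed = failed_dependencies(dependencies, probe)
    if not failed:
        return 200, "ok"
    return 503, "unreachable: " + ", ".join(failed)
```

A service could return this status and message from its health endpoint. Passing the probe as a parameter keeps the logic testable without real dependencies.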

@IronCore864
Contributor Author

Refactored according to 1 and 2; for 3, I haven't done anything yet, mostly because those two parts are not long enough for a tutorial. Should we do that anyway?

@IronCore864
Contributor Author

Per discussion elsewhere, we decided to remove the "Demo service" and "Putting it all together" sections. We will add a tutorial in the future using content from these sections.

@IronCore864 IronCore864 requested a review from benhoyt January 2, 2025 08:01
@dwilding
Contributor

dwilding commented Jan 3, 2025

Looking at "How to run services reliably" with fresh eyes, I think we should go further in refactoring the doc to make the actionable info stand out. I recommend that we drop "Service reliability in the modern microservice world" as a separate section, keeping the info as context within the section that follows.

The nature of this topic is going to require some explanatory content, but I think we can compress it down somewhat.

I've taken what you wrote and put together this structure - what do you think?

# How to run services reliably

You can use Pebble's [health checks](../reference/health-checks) feature to perform checks on services and restart services if the checks fail. To use health checks effectively, you should consider:

- How services monitor their own health and make that information available to Pebble
- How Pebble is configured to fetch health information and respond to unhealthy services

This guide demonstrates how to use health checks to address common service reliability challenges.

## Return health information from services

As you implement health checks within services, consider typical causes of unreliability. You can monitor for unhealthy conditions and expose that information for Pebble to consume.

A common way to expose health information is to use an HTTP endpoint. For an example of how to configure Pebble to check an HTTP endpoint, see [](run_services_reliably_use_http_checks).

### Detect resource exhaustion

A single microservice that consumes excessive resources (CPU, memory, disk I/O, and so on) can degrade its own performance and availability, and can also affect other services that depend on it.

Recommendation: Implement a health check that signals an unhealthy state if resource consumption exceeds predefined thresholds.
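
As a minimal sketch of this recommendation, the following service-side endpoint reports an unhealthy state when disk usage crosses a threshold. The 90% limit, the `/health` path, port 8080, and all function names are assumptions chosen for illustration, and disk usage stands in for whichever resource matters to the service.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical threshold: report unhealthy above 90% disk usage.
DISK_USAGE_LIMIT = 0.90


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def health_status(used_fraction, limit=DISK_USAGE_LIMIT):
    """Map a usage fraction to an HTTP status: 200 healthy, 503 unhealthy."""
    return 200 if used_fraction <= limit else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the assumed /health path is served; anything else is a 404.
        if self.path != "/health":
            self.send_error(404)
            return
        status = health_status(disk_usage_fraction())
        body = b"ok\n" if status == 200 else b"disk usage too high\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Serve the health endpoint until interrupted (the port is an assumption)."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A non-2xx response such as 503 is what an HTTP-based check would treat as a failure.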

### Identify dependent service failures

_Use a similar structure of brief context followed by recommendations_

<!-- more subsections here -->

(run_services_reliably_use_http_checks)=
## Use HTTP-based health checks in Pebble

<!-- details here -->
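
Pebble's checks are configured in a layer, not in Python, so the following is only a sketch of the logic behind restarting on repeated failures. It assumes a check run passes on a 2xx response and that the restart fires after `threshold` consecutive failures. The function name and the default of 3 are illustrative assumptions; the health checks reference has the actual options.

```python
def first_restart_index(statuses, threshold=3):
    """Return the index of the check run that would trigger a restart.

    statuses is a sequence of HTTP status codes, one per check run.
    Returns None if the failure streak never reaches the threshold.
    """
    consecutive_failures = 0
    for index, status in enumerate(statuses):
        if 200 <= status < 300:
            # A passing run resets the streak.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return index
    return None
```

Because a passing run resets the count, isolated failures do not cause a restart; only a sustained run of failures does.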

@dwilding
Contributor

dwilding commented Jan 3, 2025

I've finished adding comments on specific parts. There's a main refactoring suggestion here.

@IronCore864
Contributor Author

I haven't done this refactor yet because it seems to be a very significant rework. I have given it some thought, and the current logic is:

First, introduce health checks. Then, explain the scenarios in which they help, which leads to Pebble's health check feature, how to configure it, and why it improves reliability. In this way, we paint a picture, which I think a how-to guide should do: exactly which problems this document helps solve.

The suggested structure focuses more on Pebble and its features, without laying out much background first; the background is instead interwoven with the sections that follow. This was less clear to me, so I hesitated about the refactoring. Maybe we should get more input from @benhoyt.

Other suggestions are all adopted.

@dwilding
Contributor

dwilding commented Jan 6, 2025

It also sounds good to me if we wait to get Ben's input. If we end up not doing the refactoring, the part that I feel needs to be most emphasized somewhere is that "health checks" are a combination of:

  • How services monitor their own health and make that information available to Pebble
  • How Pebble is configured to fetch health information and respond to unhealthy services

I think this distinction is important context for the advice in the doc.

@IronCore864 IronCore864 requested a review from dwilding January 20, 2025 10:11
Contributor

@dwilding dwilding left a comment

I've finished reviewing. I think this update keeps the key context and makes the whole doc more approachable - good job 🙂

Contributor

@benhoyt benhoyt left a comment

This looks good! Requesting some minor changes, but I'm happy with the structure.

I think it'd be good to move the "service autorestart" -> "service lifecycle" changes to a separate PR for separate review. It'll also mean we can check off the "Pebble how-tos" roadmap item sooner. :-)

docs/how-to/run-services-reliably.md Outdated Show resolved Hide resolved
@IronCore864 IronCore864 requested a review from benhoyt January 24, 2025 02:44
Contributor

@benhoyt benhoyt left a comment

Thanks!

@IronCore864 IronCore864 merged commit 68a941e into canonical:master Jan 24, 2025
15 checks passed
@IronCore864 IronCore864 deleted the docs-how-to-run-services-reliably branch January 24, 2025 06:30