This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Alert-manager: Kill low-gpu-utilization jobs, tag abnormal jobs (#4940)
* alert manager based gpu utilization enhancement
shaiic-pai authored Sep 29, 2020
1 parent 9755553 commit cf4e6a8
Showing 53 changed files with 4,424 additions and 328 deletions.
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,8 @@ The [admin manual](https://openpai.readthedocs.io/en/latest/manual/cluster-admin

- **Users and groups management**. Administrators could manage the [users and groups](https://openpai.readthedocs.io/en/latest/manual/cluster-admin/how-to-manage-users-and-groups.html) easily.

- **Alerts management**. Administrators could [customize alerts rules and actions](https://openpai.readthedocs.io/en/latest/manual/cluster-admin/how-to-customize-alerts.html).

- **Customization**. Administrators could customize the cluster by [plugins](https://openpai.readthedocs.io/en/latest/manual/cluster-admin/how-to-customize-cluster-by-plugins.html). Administrators could also upgrade (or downgrade) a single component (e.g. rest servers) to address customized application demands.

### For cluster users
52 changes: 41 additions & 11 deletions contrib/kubespray/quick-start/services-configuration.yaml.template
Expand Up @@ -164,17 +164,37 @@ authentication:
# port: 80


# if you want to enable alert manager to send alert email, uncomment following lines and fill
# the right values.
# if you want to enable alert-handler actions, uncomment following lines and fill the right values.
# alert-manager:
# receiver: [email protected]
# smtp_url: smtp.office365.com:587
# smtp_from: [email protected]
# smtp_auth_username: [email protected]
# smtp_auth_password: password_for_alert_sender
# port: 9093 # this is optional, you should not write this if you do not want to change the port alert-manager is listening on

# uncomment following if you want to change customeize grafana
# port: 9093 # do not change this if you do not want to change the port alert-manager is listening on
# alert-handler: # alert-handler will only be enabled when this field is not empty
# port: 9095 # do not change this if you do not want to change the port alert-handler is listening on
# log-level: "info"
# pai-bearer-token: 'your-application-token-for-pai-rest-server'
# email-configs: # email-notification will only be enabled when this field is not empty
# admin-receiver: [email protected]
# smtp-host: smtp.office365.com
# smtp-port: 587
# smtp-from: [email protected]
# smtp-auth-username: [email protected]
# smtp-auth-password: password-for-alert-sender
# customized-routes:
# routes:
# - receiver: pai-email-admin-user-and-stop-job
# match:
# alertname: PAIJobGpuPercentLowerThan0_3For1h
# customized-receivers:
# - name: "pai-email-admin-user-and-stop-job"
# actions:
# - email-admin
# - email-user
# - stop-jobs
# - tag-jobs
# tags:
# - 'stopped-by-alert-manager'


# uncomment following if you want to change customize grafana
# grafana:
# port: 3000

Expand All @@ -191,11 +211,21 @@ authentication:
# interface: eth0,eno2


# uncomment following if you want to change customeize prometheus
# uncomment following if you want to customize prometheus
# prometheus:
# port: 9091
# # How frequently to scrape targets
# scrape_interval: 30
# customized-alerts: |
# groups:
# - name: customized-alerts
# rules:
# - alert: PAIJobGpuPercentLowerThan0_3For1h
# expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
# for: 1h
# annotations:
# summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
# description: Monitor job level gpu utilization in certain virtual clusters.


# uncomment following section if you want to customize the threshold of cleaner
4 changes: 2 additions & 2 deletions deployment/paiLibrary/common/template_handler.py
Expand Up @@ -19,13 +19,13 @@
import logging
import logging.config

logger = logging.getLogger(__name__)

logger = logging.getLogger(__name__)

def generate_from_template_dict(template_data, map_table):

generated_file = jinja2.Template(template_data).render(
map_table
)

return generated_file
return generated_file
4 changes: 0 additions & 4 deletions deployment/paiLibrary/paiService/service_management_start.py
Expand Up @@ -79,8 +79,6 @@ def start(self, serv):
for fat_serv in dependency_list:
if fat_serv not in self.service_list:
continue
if fat_serv in self.done_dict and self.done_dict[fat_serv] == True:
continue
self.start(fat_serv)

try_counts = 0
Expand Down Expand Up @@ -128,6 +126,4 @@ def run(self):
self.logger.warning("service.yaml can't be found on the directory of {0}".format(serv))
self.logger.warning("Please check your source code. The {0}'s service will be skipped.".format(serv))
continue
if serv in self.done_dict and self.done_dict[serv] == True:
continue
self.start(serv)
9 changes: 5 additions & 4 deletions docs/manual/cluster-admin/README.md
Expand Up @@ -14,7 +14,8 @@ This manual is for cluster administrators to learn the installation and uninstal
6. [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md)
7. [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md)
8. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
9. [Alerting-and-Troubleshooting](./alerting-and-troubleshooting.md)
10. [Recommended Practice](./recommended-practice.md)
11. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
12. [Upgrade Guide](./upgrade-guide.md)
9. [How to Customize Alerts](./how-to-customize-alerts.md)
10. [Troubleshooting](./troubleshooting.md)
11. [Recommended Practice](./recommended-practice.md)
12. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
13. [Upgrade Guide](./upgrade-guide.md)
165 changes: 165 additions & 0 deletions docs/manual/cluster-admin/how-to-customize-alerts.md
@@ -0,0 +1,165 @@
# How to Customize Alerts

OpenPAI supports the customization of alert rules and corresponding handling actions.
Alert rules are managed by the `prometheus` service, and the rules that match alerts to actions are managed by the `alert-manager` service.

By default, alerts are only displayed on the webportal.
You can customize `prometheus` and `alert-manager` to handle alerts in other ways.
For example, you can send emails to administrators and to the users related to an alert, or tag the affected jobs.

In this document, we introduce the existing alerts and actions, how they are matched, and how to add new customized alerts and actions.

## Existing Alerts/Actions & How to Match Them

### Existing Alerts

OpenPAI uses `Prometheus` to monitor system metrics.
We provide various alerts by defining rules on virtual clusters, GPU utilization, etc.
Once OpenPAI is deployed, you can visit `your_master_ip/prometheus/alerts` to see the details of the alerts, including their definitions and status.

For the alerting rules syntax, please refer to [Prometheus Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).

### Existing Actions

OpenPAI uses the `alert-manager` service for alert handling. The following actions are provided so far:
* webportal-notification: Show alerts on the home page of webportal (in the top-right corner).
* email-admin: Send emails to the assigned admin.
* email-user: Send emails to the owners of jobs.
* stop-jobs: Stop jobs by calling OpenPAI REST API.
* tag-jobs: Add a tag to jobs by calling OpenPAI REST API.

The action `webportal-notification` is always enabled, which means that all the alerts will be shown on the webportal.

All the other actions are implemented in `alert-handler`.
To make these actions available, administrators need to properly fill in the corresponding fields of `alert-manager` in `services-configuration.yaml`.
The list of available actions is then saved in `cluster_cfg["alert-manager"]["actions-available"]`. Please refer to [alert-manager config](https://github.com/microsoft/pai/tree/master/src/alert-manager/config/alert-manager.md) for details of the alert-manager service configuration.

Make sure `job_name` is present in the alert body if you want to use the `email-user`, `stop-jobs`, or `tag-jobs` actions.

### How to Match Alerts and Actions

The matching rules are defined using `receivers` and `routes`.
A `receiver` is simply a group of actions, and a `route` matches alerts to a specific `receiver`.

With the default configuration, all alerts match the default alert receiver, which triggers only the `email-admin` action.
To assign actions to alerts, you can add new receivers and the related routes in the `alert-manager` field of `services-configuration.yaml`.

For example:

``` yaml
customized-routes:
  routes:
  - receiver: pai-email-admin-user-and-stop-job
    match:
      alertname: PAIJobGpuPercentLowerThan0_3For1h
customized-receivers:
- name: "pai-email-admin-user-and-stop-job"
  actions:
  - email-admin
  - email-user
  - stop-jobs
  - tag-jobs
  tags:
  - 'stopped-by-alert-manager'
```
Here we define:
- a receiver `pai-email-admin-user-and-stop-job`, which contains the actions `email-admin`, `email-user`, `stop-jobs` and `tag-jobs`;
- a route, which matches the alert `PAIJobGpuPercentLowerThan0_3For1h` to the receiver `pai-email-admin-user-and-stop-job`.

As a consequence, when the alert `PAIJobGpuPercentLowerThan0_3For1h` is fired, all 4 of these actions will be triggered.
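
The matching described above can be pictured with a small Python sketch. It is a simplified model for illustration only, not Alertmanager's actual routing code: the names `ROUTES`, `RECEIVERS`, `select_receiver`, and `actions_for` are hypothetical, and the receiver name `default` is an assumption standing in for the default receiver that triggers only `email-admin`.

```python
from typing import Dict, List

# Routes and receivers copied from the example configuration above.
ROUTES = [
    {
        "receiver": "pai-email-admin-user-and-stop-job",
        "match": {"alertname": "PAIJobGpuPercentLowerThan0_3For1h"},
    },
]

RECEIVERS = {
    "pai-email-admin-user-and-stop-job": {
        "actions": ["email-admin", "email-user", "stop-jobs", "tag-jobs"],
        "tags": ["stopped-by-alert-manager"],
    },
    # Assumed name for the default receiver (only `email-admin`).
    "default": {"actions": ["email-admin"], "tags": []},
}


def select_receiver(labels: Dict[str, str]) -> str:
    """Return the first receiver whose `match` labels all equal the alert's labels."""
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return "default"


def actions_for(labels: Dict[str, str]) -> List[str]:
    """List the actions triggered for an alert carrying the given labels."""
    return RECEIVERS[select_receiver(labels)]["actions"]
```

In this model, an alert labelled `alertname: PAIJobGpuPercentLowerThan0_3For1h` resolves to all four actions, while any other alert falls back to `email-admin` only.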

For `routes` definition, we adopt the syntax of [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/configuration/).
For `receivers` definition, you can simply:
- name the receiver in `name` field;
- list the actions to use in `actions`;
- list the tags in `tags` if `tag-jobs` is one of the actions.

Remember to push service config to the cluster and restart the `alert-manager` service after your modification with the following commands in the dev-box container:
```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

For alert & action matching rules syntax, please refer to [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/configuration/).

For OpenPAI service management, please refer to [Basic Management Operations](https://github.com/microsoft/pai/blob/master/docs/manual/cluster-admin/basic-management-operations.md).


## How to Add Customized Alerts

You can define customized alerts in the `prometheus` field of `services-configuration.yaml`.
For example, we can add a customized alert `PAIJobGpuPercentLowerThan0_3For1h` by adding:

``` yaml
customized-alerts: |
  groups:
  - name: customized-alerts
    rules:
    - alert: PAIJobGpuPercentLowerThan0_3For1h
      expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3
      for: 1h
      annotations:
        summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
        description: Monitor job level gpu utilization in certain virtual clusters.
```

The `PAIJobGpuPercentLowerThan0_3For1h` alert will be fired when a job on the virtual cluster `default` has an average task-level GPU percent lower than `30%` for more than `1 hour`.
The alert uses the metric `task_gpu_percent`, which describes GPU utilization at the task level.
You can explore the system metrics at `your_master_ip/prometheus/graph`.
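
To make the rule's expression more concrete, here is a rough Python sketch of what `avg(...) by (job_name) < 0.3` computes at a single evaluation instant. It is an illustration under assumptions, not how Prometheus evaluates PromQL: the function `jobs_below_threshold` and the tuple layout of `samples` are hypothetical, the `for: 1h` clause (the condition must keep holding for an hour before the alert fires) is not modeled, and the threshold is compared literally as `0.3` without assuming the metric's unit.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple

# One sample per task: (job_name, virtual_cluster, task_gpu_percent).
Sample = Tuple[str, str, float]


def jobs_below_threshold(
    samples: Iterable[Sample],
    threshold: float = 0.3,
    virtual_cluster: str = "default",
) -> List[str]:
    """Average the task samples per job and return the jobs under the threshold."""
    per_job = defaultdict(list)
    for job_name, cluster, value in samples:
        # Mirrors the label selector {virtual_cluster=~"default"}.
        if cluster == virtual_cluster:
            per_job[job_name].append(value)
    # Mirrors `avg(...) by (job_name) < threshold`.
    return sorted(job for job, values in per_job.items() if mean(values) < threshold)
```

A job whose tasks average below the threshold is returned; jobs on other virtual clusters are ignored because the label selector filters them out before averaging.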

Remember to push service config to the cluster and restart the `prometheus` service after your modification with the following commands in the dev-box container:
```bash
./paictl.py service stop -n prometheus
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n prometheus
```

Please refer to [Prometheus Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) for alerting rules syntax.

## How to Add Customized Actions

If you want to add new customized actions, follow these steps:

### Implement the action in `alert-handler`
We provide `alert-handler` as a lightweight `express` application, where you can add customized APIs easily.

For example, the `stop-jobs` action is implemented by calling the `localhost:9095/alert-handler/stop-jobs` API through `webhook`;
the request is then forwarded to the OpenPAI REST server to stop the job.
You can add new APIs in `alert-handler` and adapt the request to implement the required action.

The source code of `alert-handler` is available [here](https://github.com/microsoft/pai/blob/master/src/alert-manager/src/alert-handler).

### Check the dependencies of the action

As stated before, to make an action available, administrators need to provide the necessary configurations.
Check this [folder](https://github.com/microsoft/pai/tree/master/src/alert-manager/config) and define the dependencies' rules for your customized actions.


### Render the action to webhook configurations

When customized receivers are defined in `services-configuration.yaml`,
the `actions` are rendered as `webhook_configs` [here](https://github.com/microsoft/pai/blob/master/src/alert-manager/deploy/alert-manager-configmap.yaml.template).

The actions we provide, `email-admin`, `email-user`, `stop-jobs`, `tag-jobs`, can be called within `alert-manager` by sending POST requests to `alert-handler`:
- `localhost:{your_alert_handler_port}/alert-handler/send-email-to-admin`
- `localhost:{your_alert_handler_port}/alert-handler/send-email-to-user`
- `localhost:{your_alert_handler_port}/alert-handler/stop-jobs`
- `localhost:{your_alert_handler_port}/alert-handler/tag-jobs/:tag`

The request body will be automatically filled by `alert-manager` with `webhook`
and `alert-handler` will adapt the requests to various actions.
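
As a rough illustration of this rendering step, the sketch below maps a receiver's `actions` and `tags` to the `alert-handler` endpoints listed above. It is not the actual configmap template: the function `webhook_urls` and the table `ACTION_PATHS` are hypothetical names, emitting one `tag-jobs` URL per tag is an assumption, and the default port `9095` is taken from the sample configuration.

```python
from typing import Iterable, List

# Endpoint path for each built-in action, as listed above.
ACTION_PATHS = {
    "email-admin": "send-email-to-admin",
    "email-user": "send-email-to-user",
    "stop-jobs": "stop-jobs",
}


def webhook_urls(
    actions: Iterable[str],
    tags: Iterable[str] = (),
    port: int = 9095,
) -> List[str]:
    """Return the alert-handler URLs that a receiver's actions would POST to."""
    base = f"localhost:{port}/alert-handler"
    tag_list = list(tags)
    urls = []
    for action in actions:
        if action == "tag-jobs":
            # Assumption: one webhook URL per tag, using the /tag-jobs/:tag route.
            urls.extend(f"{base}/tag-jobs/{tag}" for tag in tag_list)
        elif action in ACTION_PATHS:
            urls.append(f"{base}/{ACTION_PATHS[action]}")
        else:
            raise ValueError(f"unknown action: {action}")
    return urls
```

A customized action would need its own entry in such a mapping, which is the kind of rendering rule the next step asks you to define.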

Please define how to render your customized action to the `alert-handler` API request
[here](https://github.com/microsoft/pai/blob/master/src/alert-manager/src/alert-handler).

Remember to rebuild and push the docker image, and restart the `alert-manager` service after your modification with the following commands in the dev-box container:

```bash
./build/pai_build.py build -c /cluster-configuration/ -s alert-manager
./build/pai_build.py push -c /cluster-configuration/ -i alert-handler
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
Binary file not shown.
@@ -1,45 +1,6 @@
# Alerting and Troubleshooting
# Troubleshooting

OpenPAI uses [Prometheus](https://prometheus.io/) to monitor the system. You can view the monitoring information [on webportal](./basic-management-operations.md#management-on-webportal). For alerting, OpenPAI uses [alert manager](https://prometheus.io/docs/alerting/latest/alertmanager/), but it is not set up in default installation. This document describes how to set up alert manager, and how to deal with some common alerts. It also includes some other troubleshooting cases in practice.

## Set Up Alert Manager

OpenPAI's alert manager is set to send alerting e-mails when alert happens. To begin with, you should get an SMTP account to send these e-mails.

After getting an SMTP account, how to set up the alert manager in PAI? Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:

```bash
./paictl config pull -o /cluster-configuration
```

Uncomment the alert manager section in `/cluster-configuration/services-configuration.yaml`, and set your SMTP account and the receiver's e-mail address. Here is an example:

```bash
alert-manager:
port: 9093
receiver: <receiver-email-address>
smtp_auth_password: <smtp-password>
smtp_auth_username: <smtp-username>
smtp_from: <smtp-email-address>
smtp_url: <smtp-server>:<smtp-port>
```

Configuration `port` stands for the port of alert manager. In most cases, you don't need to change it. Configuration `receiver` is usually set to be the administrator's e-mail address to receive alerting e-mails.

Save the configuration file, and start alert manager by:

```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

After alert manager is successfully started, the receiver's e-mail address will receive alerting e-mails from the SMTP account. Also, you can view the alerting information on Webportal (in the top-right corner):

<img src="./imgs/alert-on-webportal.png" width="100%" height="100%" />


## Troubleshooting
This document includes some troubleshooting cases in practice.

### PaiServicePodNotReady Alert

20 changes: 13 additions & 7 deletions examples/cluster-configuration/services-configuration.yaml
Expand Up @@ -103,15 +103,21 @@ rest-server:
# node-exporter:
# port: 9100

# if you want to enable alert manager to send alert email, uncomment following lines and fill
# if you want to enable alert manager to send alert email to admin, uncomment following lines and fill
# the right values.
# alert-manager:
# receiver: [email protected]
# smtp_url: smtp.office365.com:587
# smtp_from: [email protected]
# smtp_auth_username: [email protected]
# smtp_auth_password: password_for_alert_sender
# port: 9093 # this is optional, you should not write this if you do not want to change the port alert-manager is listening on
# port: 9093 # do not change this if you do not want to change the port alert-manager is listening on
# alert-handler: # alert-handler will only be enabled when this field is not empty
# port: 9095 # do not change this if you do not want to change the port alert-handler is listening on
# log-level: "info"
# pai-bearer-token: 'your-application-token-for-pai-rest-server' # required if you want to send email to job users
# email-configs: # email-notification will only be enabled when this field is not empty
# admin-receiver: [email protected]
# smtp-host: smtp.office365.com
# smtp-port: 587
# smtp-from: [email protected]
# smtp-auth-username: [email protected]
# smtp-auth-password: password-for-alert-sender

# uncomment following if you want to change customize prometheus
# prometheus:
3 changes: 2 additions & 1 deletion mkdocs.yml
Expand Up @@ -22,7 +22,8 @@ nav:
- How to Set Up Virtual Clusters: manual/cluster-admin/how-to-set-up-virtual-clusters.md
- How to Add and Remove Nodes: manual/cluster-admin/how-to-add-and-remove-nodes.md
- How to Customize Cluster by Plugins: manual/cluster-admin/how-to-customize-cluster-by-plugins.md
- Alerting and Troubleshooting: manual/cluster-admin/alerting-and-troubleshooting.md
- How to Customize Alerts: manual/cluster-admin/how-to-customize-alerts.md
- Troubleshooting: manual/cluster-admin/troubleshooting.md
- Recommended Practice: manual/cluster-admin/recommended-practice.md
- How to Uninstall OpenPAI: manual/cluster-admin/how-to-uninstall-openpai.md
- Upgrade Guide: manual/cluster-admin/upgrade-guide.md
2 changes: 2 additions & 0 deletions src/alert-manager/.gitignore
@@ -0,0 +1,2 @@
# Dependency directories
node_modules/