
Commit

Merge pull request #2676 from Microsoft/errorspec
[Backend]: PAI Job Error Spec, Collection and Handling - Phase II
yqwang-ms authored Apr 26, 2019
2 parents cb7d1a9 + 90410c9 commit b232486
Showing 80 changed files with 3,863 additions and 956 deletions.
5 changes: 5 additions & 0 deletions docs/job_tutorial.md
@@ -31,6 +31,7 @@ This guide introduces the details of job submission.
- [Environment variables](#environment-variables)
- [A complete example](#a-complete-example)
- [Learn more job examples](#learn-more-job-examples)
- [Job exit spec](#job-exit-spec)

## Quick start: submit a hello-world job

@@ -240,3 +241,7 @@ A distributed TensorFlow job is listed below as an example:
## Learn more job examples

For more examples, please refer to [job examples directory](../examples).

## Job exit spec

For the **specification of each job exitcode**, please refer to [PAI Job Exit Spec User Manual](../src/job-exit-spec/config/user-manual.md).
17 changes: 15 additions & 2 deletions docs/rest-server/API.md
@@ -456,8 +456,21 @@ Status: 200
appLaunchedTime: "applicationLaunchedTimestamp",
appCompletedTime: "applicationCompletedTimestamp",
appExitCode: applicationExitCode,
appExitDiagnostics: "applicationExitDiagnostics"
appExitType: "applicationExitType"
// please check https://github.com/Microsoft/pai/blob/master/src/job-exit-spec/config/job-exit-spec.md for more information
appExitSpec: exitSpecObject,
appExitTriggerMessage: "applicationExitTriggerMessage",
appExitTriggerTaskRoleName: "applicationExitTriggerTaskRoleName",
appExitTriggerTaskIndex: "applicationExitTriggerTaskIndex",
appExitDiagnostics: "applicationExitDiagnostics",
// exit messages extracted from exitDiagnostics
appExitMessages: {
container: "containerStderr",
// please check https://github.com/Microsoft/pai/blob/master/docs/rest-server/runtime-exit-spec.md for more information
runtime: runtimeScriptErrorObject,
launcher: "launcherExitMessage"
},
// appExitType is deprecated, please use appExitSpec instead.
appExitType: "applicationExitType",
},
taskRoles: {
// Name-details map
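
For illustration, a minimal client-side sketch that reads these new exit fields; the endpoint path, the `jobStatus` wrapper key, and the bearer-token auth header are assumptions based on the surrounding API doc, not part of this change:

```
# Hedged sketch: fetch a job's status and print the new exit fields.
# The endpoint path and auth scheme are assumptions; adjust to your deployment.
import requests

def print_exit_info(rest_server_uri, username, job_name, token):
    url = "{0}/api/v1/user/{1}/jobs/{2}".format(rest_server_uri, username, job_name)
    resp = requests.get(url, headers={"Authorization": "Bearer " + token})
    resp.raise_for_status()
    job_status = resp.json()["jobStatus"]

    # appExitSpec carries the static spec entry matched by appExitCode.
    print("exit code:", job_status.get("appExitCode"))
    print("exit spec:", job_status.get("appExitSpec"))

    # appExitMessages holds the messages extracted from appExitDiagnostics.
    messages = job_status.get("appExitMessages") or {}
    print("container stderr:", messages.get("container"))
    print("runtime exit info:", messages.get("runtime"))
    print("launcher message:", messages.get("launcher"))
```
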
71 changes: 71 additions & 0 deletions docs/rest-server/runtime-exit-spec.md
@@ -0,0 +1,71 @@
# PAI Runtime Exit Spec
This document describes the **PAI Dynamic Exit Info**; for the **PAI Static Exit Info**, see the [PAI Job Exit Spec User Manual](../../src/job-exit-spec/config/user-manual.md).

## PAI Runtime as Error Funnel
PAI Runtime is the **only interface** for interacting with a PAI Container.

So it is also the **one and only component that produces** the **correct** and **final** PAI Container ExitCode and Dynamic ExitInfo.

(*Upper-layer centralized services generally should no longer do their own error detection on the Runtime's return value to produce yet another ExitCode.*)

For example,

if another component needs to kill a Container, it should **notify the Runtime to kill** the Container along with the **kill reason, etc., instead of killing the Container by itself.**

Similarly, the Runtime needs to **take responsibility for executing the kill action**, and then **forward the kill reason (see below)** to YARN through the Dynamic ExitInfo, **instead of forwarding all local error logs to YARN as the reason (see below)**.

However, it will also **do its best to forward all local error logs as an attachment**, and given the **limited total size of the Dynamic ExitInfo** (such as 4KB), it is **the single arbiter that decides which error log file to forward more of, which to forward less of, and which part of each error log to forward (tail, middle, etc.)**.

(*For example, while the Runtime is killing a Container, the container itself may produce exit error messages that are generally not very helpful for explaining the error reason, so they are good candidates to sacrifice in favor of other error logs.*)

(*This cannot be achieved by changing YARN code, which can only collect the tails of all error logs evenly: YARN is not aware of which error occurred, so it cannot decide which error log should be forwarded more of.*)

<p style="text-align: left;">
<img src="runtime-exit-spec.png" title="Architecture" alt="Architecture" />
</p>
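
Below is a minimal sketch of the budget idea described above, assuming a fixed byte budget and an illustrative priority order over the local error files; it is not the actual Runtime implementation:

```
# Hedged sketch: spend a limited Dynamic ExitInfo budget across local error logs,
# giving higher-priority files first claim and keeping only each file's tail.
import os

def collect_error_logs(log_dir, budget_bytes=4 * 1024):
    # Illustrative priority order, mirroring the errorLogs fields in the schema below:
    # diskCleaner, runtime, then the user's stderr.
    priorities = ["diskCleaner.pai.error", "runtime.pai.error", "stderr"]
    collected = {}
    for name in priorities:
        path = os.path.join(log_dir, name)
        if budget_bytes <= 0 or not os.path.isfile(path):
            continue
        with open(path, "rb") as fp:
            data = fp.read()
        # Keep the tail, which usually carries the final error, within the remaining budget.
        snippet = data[-budget_bytes:]
        budget_bytes -= len(snippet)
        collected[name] = snippet.decode("utf-8", errors="replace")
    return collected
```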

## Design
1. Write it in a user-understandable way
2. Write it into the `runtime.pai.agg.error` file, which contains only the YAML object of the Dynamic ExitInfo
3. Limit the size of the `runtime.pai.agg.error` file
4. Write landmark lines around the schema object string to help RestServer locate it:

```
[PAI_RUNTIME_ERROR_START]
YAML String of PAI Runtime Exit Info
[PAI_RUNTIME_ERROR_END]
```
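
A minimal write-side sketch following these rules is shown below; the size limit and the fallback of dropping the bulky `errorLogs` attachment first are illustrative assumptions:

```
# Hedged sketch: dump the Dynamic ExitInfo as YAML between the landmark lines,
# shrinking it if it would exceed the size limit so YARN can still tail the file.
import yaml

def write_agg_error(path, exit_info, limit_bytes=4 * 1024):
    body = yaml.safe_dump(exit_info, default_flow_style=False)
    if len(body.encode("utf-8")) > limit_bytes:
        # Illustrative fallback: sacrifice the errorLogs attachment before anything else.
        exit_info = dict(exit_info, errorLogs=None)
        body = yaml.safe_dump(exit_info, default_flow_style=False)
    with open(path, "w") as fp:
        fp.write("[PAI_RUNTIME_ERROR_START]\n")
        fp.write(body)
        fp.write("[PAI_RUNTIME_ERROR_END]\n")
```
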
## PAI Runtime Exit Info Schema
```
exitcode:
reason:
solution:
trigger:
originalUserExitCode:
caughtSignal:
caughtException:
matchedUserLogString:
dockerStatus:
errorLogs:
user:
runtime:
diskCleaner:
```
|field|description|example|
|-----|-----------|--------|
| **exitcode** | PAI ExitCode | 198, i.e. [CONTAINER_OOD_KILLED_BY_DISKCLEANER](../../src/job-exit-spec/config/job-exit-spec.md) |
| **reason** | Dynamic Reason | Current Container is killed due to it used most disk space and all container disk usage exceeded platform limit.<br>Current Container Disk Usage 30% (30GB), All Container Disk Usage 91% (91GB), Platform Disk Limit 90% (90GB). |
| **solution** | Dynamic Solution | All Container overused 1GB disk, but current container only use 30% disk, please retry on another node later. |
| **trigger** | Which error pattern triggered the error.<br>Should be one of the field names below. | diskCleanerErrorLog |
| **originalUserExitCode** | The original user process exit code.<br>Only shown to the user. It is preserved here instead of being exited with, in case it conflicts with the Runtime's own ExitCode Spec. | 1 |
| **caughtSignal** | Caught Signal | SIGKILL |
| **caughtException** | Caught Exception | TensorFlow StackTrace:... |
| **matchedUserLogString** | Matched user log string | RunScripts.sh: line 58: [: -gt: unary operator expected |
| **dockerStatus** | Docker Status | KILLED |
| **errorLogs** | The Runtime collects all local *.pai.error and error files and redirects them here. (This is best effort, because the runtime.pai.agg.error file size is limited.) | |
| **errorLogs.user** | From user stderr | 234346457 ERROR "get: `/dgfdg': No such file or directory"<br>234346458 ERROR "Tensorflow failed to init, exiting…" |
| **errorLogs.runtime** | From runtime.pai.error | 234346457 ERROR REASON "get: `/{user code dir}': No such file or directory" |
| **errorLogs.diskCleaner** | From diskCleaner.pai.error |234346457 ERROR ACTION "KILL"<br>234346457 ERROR REASON "Current Container is killed due to it used most disk space and all container disk usage exceeded platform limit. Current Container Disk Usage 30% (30GB), All Container Disk Usage 91% (91GB), Platform Disk Limit 90% (90GB)."<br>234346457 ERROR SOLUTION "All Container overused 1GB disk, but current container only use 30% disk, please retry on another node later."|
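
For illustration, a hedged sketch of how a consumer such as RestServer might locate and parse this object inside the aggregated diagnostics, using the landmark lines from the Design section; the function and variable names are hypothetical:

```
# Hedged sketch: extract the YAML object between the landmark lines and parse it.
import yaml

START_MARK = "[PAI_RUNTIME_ERROR_START]"
END_MARK = "[PAI_RUNTIME_ERROR_END]"

def parse_runtime_exit_info(diagnostics):
    start = diagnostics.find(START_MARK)
    end = diagnostics.find(END_MARK, start)
    if start == -1 or end == -1:
        return None  # No runtime exit info embedded in the diagnostics.
    body = diagnostics[start + len(START_MARK):end]
    # The trigger field then names which errorLogs entry explains the exit,
    # e.g. "diskCleanerErrorLog" in the example above.
    return yaml.safe_load(body)
```
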
Binary file added docs/rest-server/runtime-exit-spec.png
5 changes: 5 additions & 0 deletions src/cleaner/deploy/cleaner.yaml.template
@@ -42,6 +42,8 @@ spec:
volumeMounts:
- mountPath: /var/run/docker.sock
name: docker-socket
- mountPath: /logs
name: cleaner-logs
{%- if cluster_cfg['cluster']['common']['qos-switch'] == "true" %}
resources:
limits:
@@ -53,6 +55,9 @@ spec:
- name: docker-socket
hostPath:
path: /var/run/docker.sock
- name: cleaner-logs
hostPath:
path: {{ cluster_cfg["cluster"]["common"]["data-path"] }}/yarn/node/userlogs
tolerations:
- key: node.kubernetes.io/memory-pressure
operator: "Exists"
61 changes: 49 additions & 12 deletions src/cleaner/scripts/clean_docker.py
@@ -23,6 +23,7 @@
import multiprocessing
import re
import time
import os

class DockerCleaner(LoggerMixin):
def __init__(self, threshold, interval, timeout=timedelta(hours=1)):
@@ -54,44 +55,80 @@ def run(self):

def check_disk_usage(self, partition):
df = subprocess.Popen(["df","-h", partition], stdout=subprocess.PIPE)
size = 0
sized = 0
try:
for line in df.stdout:
splitline = line.decode().split()
if splitline[5] == partition:
size = int(splitline[4][:-1])
sized = splitline[1]
used = splitline[2]
usep = int(splitline[4][:-1])
except ValueError:
self.logger.error("cannot get disk size, reset size to 0")
size = 0
self.logger.info("Checking disk, disk usage = {0}%".format(size))
return size
sized = 0
used = 0
usep = 0
self.logger.info("Checking disk, disk usage = {0}%".format(usep))
return sized, used, usep


def check_and_clean(self):
if self.check_disk_usage("/") >= self.__threshold:
sized, used, usep = self.check_disk_usage("/")
if usep >= self.__threshold:
self.logger.info("Disk usage is above {0}%, Try to remove containers".format(self.__threshold))
self.kill_largest_container()
self.kill_largest_container(sized, used, usep)


# Clean logic v1: kill largest container
white_list = ["k8s_POD", "k8s_kube", "k8s_pylon", "k8s_zookeeper", "k8s_rest-server", "k8s_yarn", "k8s_hadoop", "k8s_job-exporter", "k8s_watchdog", "k8s_grafana", "k8s_node-exporter", "k8s_webportal", "k8s_prometheus", "k8s_nvidia-drivers", "k8s_etcd-container", "k8s_apiserver-container", "k8s_docker-cleaner", "kubelet", "dev-box"]
def kill_largest_container(self):
def kill_largest_container(self, sized, used, usep):
containers = []
# Only try to stop PAI jobs and user created containers
containers_source = subprocess.Popen(["docker", "ps", "-a", "--format", r'{{.ID}}\t{{.Image}}\t{{.Size}}\t{{.Names}}'], stdout=subprocess.PIPE)
containers_source = subprocess.Popen(["docker", "ps", "-a", "--format", r'{{.ID}}\t{{.Image}}\t{{.Size}}\t{{.Names}}\t'], stdout=subprocess.PIPE)
for line in containers_source.stdout:
splitline = line.split("\t")
for prefix in self.white_list:
if (splitline[3].startswith(prefix)):
break
else:
size = common.calculate_size(splitline[2].split()[0])
containers.append([size, splitline[0], splitline[1]])
# Only check job containers
if re.search(r"container(_\w+)?_\d+_\d+_\d+_\d+$", splitline[3]) is not None:
size_str = splitline[2].split()[0]
size = common.calculate_size(size_str)
containers.append([size, splitline[0], splitline[1], splitline[3], size_str])

containers.sort(key=lambda x:x[0], reverse=True)

if len(containers) > 0 and containers[0][0] > 1024**3:
self.logger.warning("Kill container {0} due to disk pressure. Container size: {1}".format(containers[0][1], containers[0][0]))
self.logger.warning("Kill container {0} due to disk pressure. Container size: {1}".format(containers[0][3], containers[0][4]))

# Write error log
# Derive the YARN container and application names from the docker container name,
# e.g. container_e12_1556000000000_0001_01_000002 -> application_1556000000000_0001,
# so the error file lands in that container's log directory under /logs.
container_name = re.search(r"container(_\w+)?_\d+_\d+_\d+_\d+$", containers[0][3]).group()
application_name = "application{0}".format(re.search(r"^_\d+_\d+", re.search(r"_\d+_\d+_\d+_\d+$", container_name).group()).group())
full_path = "/logs/{0}/{1}".format(application_name, container_name)

if not os.path.isdir(full_path):
self.logger.error("Cannot find job log dir, creating path. Log may not be collected.")
try:
os.makedirs(full_path)
except OSError as exc:
self.logger.error("Failed to create path {0}.".format(full_path))

if os.path.isdir(full_path):
error_filename = "{0}/diskCleaner.pai.error".format(full_path)
timestamp = int(time.time())
try:
fp = open(error_filename, "w")
except IOError:
self.logger.error("Failed to write error log, skipped")
else:
fp.writelines([
"{0} ERROR ACTION \"KILL\"\n".format(timestamp),
"{0} ERROR REASON \"{1} killed due to disk pressure. Disk size: {2}, Used: {3}, Cleaner threshold: {4}, Container cost: {5} \"\n".format(timestamp, container_name, sized, "{0}({1}%)".format(used, usep), "{0}%".format(self.__threshold), containers[0][4]),
"{0} ERROR SOLUTION \"Node disk is full, please try another time. If your job needs large space, please use NAS to store data.\"\n".format(timestamp)
])
fp.close()

subprocess.Popen(["docker", "kill", "--signal=10", containers[0][1]])

# Because docker stop will not immediately stop the container, we cannot remove the docker image right after stopping it
51 changes: 50 additions & 1 deletion src/hadoop-batch-job/deploy/hadoop-configuration/yarn-site.xml
@@ -151,7 +151,56 @@
<value>true</value>
</property>

<property>
<property>
<description>
Error filename pattern, to identify the file in the container's
log directory which contains the container's error log. Error file
redirection is done by the client/AM, so YARN is not aware of the error
file name. YARN uses this pattern to identify the error file and tails
the error log as diagnostics when the container execution returns a non-zero
value. Filename patterns are case sensitive and should match the
specifications of the FileSystem.globStatus(Path) API. If multiple filenames
match the pattern, the first matching file will be picked.
</description>
<name>yarn.nodemanager.container.stderr.pattern</name>
<value>runtime.pai.agg.error</value>
</property>
<property>
<description>
Size of the container error file which needs to be tailed, in bytes.
To avoid unexpected diagnostics truncation:
40KB = 10 * default(4096)
</description>
<name>yarn.nodemanager.container.stderr.tail.bytes</name>
<value>40960</value>
</property>
<property>
<description>
Maximum size of a container's diagnostics to keep for the container relaunch case.
To avoid unexpected diagnostics truncation:
100KB = 10 * default(10000)
</description>
<name>yarn.nodemanager.container-diagnostics-maximum-size</name>
<value>100000</value>
</property>
<property>
<description>
Defines the limit of the diagnostics message of an application
attempt, in kilo characters (character count * 1024).
When using ZooKeeper to store application state, it's
important to limit the size of the diagnostic messages to
prevent YARN from overwhelming ZooKeeper. In cases where
yarn.resourcemanager.state-store.max-completed-applications is set to
a large number, it may be desirable to reduce the value of this property
to limit the total data stored.
To avoid unexpected diagnostics truncation:
640KB = 10 * default(64)
</description>
<name>yarn.app.attempt.diagnostics.limit.kc</name>
<value>640</value>
</property>

<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>true</value>
</property>
@@ -128,6 +128,55 @@
<value>true</value>
</property>

<property>
<description>
Error filename pattern, to identify the file in the container's
log directory which contains the container's error log. Error file
redirection is done by the client/AM, so YARN is not aware of the error
file name. YARN uses this pattern to identify the error file and tails
the error log as diagnostics when the container execution returns a non-zero
value. Filename patterns are case sensitive and should match the
specifications of the FileSystem.globStatus(Path) API. If multiple filenames
match the pattern, the first matching file will be picked.
</description>
<name>yarn.nodemanager.container.stderr.pattern</name>
<value>runtime.pai.agg.error</value>
</property>
<property>
<description>
Size of the container error file which needs to be tailed, in bytes.
To avoid unexpected diagnostics truncation:
40KB = 10 * default(4096)
</description>
<name>yarn.nodemanager.container.stderr.tail.bytes</name>
<value>40960</value>
</property>
<property>
<description>
Maximum size of a container's diagnostics to keep for the container relaunch case.
To avoid unexpected diagnostics truncation:
100KB = 10 * default(10000)
</description>
<name>yarn.nodemanager.container-diagnostics-maximum-size</name>
<value>100000</value>
</property>
<property>
<description>
Defines the limit of the diagnostics message of an application
attempt, in kilo characters (character count * 1024).
When using ZooKeeper to store application state, it's
important to limit the size of the diagnostic messages to
prevent YARN from overwhelming ZooKeeper. In cases where
yarn.resourcemanager.state-store.max-completed-applications is set to
a large number, it may be desirable to reduce the value of this property
to limit the total data stored.
To avoid unexpected diagnostics truncation:
640KB = 10 * default(64)
</description>
<name>yarn.app.attempt.diagnostics.limit.kc</name>
<value>640</value>
</property>

<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>true</value>

