
Commit

Merge pull request #2676 from Microsoft/errorspec
[Backend]: PAI Job Error Spec, Collection and Handling - Phase II
yqwang-ms authored Apr 26, 2019
2 parents cb7d1a9 + 90410c9 commit b232486
Showing 80 changed files with 3,863 additions and 956 deletions.
5 changes: 5 additions & 0 deletions docs/job_tutorial.md
@@ -31,6 +31,7 @@ This guide introduces the details of job submission.
- [Environment variables](#environment-variables)
- [A complete example](#a-complete-example)
- [Learn more job examples](#learn-more-job-examples)
- [Job exit spec](#job-exit-spec)

## Quick start: submit a hello-world job

@@ -240,3 +241,7 @@ A distributed TensorFlow job is listed below as an example:
## Learn more job examples

For more examples, please refer to [job examples directory](../examples).

## Job exit spec

For the **specification of each job exitcode**, please refer to [PAI Job Exit Spec User Manual](../src/job-exit-spec/config/user-manual.md).
17 changes: 15 additions & 2 deletions docs/rest-server/API.md
@@ -456,8 +456,21 @@ Status: 200
appLaunchedTime: "applicationLaunchedTimestamp",
appCompletedTime: "applicationCompletedTimestamp",
appExitCode: applicationExitCode,
appExitDiagnostics: "applicationExitDiagnostics"
appExitType: "applicationExitType"
// please check https://github.com/Microsoft/pai/blob/master/src/job-exit-spec/config/job-exit-spec.md for more information
appExitSpec: exitSpecObject,
appExitTriggerMessage: "applicationExitTriggerMessage",
appExitTriggerTaskRoleName: "applicationExitTriggerTaskRoleName",
appExitTriggerTaskIndex: "applicationExitTriggerTaskIndex",
appExitDiagnostics: "applicationExitDiagnostics",
// exit messages extracted from exitDiagnostics
appExitMessages: {
container: "containerStderr",
// please check https://github.com/Microsoft/pai/blob/master/docs/rest-server/runtime-exit-spec.md for more information
runtime: runtimeScriptErrorObject,
launcher: "launcherExitMessage"
},
// appExitType is deprecated, please use appExitSpec instead.
appExitType: "applicationExitType",
},
taskRoles: {
// Name-details map
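
For illustration, a minimal client-side sketch that reads these new exit fields; the endpoint path, the `jobStatus` wrapper key, and the bearer-token auth header are assumptions based on the surrounding API doc, not part of this change:

```
# Hedged sketch: fetch a job's status and print the new exit fields.
# The endpoint path and auth scheme are assumptions; adjust to your deployment.
import requests

def print_exit_info(rest_server_uri, username, job_name, token):
    url = "{0}/api/v1/user/{1}/jobs/{2}".format(rest_server_uri, username, job_name)
    resp = requests.get(url, headers={"Authorization": "Bearer " + token})
    resp.raise_for_status()
    job_status = resp.json()["jobStatus"]

    # appExitSpec carries the static spec entry matched by appExitCode.
    print("exit code:", job_status.get("appExitCode"))
    print("exit spec:", job_status.get("appExitSpec"))

    # appExitMessages holds the messages extracted from appExitDiagnostics.
    messages = job_status.get("appExitMessages") or {}
    print("container stderr:", messages.get("container"))
    print("runtime exit info:", messages.get("runtime"))
    print("launcher message:", messages.get("launcher"))
```
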
71 changes: 71 additions & 0 deletions docs/rest-server/runtime-exit-spec.md
@@ -0,0 +1,71 @@
# PAI Runtime Exit Spec
This document describes the **PAI Dynamic Exit Info**; for the **PAI Static Exit Info**, see the [PAI Job Exit Spec User Manual](../../src/job-exit-spec/config/user-manual.md).

## PAI Runtime as Error Funnel
PAI Runtime is the **only interface** for interacting with a PAI Container.

So it is also the **one and only component that produces** the **correct** and **final** PAI Container ExitCode and Dynamic ExitInfo.

(*Upper-layer centralized services generally should no longer do their own error detection on the Runtime's return value to produce yet another ExitCode.*)

For example,

if another component needs to kill a Container, it should **notify the Runtime to kill** the Container along with the **kill reason, etc., instead of killing the Container by itself.**

Similarly, the Runtime needs to **take responsibility for executing the kill action**, and then **forward the kill reason (see below)** to YARN through the Dynamic ExitInfo, **instead of forwarding all local error logs to YARN as the reason (see below)**.

However, it will also **do its best to forward all local error logs as an attachment**, and given the **limited total size of the Dynamic ExitInfo** (such as 4KB), it is **the single arbiter that decides which error log file to forward more of, which to forward less of, and which part of each error log to forward (tail, middle, etc.)**.

(*For example, while the Runtime is killing a Container, the container itself may produce exit error messages that are generally not very helpful for explaining the error reason, so they are good candidates to sacrifice in favor of other error logs.*)

(*This cannot be achieved by changing YARN code, which can only collect the tails of all error logs evenly: YARN is not aware of which error occurred, so it cannot decide which error log should be forwarded more of.*)

<p style="text-align: left;">
<img src="runtime-exit-spec.png" title="Architecture" alt="Architecture" />
</p>
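
Below is a minimal sketch of the budget idea described above, assuming a fixed byte budget and an illustrative priority order over the local error files; it is not the actual Runtime implementation:

```
# Hedged sketch: spend a limited Dynamic ExitInfo budget across local error logs,
# giving higher-priority files first claim and keeping only each file's tail.
import os

def collect_error_logs(log_dir, budget_bytes=4 * 1024):
    # Illustrative priority order, mirroring the errorLogs fields in the schema below:
    # diskCleaner, runtime, then the user's stderr.
    priorities = ["diskCleaner.pai.error", "runtime.pai.error", "stderr"]
    collected = {}
    for name in priorities:
        path = os.path.join(log_dir, name)
        if budget_bytes <= 0 or not os.path.isfile(path):
            continue
        with open(path, "rb") as fp:
            data = fp.read()
        # Keep the tail, which usually carries the final error, within the remaining budget.
        snippet = data[-budget_bytes:]
        budget_bytes -= len(snippet)
        collected[name] = snippet.decode("utf-8", errors="replace")
    return collected
```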

## Design
1. Write it in a user-understandable way
2. Write it into the `runtime.pai.agg.error` file, which contains only the YAML object of the Dynamic ExitInfo
3. Limit the size of the `runtime.pai.agg.error` file
4. Write landmark lines around the schema object string to help RestServer locate it:

```
[PAI_RUNTIME_ERROR_START]
YAML String of PAI Runtime Exit Info
[PAI_RUNTIME_ERROR_END]
```
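
A minimal write-side sketch following these rules is shown below; the size limit and the fallback of dropping the bulky `errorLogs` attachment first are illustrative assumptions:

```
# Hedged sketch: dump the Dynamic ExitInfo as YAML between the landmark lines,
# shrinking it if it would exceed the size limit so YARN can still tail the file.
import yaml

def write_agg_error(path, exit_info, limit_bytes=4 * 1024):
    body = yaml.safe_dump(exit_info, default_flow_style=False)
    if len(body.encode("utf-8")) > limit_bytes:
        # Illustrative fallback: sacrifice the errorLogs attachment before anything else.
        exit_info = dict(exit_info, errorLogs=None)
        body = yaml.safe_dump(exit_info, default_flow_style=False)
    with open(path, "w") as fp:
        fp.write("[PAI_RUNTIME_ERROR_START]\n")
        fp.write(body)
        fp.write("[PAI_RUNTIME_ERROR_END]\n")
```
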
## PAI Runtime Exit Info Schema
```
exitcode:
reason:
solution:
trigger:
originalUserExitCode:
caughtSignal:
caughtException:
matchedUserLogString:
dockerStatus:
errorLogs:
user:
runtime:
diskCleaner:
```
|field|description|example|
|-----|-----------|--------|
| **exitcode** | PAI ExitCode | 198, i.e. [CONTAINER_OOD_KILLED_BY_DISKCLEANER](../../src/job-exit-spec/config/job-exit-spec.md) |
| **reason** | Dynamic Reason | Current Container is killed due to it used most disk space and all container disk usage exceeded platform limit.<br>Current Container Disk Usage 30% (30GB), All Container Disk Usage 91% (91GB), Platform Disk Limit 90% (90GB). |
| **solution** | Dynamic Solution | All Container overused 1GB disk, but current container only use 30% disk, please retry on another node later. |
| **trigger** | Which error pattern triggered the error.<br>Should be one of the field names below. | diskCleanerErrorLog |
| **originalUserExitCode** | The original user process exit code.<br>Only shown to the user. It is preserved here instead of being exited with, in case it conflicts with the Runtime's own ExitCode Spec. | 1 |
| **caughtSignal** | Caught Signal | SIGKILL |
| **caughtException** | Caught Exception | TensorFlow StackTrace:... |
| **matchedUserLogString** | Matched user log string | RunScripts.sh: line 58: [: -gt: unary operator expected |
| **dockerStatus** | Docker Status | KILLED |
| **errorLogs** | The Runtime collects all local *.pai.error and error files and redirects them here. (This is best effort, because the runtime.pai.agg.error file size is limited.) | |
| **errorLogs.user** | From user stderr | 234346457 ERROR "get: `/dgfdg': No such file or directory"<br>234346458 ERROR "Tensorflow failed to init, exiting…" |
| **errorLogs.runtime** | From runtime.pai.error | 234346457 ERROR REASON "get: `/{user code dir}': No such file or directory" |
| **errorLogs.diskCleaner** | From diskCleaner.pai.error |234346457 ERROR ACTION "KILL"<br>234346457 ERROR REASON "Current Container is killed due to it used most disk space and all container disk usage exceeded platform limit. Current Container Disk Usage 30% (30GB), All Container Disk Usage 91% (91GB), Platform Disk Limit 90% (90GB)."<br>234346457 ERROR SOLUTION "All Container overused 1GB disk, but current container only use 30% disk, please retry on another node later."|
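
For illustration, a hedged sketch of how a consumer such as RestServer might locate and parse this object inside the aggregated diagnostics, using the landmark lines from the Design section; the function and variable names are hypothetical:

```
# Hedged sketch: extract the YAML object between the landmark lines and parse it.
import yaml

START_MARK = "[PAI_RUNTIME_ERROR_START]"
END_MARK = "[PAI_RUNTIME_ERROR_END]"

def parse_runtime_exit_info(diagnostics):
    start = diagnostics.find(START_MARK)
    end = diagnostics.find(END_MARK, start)
    if start == -1 or end == -1:
        return None  # No runtime exit info embedded in the diagnostics.
    body = diagnostics[start + len(START_MARK):end]
    # The trigger field then names which errorLogs entry explains the exit,
    # e.g. "diskCleanerErrorLog" in the example above.
    return yaml.safe_load(body)
```
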
Binary file added docs/rest-server/runtime-exit-spec.png
5 changes: 5 additions & 0 deletions src/cleaner/deploy/cleaner.yaml.template
@@ -42,6 +42,8 @@ spec:
volumeMounts:
- mountPath: /var/run/docker.sock
name: docker-socket
- mountPath: /logs
name: cleaner-logs
{%- if cluster_cfg['cluster']['common']['qos-switch'] == "true" %}
resources:
limits:
@@ -53,6 +55,9 @@ spec:
- name: docker-socket
hostPath:
path: /var/run/docker.sock
- name: cleaner-logs
hostPath:
path: {{ cluster_cfg["cluster"]["common"]["data-path"] }}/yarn/node/userlogs
tolerations:
- key: node.kubernetes.io/memory-pressure
operator: "Exists"
61 changes: 49 additions & 12 deletions src/cleaner/scripts/clean_docker.py
@@ -23,6 +23,7 @@
import multiprocessing
import re
import time
import os

class DockerCleaner(LoggerMixin):
def __init__(self, threshold, interval, timeout=timedelta(hours=1)):
@@ -54,44 +55,80 @@ def run(self):

def check_disk_usage(self, partition):
df = subprocess.Popen(["df","-h", partition], stdout=subprocess.PIPE)
size = 0
sized = 0
try:
for line in df.stdout:
splitline = line.decode().split()
if splitline[5] == partition:
size = int(splitline[4][:-1])
sized = splitline[1]
used = splitline[2]
usep = int(splitline[4][:-1])
except ValueError:
self.logger.error("cannot get disk size, reset size to 0")
size = 0
self.logger.info("Checking disk, disk usage = {0}%".format(size))
return size
sized = 0
used = 0
usep = 0
self.logger.info("Checking disk, disk usage = {0}%".format(usep))
return sized, used, usep


def check_and_clean(self):
if self.check_disk_usage("/") >= self.__threshold:
sized, used, usep = self.check_disk_usage("/")
if usep >= self.__threshold:
self.logger.info("Disk usage is above {0}%, Try to remove containers".format(self.__threshold))
self.kill_largest_container()
self.kill_largest_container(sized, used, usep)


# Clean logic v1: kill largest container
white_list = ["k8s_POD", "k8s_kube", "k8s_pylon", "k8s_zookeeper", "k8s_rest-server", "k8s_yarn", "k8s_hadoop", "k8s_job-exporter", "k8s_watchdog", "k8s_grafana", "k8s_node-exporter", "k8s_webportal", "k8s_prometheus", "k8s_nvidia-drivers", "k8s_etcd-container", "k8s_apiserver-container", "k8s_docker-cleaner", "kubelet", "dev-box"]
def kill_largest_container(self):
def kill_largest_container(self, sized, used, usep):
containers = []
# Only try to stop PAI jobs and user created containers
containers_source = subprocess.Popen(["docker", "ps", "-a", "--format", r'{{.ID}}\t{{.Image}}\t{{.Size}}\t{{.Names}}'], stdout=subprocess.PIPE)
containers_source = subprocess.Popen(["docker", "ps", "-a", "--format", r'{{.ID}}\t{{.Image}}\t{{.Size}}\t{{.Names}}\t'], stdout=subprocess.PIPE)
for line in containers_source.stdout:
splitline = line.split("\t")
for prefix in self.white_list:
if (splitline[3].startswith(prefix)):
break
else:
size = common.calculate_size(splitline[2].split()[0])
containers.append([size, splitline[0], splitline[1]])
# Only check job containers
if re.search(r"container(_\w+)?_\d+_\d+_\d+_\d+$", splitline[3]) is not None:
size_str = splitline[2].split()[0]
size = common.calculate_size(size_str)
containers.append([size, splitline[0], splitline[1], splitline[3], size_str])

containers.sort(key=lambda x:x[0], reverse=True)

if len(containers) > 0 and containers[0][0] > 1024**3:
self.logger.warning("Kill container {0} due to disk pressure. Container size: {1}".format(containers[0][1], containers[0][0]))
self.logger.warning("Kill container {0} due to disk pressure. Container size: {1}".format(containers[0][3], containers[0][4]))

# Write error log
# Derive the YARN container and application names from the docker container name,
# e.g. container_e12_1556000000000_0001_01_000002 -> application_1556000000000_0001,
# so the error file lands in that container's log directory under /logs.
container_name = re.search(r"container(_\w+)?_\d+_\d+_\d+_\d+$", containers[0][3]).group()
application_name = "application{0}".format(re.search(r"^_\d+_\d+", re.search(r"_\d+_\d+_\d+_\d+$", container_name).group()).group())
full_path = "/logs/{0}/{1}".format(application_name, container_name)

if not os.path.isdir(full_path):
self.logger.error("Cannot find job log dir, creating path. Log may not be collected.")
try:
os.makedirs(full_path)
except OSError as exc:
self.logger.error("Failed to create path {0}.".format(full_path))

if os.path.isdir(full_path):
error_filename = "{0}/diskCleaner.pai.error".format(full_path)
timestamp = int(time.time())
try:
fp = open(error_filename, "w")
except IOError:
self.logger.error("Failed to write error log, skipped")
else:
fp.writelines([
"{0} ERROR ACTION \"KILL\"\n".format(timestamp),
"{0} ERROR REASON \"{1} killed due to disk pressure. Disk size: {2}, Used: {3}, Cleaner threshold: {4}, Container cost: {5} \"\n".format(timestamp, container_name, sized, "{0}({1}%)".format(used, usep), "{0}%".format(self.__threshold), containers[0][4]),
"{0} ERROR SOLUTION \"Node disk is full, please try another time. If your job needs large space, please use NAS to store data.\"\n".format(timestamp)
])
fp.close()

subprocess.Popen(["docker", "kill", "--signal=10", containers[0][1]])

# Because docker stop will not immediately stop the container, we cannot remove the docker image right after stopping it
51 changes: 50 additions & 1 deletion src/hadoop-batch-job/deploy/hadoop-configuration/yarn-site.xml
@@ -151,7 +151,56 @@
<value>true</value>
</property>

<property>
<property>
<description>
Error filename pattern, to identify the file in the container's
log directory which contains the container's error log. Error file
redirection is done by the client/AM, so YARN is not aware of the error
file name. YARN uses this pattern to identify the error file and tails
the error log as diagnostics when the container execution returns a non-zero
value. Filename patterns are case sensitive and should match the
specifications of the FileSystem.globStatus(Path) API. If multiple filenames
match the pattern, the first matching file will be picked.
</description>
<name>yarn.nodemanager.container.stderr.pattern</name>
<value>runtime.pai.agg.error</value>
</property>
<property>
<description>
Size of the container error file which needs to be tailed, in bytes.
To avoid unexpected diagnostics truncation:
40KB = 10 * default(4096)
</description>
<name>yarn.nodemanager.container.stderr.tail.bytes</name>
<value>40960</value>
</property>
<property>
<description>
Maximum size of a container's diagnostics to keep for the container relaunch case.
To avoid unexpected diagnostics truncation:
100KB = 10 * default(10000)
</description>
<name>yarn.nodemanager.container-diagnostics-maximum-size</name>
<value>100000</value>
</property>
<property>
<description>
Defines the limit of the diagnostics message of an application
attempt, in kilo characters (character count * 1024).
When using ZooKeeper to store application state, it's
important to limit the size of the diagnostic messages to
prevent YARN from overwhelming ZooKeeper. In cases where
yarn.resourcemanager.state-store.max-completed-applications is set to
a large number, it may be desirable to reduce the value of this property
to limit the total data stored.
To avoid unexpected diagnostics truncation:
640KB = 10 * default(64)
</description>
<name>yarn.app.attempt.diagnostics.limit.kc</name>
<value>640</value>
</property>

<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>true</value>
</property>
@@ -128,6 +128,55 @@
<value>true</value>
</property>

<property>
<description>
Error filename pattern, to identify the file in the container's
log directory which contains the container's error log. Error file
redirection is done by the client/AM, so YARN is not aware of the error
file name. YARN uses this pattern to identify the error file and tails
the error log as diagnostics when the container execution returns a non-zero
value. Filename patterns are case sensitive and should match the
specifications of the FileSystem.globStatus(Path) API. If multiple filenames
match the pattern, the first matching file will be picked.
</description>
<name>yarn.nodemanager.container.stderr.pattern</name>
<value>runtime.pai.agg.error</value>
</property>
<property>
<description>
Size of the container error file which needs to be tailed, in bytes.
To avoid unexpected diagnostics truncation:
40KB = 10 * default(4096)
</description>
<name>yarn.nodemanager.container.stderr.tail.bytes</name>
<value>40960</value>
</property>
<property>
<description>
Maximum size of a container's diagnostics to keep for the container relaunch case.
To avoid unexpected diagnostics truncation:
100KB = 10 * default(10000)
</description>
<name>yarn.nodemanager.container-diagnostics-maximum-size</name>
<value>100000</value>
</property>
<property>
<description>
Defines the limit of the diagnostics message of an application
attempt, in kilo characters (character count * 1024).
When using ZooKeeper to store application state, it's
important to limit the size of the diagnostic messages to
prevent YARN from overwhelming ZooKeeper. In cases where
yarn.resourcemanager.state-store.max-completed-applications is set to
a large number, it may be desirable to reduce the value of this property
to limit the total data stored.
To avoid unexpected diagnostics truncation:
640KB = 10 * default(64)
</description>
<name>yarn.app.attempt.diagnostics.limit.kc</name>
<value>640</value>
</property>

<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>true</value>

