This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Add memory limit for all PAI services, make it 'Burstable' Qos class #1384

Merged
merged 27 commits into master from yuan/qos on Sep 27, 2018

Conversation

hao1939
Contributor

@hao1939 hao1939 commented Sep 17, 2018

Add memory limit for all PAI services. Fixes #506.

@coveralls

coveralls commented Sep 17, 2018

Coverage Status

Coverage increased (+8.9%) to 60.403% when pulling c6345b4 on yuan/qos into b5a674a on master.

@@ -63,20 +63,22 @@ spec:
value: {{ clusterinfo[ 'hadoopinfo' ][ 'hadoop_vip' ] }}
- name: TIMELINE_SERVER_ADDRESS
value: {{ clusterinfo[ 'hadoopinfo' ][ 'hadoop_vip' ] }}
- name: YARN_RESOURCEMANAGER_HEAPSIZE
value: "6144"
Member

We'd better raise the HEAPSIZE to 8G and change the k8s resources limit to 10G.

Contributor Author

Hi @mzmssg,
In our testing so far, the resource manager is not the bottleneck.
So may we keep it as it is?
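
For reference, a minimal sketch (assumed names and the reviewer's suggested 8G/10G numbers, not the merged values) of how the heap-size env pairs with a k8s memory limit in the deployment spec:

containers:
- name: hadoop-resource-manager      # illustrative container name
  env:
  - name: YARN_RESOURCEMANAGER_HEAPSIZE
    value: "8192"                    # MB; keep the JVM heap well below the container limit
  resources:
    requests:
      memory: "10Gi"                 # a request lower than the limit would make the pod 'Burstable' instead
    limits:
      memory: "10Gi"

The gap between the heap size and the container limit leaves headroom for non-heap JVM memory (metaspace, threads, direct buffers) so the container is not OOM-killed under normal load.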

@@ -46,6 +46,13 @@ spec:
- /jobstatus/jobok
initialDelaySeconds: 5
periodSeconds: 3
env:
- name: JAVA_OPTS
value: "-server -Xmx512m"
Member

The zookeeper container usually consumes 600+ MB; I suggest raising the JVM heap to 1G and setting the k8s limit to 1.5G.

Contributor Author

Good point; this is currently an initial value, we will adjust it soon.
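
A minimal sketch of the zookeeper settings under discussion, using the reviewer's suggested numbers (1G JVM heap, 1.5G container limit); the container name is illustrative and the merged values may differ:

containers:
- name: zookeeper                    # illustrative container name
  env:
  - name: JAVA_OPTS
    value: "-server -Xmx1024m"       # 1G heap, up from the initial 512m
  resources:
    limits:
      memory: "1536Mi"               # 1.5G limit leaves headroom above the heap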

@hao1939 hao1939 force-pushed the yuan/qos branch 4 times, most recently from ef5adba to 0dc3609 on September 18, 2018 05:58
@@ -26,7 +26,7 @@ COPY build/zoo.cfg /etc/zookeeper/conf/
COPY build/myid /

# Use sed to modify Zookeeper env variable to also log to the console
RUN sed -i '/^ZOO_LOG4J_PROP/ s:.*:ZOO_LOG4J_PROP="INFO,CONSOLE":' /usr/share/zookeeper/bin/zkEnv.sh
RUN sed -i '/^ZOO_LOG4J_PROP/ s:.*:ZOO_LOG4J_PROP="INFO,CONSOLE":' /etc/zookeeper/conf/environment
Contributor Author

Hi @mzmssg, which one should we keep?

@hao1939 hao1939 force-pushed the yuan/qos branch 5 times, most recently from f6d66dc to f2cfff4 on September 25, 2018 02:55
@@ -62,6 +62,13 @@ spec:
value: {{ clusterinfo[ 'frameworklauncher' ][ 'frameworklauncher_vip' ] }}
- name: FRAMEWORKLAUNCHER_PORT
value: "{{ clusterinfo[ 'frameworklauncher' ][ 'frameworklauncher_port' ] }}"
- name: LAUNCHER_OPTS
Member

Please add a comment here noting how many jobs and tasks 4Gi + -Xmx3072m -Djute.maxbuffer=49107800 can hold.

Contributor Author

Good point! I will collect some info.
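
A minimal sketch of how the launcher settings in this thread fit together, using the numbers from the comment above (the container name is illustrative):

containers:
- name: frameworklauncher            # illustrative container name
  env:
  - name: LAUNCHER_OPTS
    value: "-Xmx3072m -Djute.maxbuffer=49107800"   # 3G heap plus a larger ZooKeeper jute buffer
  resources:
    limits:
      memory: "4Gi"                  # container limit above the JVM heap

How many jobs and tasks this configuration can hold is the open question above and still needs to be measured.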

@hao1939
Contributor Author

hao1939 commented Sep 27, 2018

Merging the code first; will update the doc later.

@hao1939 hao1939 merged commit d9cf1d5 into master Sep 27, 2018
qyyy added a commit that referenced this pull request Sep 28, 2018
* [Rest-server]Add OS check for ssh-keygen and fix code size bug (#1399)

* add OS check for ssh-keygen + fix code size bug

* fix code size bug

* add callback if generate ssh keyfiles failed

* workaround cleanup failed

* Pylon: Fix rewrite of WebHDFS redirection (#1328) (#1407)

* Zhaoyu/deleted files (#1394)

* add list file script

* add file checker

* add file worker

* add commond line and test cases

* fix the test case

* fix typo and change configure

* leverage lsof to get deleted files

* change output

* change comments

* add test case

* add more test cases

* call the deleted file command directly

* fix scrape time env config (#1401)

* change name of launcher (#1404)

* make node-exporter's readiness probe less sensitive (#1412)

* REST Server: Refine error message transfering from upstream. (#1410)

* fix gpu num display bugs (#1411)

* Change code in framework luancher to support Specific Ports and Random port request co-exist.  (#1402)

* fixportsIssue

* fix minus issue

* fix ports

* fix minus issue

* fix coment

* adjust

* fix minus issue

fix minus issue

fix ports

fix coment

adjust

* fix CR comments

* refresh service (#1388)

* refresh all

* remove chmod

* extract public part

* update hadoop version to fix NodeManger GPU detect issue when ecc is turn off. (#1421)

* update hadoop version

* update

* add push

* [Deployment] service stop also check frameworklauncher-ds for backward compatibility (#1409)

* check frameworklauncher-ds for backward compatibility

* fix frameworklauncher rename backward capability

* fix travis markdown version error

* [Rest-server] Add job retry history (#1425)

* add job retry history link

* add job history link

* revert stop.sh change

* mount the code dir as readonly (#1422)

* Webportal: fix version display in PAIShare pages (#1424)

* Fix backward compatibility of killAllOnCompletedTaskNumber (#1329) (#1408)

* [PAIShare Doc] How-to-config-gitHubPAT.md (#1427)

* add Jenkinsfile

* Modify Jenkinsfile

* minor change

* minor change

* Add paishare test case in cluster test

* how-to-config-gitHubPAT.md

* minor change

* minor change to Jekinsfile

* Refine Images for githubPAT config

* Refine Images for githubPAT config2

* minor change

* resize image

* refine

* refine

* refine

* Minor change to image

* Add empty line

* Enable launcher ACL (#1150)

* Mount job router under user routes

* Allow job router read user params

* Add namespace to API endpoints

* Allow PUT execution type of legacy jobs

* Fix HDFS path

* Fix Docker container name

* Add namespace support to web portal

* Add default user name for legacy jobs

* enable ACL

* Add user namespace to Jenkins CI

* Fix backward compability for JobConfig & JobSSHInfo

* Support namespace in e2e test

* Fix e2e test

* fix test case after back-compat

* Fix launcher test

* Disable tildes in job name

* docs

* Fix stop job in detail page

* Fix job detail

* Fix doc link

* Lint

* support acl in submit V2

* fix unit test

* collect network metrics for containers (#1418)

* [aks] deploy dev-box as a daemonset choice for user (#1413)

* [aks] deploy dev-box as a daemonset

* rename dev-box.yaml file to dev-box-k8s-deploy.yaml

* remove docker mount / make it to pod /remove redundancy

* rename the dev-box name

* fix path at deploy doc for new code structure (#1398)

add / for pai/deployment

Update document after refactor. (#1397)

[Rest-server]Add OS check for ssh-keygen and fix code size bug (#1399)

* add OS check for ssh-keygen + fix code size bug

* fix code size bug

* add callback if generate ssh keyfiles failed

* [Launcher]: Recover the queue for TASK_COMPLETED tasks (#1432)

* Export config at job detail page (#1429)

* config export

* move label machine step from kubelet start into service deployment (#1403)

* fix dependencies check (#1430)

* add init option in docker run, help reaping zombie processes (#1435)

 add init option in docker run

* Add memory limit for all PAI services, make it 'Burstable' Qos class (#1384)

* set kubernetes memory eviction threshold

To reach that capacity, either some Pod is using more than its request,
or the system is using more than 3Gi - 1Gi = 2Gi.

* set those pods as 'Guaranteed' QoS:

node-exporter
hadoop-node-manager
hadoop-data-node
drivers-one-shot

* Set '--oom-score-adj=1000' for job container

so it would be OOM-killed first

* set those pods as 'Burstable' QoS:

prometheus
grafana

* set those pods as 'Guaranteed' QoS:

frameworklauncher
hadoop-jobhistory
hadoop-name-node
hadoop-resource-manager
pylon
rest-server
webportal
zookeeper

* adjust services memory limits

* add k8s services resource limit

* it seems 1g is not enough for launcher

* adjust hadoop-resource-manager limit

* adjust webportal memory limit

* adjust cpu limits

* rm yarn-exporter resource limits

* adjust prometheus limits

* adjust limits

* frameworklauncher: set JAVA_OPTS="-server -Xmx512m"

zookeeper: set JAVA_OPTS="-server -Xmx512m"

fix env name to JAVA_OPTS

fix zookeeper

* add heapsize limit for hadoop-data-node hadoop-jobhistory

* add xmx for hadoop

* modify memory limits

* reserve 40g for singlebox, else reserve 12g

* using LAUNCHER_OPTS

* revert zookeeper dockerfile

* adjust node manager memory limit

* drivers would take more memory when install

* increase memory for zookeeper and launcher

* set requests to a lower value

* comment it out, using the container env "YARN_RESOURCEMANAGER_HEAPSIZE"

* add comments

* fix dependency check (#1442)

* PAIShare opt-in (#1436)

* Set home page back to dashboard

* Add PAIShare optIn

* REST server: Allow user to set its own GitHub PAT

* Remove opt-in and add PAT notification

* Update how-to-config-github-pat.component.js

* Refine

* lint

* Zhaoyu/cleaner build deploy (#1441)

* add docker file

* fix cleaner dockerfile

* add deploy script

* add liveness probe

* fix liveness probe

* track stopped worker

* fix docker mount

* add probe period

* fix rule

* add delete refresh

* delete template

* change per the review comments

* change the cool down time to 1800 seconds
@hao1939 hao1939 deleted the yuan/qos branch October 9, 2018 03:46