This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

add init option in docker run, help reaping zombie processes #1435

Merged
merged 3 commits into from
Sep 27, 2018

Conversation

mzmssg
Member

@mzmssg mzmssg commented Sep 26, 2018

A hanging container stays alive after exit_handle runs.
[image]

It contains only the script process and 2 zombie processes. /bin/bash doesn't reap them correctly.
[image]

--init will overwrite the entrypoint with /dev/init, so the PID 1 process looks like /dev/init -- /bin/bash docker_bootstrap.sh. The init process reaps zombie processes correctly.
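
For readers unfamiliar with the term, here is a minimal sketch of what a zombie is and what "reaping" means, assuming a Linux host with /proc mounted. The helper names (`proc_state`, `make_zombie`, `reap`) are made up for illustration and are not part of PAI or Docker.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state field comes right after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie():
    """Fork a child that exits immediately; the parent does not wait on it yet."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit right away
    # Parent: poll (up to about 1 second) until the child shows up as a zombie ("Z").
    for _ in range(100):
        if proc_state(pid) == "Z":
            break
        time.sleep(0.01)
    return pid


def reap(pid):
    """Collect the child's exit status, which removes the zombie entry."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

A child stays in state `Z` until its parent calls one of the wait functions. Orphaned processes are re-parented to PID 1, so a PID 1 that never waits on them lets zombies accumulate, which matches the symptom described above.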

@mzmssg mzmssg changed the title Zimiao/docker zombie add init option in docker run Sep 26, 2018
@mzmssg mzmssg changed the title add init option in docker run add init option in docker run, help reaping zombie processes Sep 26, 2018
@coveralls

Coverage Status

Coverage increased (+3.3%) to 63.536% when pulling 11e9808 on zimiao/docker_zombie into bda1639 on master.

@coveralls

coveralls commented Sep 26, 2018

Coverage Status

Coverage increased (+9.8%) to 70.015% when pulling fbf59ab on zimiao/docker_zombie into bda1639 on master.

@fanyangCS
Contributor

Can you point us to a reference for your findings?

@mzmssg
Member Author

mzmssg commented Sep 26, 2018

@fanyangCS

This blog post discusses the PID 1 problem in docker run; it also mentions that bash has some issues handling signals:
https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

This describes the --init behavior in docker run:
https://stackoverflow.com/questions/43122080/how-to-use-init-parameter-in-docker-run

-  --entrypoint '/bin/bash' {{{ jobData.image }}} \
-  '/pai/bootstrap/docker_bootstrap.sh'
+  {{{ jobData.image }}} \
+  /bin/bash '/pai/bootstrap/docker_bootstrap.sh'
Contributor

Will this set the PID of "bash docker_bootstrap.sh" to 1? It seems that was already the case previously.

Member Author

Not exactly:
Without --init, the PID 1 process is bash docker_bootstrap.sh.
With --init, the PID 1 process is /dev/init, and bash docker_bootstrap.sh runs as its child process.
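
One way to check which process is PID 1 from inside a container is to read /proc/1/cmdline. The sketch below assumes Linux with /proc mounted; `parse_cmdline` and `read_cmdline` are hypothetical helper names, not something shipped with PAI or Docker.

```python
def parse_cmdline(raw):
    """Split the NUL-separated bytes of a /proc/<pid>/cmdline file into a list of arguments."""
    # Empty fields are dropped, so a genuinely empty argument would be lost;
    # that is acceptable for a quick inspection helper like this one.
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def read_cmdline(pid=1):
    """Return the argv of the given process; the default inspects PID 1."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return parse_cmdline(f.read())
```

Per the description above, calling `read_cmdline()` inside the job container would be expected to start with `/dev/init` when `--init` is used, and with the bash command otherwise.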

Contributor

Member Author

@mzmssg mzmssg Sep 26, 2018

In my testing, --init fixes the suicide failure as well. It is a good solution to such issues.

https://stackoverflow.com/questions/49162358/docker-init-zombies-why-does-it-matter

krallin/tini#8 (comment)

Contributor

https://github.com/krallin/tini#understanding-tini
If the child process exits, init will exit as well.
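
To illustrate the behavior discussed here, the following is a rough Python sketch of an init-style wrapper. It is not tini's actual implementation: signal forwarding is left out, and `run_as_init` is a made-up name. It starts one main child, reaps whatever children exit, and returns as soon as the main child is gone.

```python
import os
import sys


def run_as_init(argv):
    """Run argv as the main child, reap exited children, and return the main child's exit code."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)  # replace the forked child with the command
        finally:
            os._exit(127)  # only reached if exec failed

    while True:
        # Blocks until any child exits. When this process really is PID 1,
        # that also covers orphans that were re-parented to it.
        pid, status = os.wait()
        if pid == main_pid:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)  # main child was killed by a signal


if __name__ == "__main__":
    sys.exit(run_as_init(sys.argv[1:] or ["true"]))
```

Returning when the main child exits mirrors the point made above: once the wrapped command is done, the init process exits too, so the container stops instead of hanging.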

@mzmssg mzmssg requested a review from DongZhaoYu September 26, 2018 11:39
@mzmssg mzmssg merged commit e10b0b4 into master Sep 27, 2018
qyyy added a commit that referenced this pull request Sep 28, 2018
* [Rest-server]Add OS check for ssh-keygen and fix code size bug (#1399)

* add OS check for ssh-keygen + fix code size bug

* fix code size bug

* add callback if generate ssh keyfiles failed

* workaround cleanup failed

* Pylon: Fix rewrite of WebHDFS redirection (#1328) (#1407)

* Zhaoyu/deleted files (#1394)

* add list file script

* add file checker

* add file worker

* add commond line and test cases

* fix the test case

* fix typo and change configure

* leverage lsof to get deleted files

* change output

* change comments

* add test case

* add more test cases

* call the deleted file command directly

* fix scrape time env config (#1401)

* change name of launcher (#1404)

* make node-exporter's readiness probe less sensitive (#1412)

* REST Server: Refine error message transfering from upstream. (#1410)

* fix gpu num display bugs (#1411)

* Change code in framework luancher to support Specific Ports and Random port request co-exist.  (#1402)

* fixportsIssue

* fix minus issue

* fix ports

* fix minus issue

* fix coment

* adjust

* fix minus issue

fix minus issue

fix ports

fix coment

adjust

* fix CR comments

* refresh service (#1388)

* refresh all

* remove chmod

* extract public part

* update hadoop version to fix NodeManger GPU detect issue when ecc is turn off. (#1421)

* update hadoop version

* update

* add push

* [Deployment] service stop also check frameworklauncher-ds for backward compatibility (#1409)

* check frameworklauncher-ds for backward compatibility

* fix frameworklauncher rename backward capability

* fix travis markdown version error

* [Rest-server] Add job retry history (#1425)

* add job retry history link

* add job history link

* revert stop.sh change

* mount the code dir as readonly (#1422)

* Webportal: fix version display in PAIShare pages (#1424)

* Fix backward compatibility of killAllOnCompletedTaskNumber (#1329) (#1408)

* [PAIShare Doc] How-to-config-gitHubPAT.md (#1427)

* add Jenkinsfile

* Modify Jenkinsfile

* minor change

* minor change

* Add paishare test case in cluster test

* how-to-config-gitHubPAT.md

* minor change

* minor change to Jekinsfile

* Refine Images for githubPAT config

* Refine Images for githubPAT config2

* minor change

* resize image

* refine

* refine

* refine

* Minor change to image

* Add empty line

* Enable launcher ACL (#1150)

* Mount job router under user routes

* Allow job router read user params

* Add namespace to API endpoints

* Allow PUT execution type of legacy jobs

* Fix HDFS path

* Fix Docker container name

* Add namespace support to web portal

* Add default user name for legacy jobs

* enable ACL

* Add user namespace to Jenkins CI

* Fix backward compability for JobConfig & JobSSHInfo

* Support namespace in e2e test

* Fix e2e test

* fix test case after back-compat

* Fix launcher test

* Disable tildes in job name

* docs

* Fix stop job in detail page

* Fix job detail

* Fix doc link

* Lint

* support acl in submit V2

* fix unit test

* collect network metrics for containers (#1418)

* [aks] deploy dev-box as a daemonset choice for user (#1413)

* [aks] deploy dev-box as a daemonset

* rename dev-box.yaml file to dev-box-k8s-deploy.yaml

* remove docker mount / make it to pod /remove redundancy

* rename the dev-box name

* fix path at deploy doc for new code structure (#1398)

add / for pai/deployment

Update document after refactor. (#1397)

[Rest-server]Add OS check for ssh-keygen and fix code size bug (#1399)

* add OS check for ssh-keygen + fix code size bug

* fix code size bug

* add callback if generate ssh keyfiles failed

* [Launcher]: Recover the queue for TASK_COMPLETED tasks (#1432)

* Export config at job detail page (#1429)

* config export

* move label machine step from kubelet start into service deployment (#1403)

* fix dependencies check (#1430)

* add init option in docker run, help reaping zombie processes (#1435)

 add init option in docker run

* Add memory limit for all PAI services, make it 'Burstable' Qos class (#1384)

* set kubernetes memory eviction threshold

To reach that capacity, either some Pod is using more than its request,
or the system is using more than 3Gi - 1Gi = 2Gi.

* set those pods as 'Guaranteed' QoS:

node-exporter
hadoop-node-manager
hadoop-data-node
drivers-one-shot

* Set '--oom-score-adj=1000' for job container

so it would oom killed first

* set those pods as 'Burstable' QoS:

prometheus
grafana

* set those pods as 'Guaranteed' QoS:

frameworklauncher
hadoop-jobhistory
hadoop-name-node
hadoop-resource-manager
pylon
rest-server
webportal
zookeeper

* adjust services memory limits

* add k8s services resource limit

* seem 1g is not enough for launcher

* adjust hadoop-resource-manager limit

* adjust webportal memory limit

* adjust cpu limits

* rm yarn-exporter resource limits

* adjuest prometheus limits

* adjust limits

* frameworklauncher: set JAVA_OPTS="-server -Xmx512m"

zookeeper: set JAVA_OPTS="-server -Xmx512m"

fix env name to JAVA_OPTS

fix zookeeper

* add heapsize limit for hadoop-data-node hadoop-jobhistory

* add xmx for hadoop

* modify memory limits

* reserve 40g for singlebox, else reserve 12g

* using LAUNCHER_OPTS

* revert zookeeper dockerfile

* adjust node manager memory limit

* drivers would take more memory when install

* increase memory for zookeeper and launcher

* set requests to a lower value

* comment it out, using the continer env "YARN_RESOURCEMANAGER_HEAPSIZE"

* add comments

* fix dependency check (#1442)

* PAIShare opt-in (#1436)

* Set home page back to dashboard

* Add PAIShare optIn

* REST server: Allow user to set its own GitHub PAT

* Remove opt-in and add PAT notification

* Update how-to-config-github-pat.component.js

* Refine

* lint

* Zhaoyu/cleaner build deploy (#1441)

* add docker file

* fix cleaner dockerfile

* add deploy script

* add liveness probe

* fix liveness probe

* track stopped worker

* fix docker mount

* add probe period

* fix rule

* add delete refresh

* delete template

* change per the review comments

* change the cool down time to 1800 seconds
@xudifsd xudifsd deleted the zimiao/docker_zombie branch May 24, 2019 02:13