add init option in docker run, help reaping zombie processes #1435
Can you point us to a reference for your findings?
This is a blog post about the PID 1 issue in docker run; it also mentions that bash has some issues handling signals. This is
--entrypoint '/bin/bash' {{{ jobData.image }}} \
'/pai/bootstrap/docker_bootstrap.sh'
{{{ jobData.image }}} \
/bin/bash '/pai/bootstrap/docker_bootstrap.sh'
Will this set the PID of "bash docker_bootstrap.sh" to 1? It seems that was already the case previously.
Not actually:
Without --init, the PID 1 process is bash docker_bootstrap.sh.
With --init, the PID 1 process is /dev/init, and bash docker_bootstrap.sh will be its subprocess.
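One way to check which process ends up as PID 1 is to read /proc/1/cmdline from inside the container. The snippet below is a minimal sketch, not part of this PR; the function name pid1_cmdline is made up for illustration, and it assumes a Linux /proc filesystem.

```shell
#!/bin/sh
# pid1_cmdline is a hypothetical helper name, not something defined in PAI.
# It prints the command line of PID 1 in the current PID namespace.
# /proc/<pid>/cmdline separates arguments with NUL bytes, so convert them to spaces.
pid1_cmdline() {
    tr '\0' ' ' < /proc/1/cmdline
    echo
}

pid1_cmdline
```

Run inside a job container started without --init, this should show the bash docker_bootstrap.sh command line; with --init it should show the init binary first. The path of that binary (/dev/init here) may differ between Docker versions.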
In my testing, --init can fix the suicide failure as well. It is a good solution to such issues.
https://stackoverflow.com/questions/49162358/docker-init-zombies-why-does-it-matter
https://github.com/krallin/tini#understanding-tini. If the child process exits, init will exit as well.
The container hangs and stays alive after exit_handle.
Only the script process and 2 zombie processes are left.

/bin/bash doesn't reap them correctly.
--init will overwrite entrypoint with /dev/init, so the PID 1 process looks like /dev/init -- /bin/bash docker_bootstrap.sh.
The init process reaps zombie processes correctly.
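To see whether zombies are actually piling up in a container, one option is to count processes whose state is Z. The following is a rough sketch under the assumption of a Linux /proc filesystem; count_zombies is a hypothetical helper name and is not part of this PR.

```shell
#!/bin/sh
# count_zombies is a hypothetical helper name used only for illustration.
# It counts processes in the current PID namespace whose state is Z (zombie),
# based on the "State:" line of /proc/<pid>/status.
count_zombies() {
    count=0
    for status in /proc/[0-9]*/status; do
        # A process may exit while we iterate, so ignore read errors.
        state=$(awk '/^State:/ {print $2; exit}' "$status" 2>/dev/null)
        if [ "$state" = "Z" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_zombies
```

If the description above holds, running this in a hanging container started without --init should report the leftover zombies, while a container started with --init should not accumulate them, since the init process reaps orphaned children.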