
Running prefect local agent in a docker container leads to zombie apocalypse ;-) #2418

Closed
mcg1969 opened this issue Apr 26, 2020 · 10 comments · Fixed by #2925

mcg1969 commented Apr 26, 2020

Description

I'm running prefect agent inside a Docker container with local execution. Each run leaves a zombie process behind, which, if left unchecked, eventually causes deleterious effects. I noticed this because I was at one point unable to ssh into the node on which the container was running.

Expected Behavior

Completed child processes should somehow be reaped so that zombies do not accumulate.

Reproduction

My shell script does

exec prefect agent start -t $prefect_runner_token

(note: removing the exec doesn't help). Here's a simple script to create a flow that runs on a schedule:

import prefect
from prefect import Flow, task
from prefect.schedules import IntervalSchedule
from datetime import timedelta, datetime

import time

schedule = IntervalSchedule(
    start_date=datetime.utcnow() + timedelta(seconds=1),
    interval=timedelta(minutes=2),
)

@task
def run():
    logger = prefect.context.get("logger")
    results = []
    for x in range(3):
        results.append(str(x + 1))
        logger.info("Hello! run {}".format(x + 1))
        time.sleep(3)
    return results

with Flow("Hello", schedule=schedule) as flow:
    results = run()

flow.register(project_name="Hello")

Environment

The container is built on CentOS 7.3. It does not have an init process.

{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.10",
    "prefect_version": "0.10.4",
    "python_version": "3.8.2"
  }
}

mcg1969 commented Apr 26, 2020

What I am finding is that each flow run produces three subprocesses. The process with the smallest PID takes the longest to run and does seem to be reaped eventually. The other two processes exit more quickly but are never reaped, so each flow run adds a net of two zombies.
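
As a quick way to see this from inside the container, the snippet below counts defunct processes with psutil (which is already in the environment shared below); this is just a diagnostic sketch, not part of the agent:

import psutil

# Processes in the zombie (defunct) state: they have exited but were never reaped.
zombies = [
    p.info
    for p in psutil.process_iter(attrs=["pid", "ppid", "name", "status"])
    if p.info["status"] == psutil.STATUS_ZOMBIE
]
print(len(zombies), "zombie(s):", zombies)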

@joshmeek

Congratulations @mcg1969, I think this means that you are patient zero! I will look into this behavior. What are you using as the base image for your container?

mcg1969 commented Apr 27, 2020

I'm afraid I can't share the exact container, though I don't mind that you know it's the one that we use inside of Anaconda Enterprise, and @jcrist might have some familiarity with that. That said, it's based on a CentOS 7.3 base image, with Miniconda installed within. I'm happy to share the precise conda environment I was using too if that helps.

@joshmeek

No worries! I was only wondering whether it had some possibly weird dependencies, but this is enough information to go on 😄

mcg1969 commented Apr 27, 2020

Here's the conda environment, re-creatable with

conda create -n testprefect -c defaults -c conda-forge --file ...

The file:

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
appdirs=1.4.3=pyh91ea838_0
asn1crypto=1.3.0=py38_0
ca-certificates=2020.1.1=0
certifi=2020.4.5.1=py38_0
cffi=1.14.0=py38h2e261b9_0
chardet=3.0.4=py38_1003
click=7.1.1=py_0
cloudpickle=1.2.2=py_0
croniter=0.3.30=py_0
cryptography=2.8=py38h1ba5d50_0
cytoolz=0.10.1=py38h7b6447c_0
dask-core=2.14.0=py_0
distributed=2.14.0=py38_0
docker-py=4.2.0=py38_0
docker-pycreds=0.4.0=py_0
heapdict=1.0.1=py_0
idna=2.9=py_1
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libstdcxx-ng=9.1.0=hdf63c60_0
marshmallow=3.5.1=py_0
marshmallow-oneofschema=2.0.1=py_0
msgpack-python=1.0.0=py38hfd86e86_1
mypy_extensions=0.4.3=py38_0
ncurses=6.2=he6710b0_0
openssl=1.1.1g=h7b6447c_0
packaging=20.3=py_0
pendulum=2.1.0=py38_1
pip=20.0.2=py38_1
prefect=0.10.4=py_0
psutil=5.7.0=py38h7b6447c_0
pycparser=2.20=py_0
pyopenssl=19.1.0=py38_0
pyparsing=2.4.6=py_0
pysocks=1.7.1=py38_0
python=3.8.2=hcf32534_0
python-box=4.2.2=py_0
python-dateutil=2.8.1=py_0
python-slugify=3.0.4=py_0
pytz=2019.3=py_0
pytzdata=2019.3=py_0
pyyaml=5.3.1=py38h7b6447c_0
readline=8.0=h7b6447c_0
requests=2.23.0=py38_0
ruamel.yaml=0.16.10=py38h7b6447c_1
ruamel.yaml.clib=0.2.0=py38h7b6447c_0
setuptools=46.1.3=py38_0
six=1.14.0=py38_0
sortedcontainers=2.1.0=py38_0
sqlite=3.31.1=h62c20be_1
tabulate=0.8.3=py38_0
tblib=1.6.0=py_0
text-unidecode=1.3=py_0
tk=8.6.8=hbc83047_0
toml=0.10.0=pyh91ea838_0
toolz=0.10.0=py_0
tornado=6.0.4=py38h7b6447c_1
unidecode=1.1.1=py_0
urllib3=1.25.8=py38_0
websocket-client=0.57.0=py38_1
wheel=0.34.2=py38_0
xz=5.2.5=h7b6447c_0
yaml=0.1.7=had09818_2
zict=2.0.0=py_0
zlib=1.2.11=h7b6447c_3

mcg1969 commented Apr 27, 2020

I wouldn't say there's anything about the container I would expect to cause problems. Anything is possible, of course. But the container doesn't have an init process.

@lauralorenz added the bug label Apr 29, 2020

mcg1969 commented May 11, 2020

I have been able to verify that adding an init process like tini (https://github.com/krallin/tini) to the container, and running everything under that, reaps the zombies properly.
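
For reference, the two usual ways to put an init process in front of the agent look roughly like this (the tini path is illustrative and depends on how it is installed; this is a sketch, not our exact setup):

# Option 1: have Docker inject tini as PID 1 at run time (Docker 1.13+)
docker run --init my-agent-image

# Option 2: bake tini into the image (Dockerfile fragment);
# token and other agent arguments omitted here
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["prefect", "agent", "start"]

With either option tini runs as PID 1 and waits on any orphaned children, which is exactly the reaping that was missing.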

jcrist commented May 12, 2020

Glad to hear it! Currently it looks like we're implicitly relying on the init process to prune orphaned processes (which IMO is fine, if not ideal). We could possibly fix this in the future, but for now I think I'm fine saying that we require an init process when using the local agent. Leaving it open though. Thanks for the report @mcg1969!
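
To sketch what an in-process fix could look like (this is only an illustration of the general technique, not what the agent does today and not necessarily what an eventual fix would do): a process running as PID 1 can reap orphans itself by handling SIGCHLD, e.g.

import os
import signal

def _reap_children(signum, frame):
    # Reap any exited children without blocking so they never linger as zombies.
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, _reap_children)

The catch is that reaping indiscriminately like this can race with the agent's own subprocess handling (e.g. Popen.wait()), which is part of why deferring to a real init such as tini is the simpler answer.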

mcg1969 commented May 12, 2020

I think that's reasonable—a doc fix would be great to consider!

@joshmeek added the docs label and removed the bug label May 19, 2020
@lauralorenz

Just adding here from IRL convo: we think the docs note should be on the page describing the local agent.
