Zombie process exception #165

Open
5agado opened this issue May 29, 2024 · 4 comments

Comments


5agado commented May 29, 2024

Describe the bug
Getting a zombie-process exception, as already reported for the sagemaker-inference-toolkit.

To reproduce
Using 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker with a custom inference script in a batch-transform job triggers this error. Even a simple time.sleep(60) at the top of the inference.py script is enough to trigger it.
A custom requirements.txt file also needs to be provided alongside the custom inference script, as in the sketch below.
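
For reference, a minimal inference.py of the kind described above (a sketch only; the handler names follow the SageMaker PyTorch serving toolkit's default contract, and the no-op bodies are placeholders):

```python
# inference.py -- minimal repro sketch
import time

# A long import-time delay is enough to trigger the zombie-process error:
# the container scans for the TorchServe process while workers are still
# starting up (or have just exited and not yet been reaped).
time.sleep(60)


def model_fn(model_dir):
    # Placeholder model load; a real script would load weights from model_dir.
    return None


def predict_fn(data, model):
    # Placeholder pass-through predictor.
    return data
```

As noted above, a requirements.txt (even a trivial one) must be packaged alongside the script for the error to reproduce.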

Here is the full traceback:

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

System information

  • SageMaker model image: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker
  • SageMaker model mode: single-model
  • Batch-transform instance type: ml.g4dn.2xlarge
  • Batch-transform invocation timeout in seconds: 600

rauldiaz commented Jun 4, 2024

Hi! Any luck solving this? I am in the same situation with the pytorch-inference:2.2.0-cpu-py310 image: in serverless mode I get the same error. I can, however, deploy it to a real-time endpoint.


5agado commented Jun 4, 2024

@rauldiaz this was the fix, but it wasn't properly propagated to the instances.

The recent releases would have solved all these issues if they had updated sagemaker-pytorch-inference; instead it is still pinned to 2.0.23 :/

See also the ongoing conversation here.
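
For context, the failing call at the bottom of the traceback is psutil's Process.cmdline(), which raises psutil.ZombieProcess when a process has exited but has not yet been reaped by its parent. Below is a hedged sketch of the kind of defensive process scan such a fix applies; it is not the exact upstream patch, and the TS_NAMESPACE value is assumed from the toolkit's torchserve.py:

```python
import psutil

# Namespace string the serving container looks for in TorchServe's command
# line (value assumed from the toolkit's torchserve.py).
TS_NAMESPACE = "org.pytorch.serve.ModelServer"


def retrieve_ts_server_process():
    """Find the TorchServe process, skipping processes that die mid-scan."""
    for process in psutil.process_iter():
        try:
            if TS_NAMESPACE in process.cmdline():
                return process
        except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
            # A process can exit (or become unreadable) between enumeration
            # and cmdline(); skip it instead of letting the exception abort
            # the whole retry loop, as in the traceback above.
            continue
    raise RuntimeError("TorchServe model server was not found")
```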

@Itto1992

I am using 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker-v1.12 in serverless mode and encountered this error. I also tried an older image (763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8), but inference invocations still failed. Has anyone solved this problem?

@Itto1992

Hi there,

I wanted to report that after a night, the issue seemed to resolve itself, and everything was working fine. However, when I updated the endpoint, the same error occurred again. Is this a known issue that tends to happen shortly after deployment?

Thanks!
