
KubernetesJobTask "No pod scheduled" error #2570

Closed
StasDeep opened this issue Nov 5, 2018 · 6 comments · Fixed by #2813

StasDeep (Contributor) commented Nov 5, 2018

I often see the following error when using KubernetesJobTask:

[28:luigi-interface:224@18:33] [ERROR] [pid 28] Worker Worker(salt=476402868, workers=20, host=something-host-1541289600-zntd7, username=root, pid=1) failed    Something(local_execution=False)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 205, in run
    new_deps = self._run_get_new_deps()
  File "/opt/conda/lib/python3.6/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
    task_gen = self.task.run()
  File "/usr/src/app/phpipeline/luigi_util.py", line 231, in run
    super(PHKubernetesJobTask, self).run()
  File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 355, in run
    self.__track_job()
  File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 199, in __track_job
    while not self.__verify_job_has_started():
  File "/opt/conda/lib/python3.6/site-packages/luigi/contrib/kubernetes.py", line 261, in __verify_job_has_started
    assert len(pods) > 0, "No pod scheduled by " + self.uu_name
AssertionError: No pod scheduled by something-20181104183346-b51d371a5bfd4197
[1:luigi-interface:570@18:33] [INFO] Informed scheduler that task   Something_42b6a6d55a   has status   FAILED

It is hard to reproduce, but it seems that sometimes the pod needs a bit more time to be created; the task does not wait for it and ends up raising the error. The task gets FAILED status, but the pod is still created and runs outside Luigi's control.

When the task is restarted and succeeds, a new pod runs under Luigi's control, but the original uncontrolled pod is still there, so we end up with two pods doing the same thing.

I've managed to fix it in the most naive way. When getting pods to verify that they have started, instead of:

pods = self.__get_pods()

I do the following:

from time import sleep  # needed at module level

for _ in range(3):
    pods = self._KubernetesJobTask__get_pods()

    if pods:
        break

    # No pod returned yet; sleep to wait for the pod to be created.
    sleep(15)

This is not a beautiful way of fixing the issue, but it works (though it will certainly fail if the pod needs more than 45 seconds to start).
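If longer startup times are a concern, the same retry loop can be sketched with exponential backoff instead of a fixed 15-second sleep. This is just a variant of the snippet above, not anything from luigi itself, and the attempt count and delays are arbitrary:

from time import sleep

delay = 5
for attempt in range(6):
    pods = self._KubernetesJobTask__get_pods()
    if pods:
        break
    sleep(delay)
    delay *= 2  # double the wait before the next attempt: 5, 10, 20, ... seconds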

keisuke-umezawa commented

I have the same issue.

StasDeep (Contributor, Author) commented

@keisuke-umezawa I've managed to fix it with the code above. Just create a custom KubernetesJobTask subclass with an overridden __verify_job_has_started method.

keisuke-umezawa commented

@StasDeep Thanks! Do you have an example of how to override __verify_job_has_started?

StasDeep (Contributor, Author) commented

Yep. Something like this:

from time import sleep

from luigi.contrib.kubernetes import KubernetesJobTask


class MyKubernetesJobTask(KubernetesJobTask):

    def _KubernetesJobTask__verify_job_has_started(self):
        """
        Does the same thing as the base method, but tries to get pods three
        times instead of once, so an empty list of pods (and the resulting
        AssertionError) is less likely. Sleeps 15 seconds between attempts.
        """
        # Verify that the job started
        self._KubernetesJobTask__get_job()

        # Verify that the pod started, retrying up to three times
        pods = []
        for _ in range(3):
            pods = self._KubernetesJobTask__get_pods()

            if pods:
                break

            # No pod returned yet; sleep to wait for the pod to be created.
            sleep(15)

        assert len(pods) > 0, "No pod scheduled by " + self.uu_name
        # Return True so the `while not ...` loop in __track_job can exit.
        return True

Note that __verify_job_has_started, __get_job, etc. begin with two underscores, so Python name-mangles them to _KubernetesJobTask__verify_job_has_started and so on; don't forget the _KubernetesJobTask prefix when overriding or invoking them.
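For anyone unfamiliar with name mangling, here is a minimal standalone illustration (unrelated to luigi) of why the override above must use the mangled name:

class Base:
    def __helper(self):           # stored on the class as _Base__helper
        return "base"

class Child(Base):
    def _Base__helper(self):      # overrides Base's mangled name
        return "child"

print(Child()._Base__helper())   # prints "child"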

dlstadther (Collaborator) commented

It seems reasonable to me to allow some amount of wait time and some number of attempts for a pod to become available. Of course, it should be configurable. Happy to review a PR!
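A minimal sketch of what such a configurable override might look like, combining the retry loop above with luigi parameters. The parameter names here are hypothetical, chosen for illustration; they are not luigi's actual API:

import time

import luigi
from luigi.contrib.kubernetes import KubernetesJobTask


class ConfigurableKubernetesJobTask(KubernetesJobTask):
    # Hypothetical parameter names, for illustration only.
    pod_poll_attempts = luigi.IntParameter(default=3, description="How many times to look for the pod")
    pod_poll_interval = luigi.IntParameter(default=15, description="Seconds to sleep between attempts")

    def _KubernetesJobTask__verify_job_has_started(self):
        # Verify that the job object exists.
        self._KubernetesJobTask__get_job()

        # Poll for the pod instead of checking exactly once.
        pods = []
        for _ in range(self.pod_poll_attempts):
            pods = self._KubernetesJobTask__get_pods()
            if pods:
                break
            time.sleep(self.pod_poll_interval)

        assert len(pods) > 0, "No pod scheduled by " + self.uu_name
        return True

Since these are ordinary luigi parameters, they could then be tuned per deployment in luigi.cfg (under the task's class-name section) without touching code.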

stale bot commented Jun 22, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.
