-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pod_creation_wait_interal #2813
Conversation
@honnix @Tarrasch @dlstadther Could someone please review? |
def __track_job(self): | ||
"""Poll job status while active""" | ||
while not self.__verify_job_has_started(): | ||
time.sleep(self.__DEFAULT_POLL_INTERVAL) | ||
time.sleep(self.poll_interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you have a formatting bug on L287 of this patch. You want to use '%s' instead of '{}' and '%' instead of comma to be stylistically consistent with the rest of the file.
I too am running into this one while running one of our more established pipelines btw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, i fixed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. Timestamp issue.
@@ -276,6 +281,11 @@ def __verify_job_has_started(self): | |||
|
|||
# Verify that the pod started | |||
pods = self.__get_pods() | |||
if not pods: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow the logic here. How would the sleep
impact the assertion afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @honnix . Seems like you'd want to re-eval pods=self.__get_pods()
before asserting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks
luigi/contrib/kubernetes.py
Outdated
@@ -213,10 +213,15 @@ def poll_interval(self): | |||
"""How often to poll Kubernetes for job status, in seconds.""" | |||
return self.__DEFAULT_POLL_INTERVAL | |||
|
|||
@property | |||
def pod_creation_wait_interal(self): | |||
"""Delay for initial pod creation for just submitted job""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a constant similar as __DEFAULT_POLL_INTERVAL
, also in docstring please describe the unit so it would be easier to follow.
@honnix @dlstadther I believe your comments are resolved, could you please review again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the lack of approval. I think I misunderstood the first time around and would like some clarification and verification.
def __track_job(self): | ||
"""Poll job status while active""" | ||
while not self.__verify_job_has_started(): | ||
time.sleep(self.__DEFAULT_POLL_INTERVAL) | ||
time.sleep(self.poll_interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might have steered you wrong... Or maybe i misunderstood the code the first time.
I'm looking at this more closely and have noticed that now each iteration of this while not self.__verify_job_has_started()
we'll wait both pod_creation_wait_interval
seconds AND poll_interval
seconds.
This doesn't seem desirable. It also isn't clear from the user perspective - for 2 intervals to both impact the same wait time.
Am i missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dlstadther
Here's how I see this:
time.sleep
on this line handling the case when we applied job to kubernetes API, but cluster state has not yet changed to reflect those changes and so no pods yet exist. This call will return empty list and the whole luigi Task will fail whereas after couple of seconds cluster state will catch up with our changes and actual KubernetesJob will run unsupervised by luigi.
And so time.sleep
on this line is for case when pod is already created but containers within it are not ready. So in theory we won't wait on both intervals, when pods are ready code under this if clause won't execute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand the explanation. Most likely this new added sleep
would only happen once. Either afterwards we see pods and will always see while waiting for container running state, or we won't and assertion will be made.
@honnix @dlstadther @pisymbol ping |
LGTM. It would be nicer if we could separate waiting for pod to show up and container to be running. |
@honnix Sorry, could you please clarify is this a change request or just a comment? Also what are the odds that this PR will be merged anytime soon? |
@plizmol It's more as a comment. I'm OK with this change. What do you think @dlstadther ? |
I'm ok with this. If someone doesn't like it in the future, they can submit another PR to separate the delays. |
Description
Add pod_creation_wait_interal property to give Kubernetes cluster some time to create pods for Job.
Motivation and Context
Fix #2570, also aimed to replace abandoned #2664
Have you tested this? If so, how?
I ran my jobs with this code and it works for me.