
Add pod_creation_wait_interal #2813

Merged: 3 commits into spotify:master on Nov 20, 2019

Conversation

@plizmol (Contributor) commented Oct 28, 2019

Description

Add a pod_creation_wait_interal property to give the Kubernetes cluster some time to create pods for a just-submitted Job.

Motivation and Context

Fixes #2570; also aims to replace the abandoned #2664.

Have you tested this? If so, how?

I ran my jobs with this code and it works for me.
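For context, a minimal sketch of how a user-defined task might override the new property. The class, image, and values below are hypothetical; KubernetesJobTask, name, and spec_schema are the existing luigi API, and the property name keeps the PR's spelling:

from luigi.contrib.kubernetes import KubernetesJobTask


class EchoJob(KubernetesJobTask):
    """Hypothetical task used only to illustrate the new property."""

    @property
    def name(self):
        return "echo-job"

    @property
    def spec_schema(self):
        # Minimal Kubernetes Job spec: one short-lived busybox container.
        return {
            "containers": [{
                "name": "echo-job",
                "image": "busybox",
                "command": ["echo", "hello"],
            }]
        }

    @property
    def pod_creation_wait_interal(self):
        # Seconds to wait after submitting the Job before polling for its pods.
        return 10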

@plizmol (Contributor, Author) commented Oct 30, 2019

@honnix @Tarrasch @dlstadther Could someone please review?

     def __track_job(self):
         """Poll job status while active"""
         while not self.__verify_job_has_started():
-            time.sleep(self.__DEFAULT_POLL_INTERVAL)
+            time.sleep(self.poll_interval)
Member:
👍

@pisymbol commented Nov 5, 2019

I believe you have a formatting bug on L287 of this patch. You want to use '%s' instead of '{}' and '%' instead of comma to be stylistically consistent with the rest of the file.

I too am running into this issue while running one of our more established pipelines, by the way.
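For illustration, the change being requested is roughly the following; the log message here is a paraphrase, not the verbatim line from the patch:

# str.format-style placeholder with a comma-separated argument (flagged as
# inconsistent; '{}' is not interpolated by the standard logging calls anyway):
self.__logger.debug("No pods found for {}, waiting for cluster state", self.uu_name)

# printf style, consistent with the rest of the file:
self.__logger.debug("No pods found for %s, waiting for cluster state" % self.uu_name)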

@plizmol (Contributor, Author) replied:

Thanks, I fixed it.

@pisymbol commented Nov 5, 2019

Never mind; timestamp issue.

@@ -276,6 +281,11 @@ def __verify_job_has_started(self):
 
         # Verify that the pod started
         pods = self.__get_pods()
+        if not pods:
Member:

Not sure I follow the logic here. How would the sleep impact the assertion afterwards?

Collaborator:

I'm with @honnix. Seems like you'd want to re-evaluate pods = self.__get_pods() before asserting.

@plizmol (Contributor, Author) replied:

Good catch, thanks
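With the re-evaluation in place, the check reads roughly as follows. This is a sketch reconstructed from the diff context; the exact log line and assertion message may differ from the merged code:

# Verify that the pod started
pods = self.__get_pods()
if not pods:
    # The Job was accepted by the API server, but the cluster may not have
    # created its pods yet: wait once, then look again before asserting.
    time.sleep(self.pod_creation_wait_interal)
    pods = self.__get_pods()

assert len(pods) > 0, "No pod scheduled by " + self.uu_name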

@@ -213,10 +213,15 @@ def poll_interval(self):
         """How often to poll Kubernetes for job status, in seconds."""
         return self.__DEFAULT_POLL_INTERVAL
 
+    @property
+    def pod_creation_wait_interal(self):
+        """Delay for initial pod creation for just submitted job"""
Member:

Please use a constant, similar to __DEFAULT_POLL_INTERVAL; also, please describe the unit in the docstring so it is easier to follow.
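Addressed that way, the property might look like this. A sketch only: the constant name and the default of 5 seconds are assumptions, not quoted from the merged code:

class KubernetesJobTask(luigi.Task):  # existing class, abbreviated
    __DEFAULT_POD_CREATION_INTERVAL = 5  # assumed default, in seconds

    @property
    def pod_creation_wait_interal(self):
        """Delay, in seconds, between submitting the job and polling for its pods."""
        return self.__DEFAULT_POD_CREATION_INTERVAL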

@plizmol requested a review from honnix on November 1, 2019 at 08:19.
@plizmol (Contributor, Author) commented Nov 4, 2019

@honnix @dlstadther I believe your comments are resolved; could you please review again?

@dlstadther (Collaborator) left a review:

Sorry about the lack of approval. I think I misunderstood the first time around and would like some clarification and verification.

     def __track_job(self):
         """Poll job status while active"""
         while not self.__verify_job_has_started():
-            time.sleep(self.__DEFAULT_POLL_INTERVAL)
+            time.sleep(self.poll_interval)
Collaborator:

I think we might have steered you wrong... Or maybe I misunderstood the code the first time.

I'm looking at this more closely and have noticed that now, on each iteration of this while not self.__verify_job_has_started() loop, we'll wait both pod_creation_wait_interval seconds AND poll_interval seconds.

This doesn't seem desirable. It also isn't clear from the user's perspective for two intervals to both affect the same wait time.

Am I missing something?

@plizmol (Contributor, Author) replied:

@dlstadther
Here's how I see it:
The time.sleep(self.pod_creation_wait_interal) call handles the case where we have submitted the Job to the Kubernetes API, but the cluster state has not yet changed to reflect it, so no pods exist yet. Without the wait, __get_pods() returns an empty list and the whole Luigi task fails, whereas a couple of seconds later the cluster state would have caught up with our changes and the actual Kubernetes Job would have run, unsupervised by Luigi.
The time.sleep(self.poll_interval) call, by contrast, covers the case where the pod is already created but the containers within it are not yet ready. So in theory we won't wait on both intervals: once pods exist, the code under the if clause won't execute.

Member:

I think I understand the explanation. Most likely this newly added sleep will only happen once: either afterwards we see pods (and will keep seeing them while waiting for the containers to reach the running state), or we won't, and the assertion will fire.
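A condensed sketch of the combined flow (abbreviated from the diff hunks above, with surrounding checks elided) makes that reading concrete:

# Methods of KubernetesJobTask, abbreviated.
def __track_job(self):
    """Poll job status while active"""
    while not self.__verify_job_has_started():
        # Regular polling cadence while the containers come up.
        time.sleep(self.poll_interval)

def __verify_job_has_started(self):
    # ...job-level checks elided...
    pods = self.__get_pods()
    if not pods:
        # Fires at most once in practice: only while no pods exist yet.
        # On the re-check we either see pods, or the assert fails the task.
        time.sleep(self.pod_creation_wait_interal)
        pods = self.__get_pods()
    assert len(pods) > 0, "No pod scheduled by " + self.uu_name
    # ...container state checks elided; returns True once running...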

@plizmol requested a review from dlstadther on November 5, 2019 at 20:02.
@plizmol (Contributor, Author) commented Nov 10, 2019

@honnix @dlstadther @pisymbol ping

@honnix (Member) commented Nov 15, 2019

LGTM. It would be nicer if we could separate waiting for the pod to show up from waiting for the container to be running.

@plizmol (Contributor, Author) commented Nov 19, 2019

> LGTM. It would be nicer if we could separate waiting for the pod to show up from waiting for the container to be running.

@honnix Sorry, could you please clarify: is this a change request or just a comment? Also, what are the odds that this PR will be merged anytime soon?

@honnix (Member) commented Nov 20, 2019

@plizmol It's more of a comment. I'm OK with this change. What do you think, @dlstadther?

@dlstadther (Collaborator) commented:
I'm OK with this. If someone doesn't like it in the future, they can submit another PR to separate the delays.

honnix merged commit 262f2bd into spotify:master on Nov 20, 2019.
Successfully merging this pull request may close these issues.

KubernetesJobTask "No pod scheduled" error