Added a submissionID to status and use it to group pod of the same run #446
Conversation
@liyinan926 will give it a try shortly.
@@ -49,7 +55,7 @@ func (s *sparkPodEventHandler) onPodUpdated(old, updated interface{}) {
	if updatedPod.ResourceVersion == oldPod.ResourceVersion {
		return
	}
-	glog.V(2).Infof("Pod %s updated in namespace %s.", updatedPod.GetObjectMeta().GetName(), updatedPod.GetObjectMeta().GetNamespace())
+	glog.V(2).Infof("Pod %s updated in namespace %s.", updatedPod.GetName(), updatedPod.GetNamespace())
In the past these returned empty strings, which is why we used the GetObjectMeta calls... let's keep an eye on this, although it should be ok.
if submissionID, exists := pod.Labels[config.SubmissionIDLabel]; exists {
	app, err := s.applicationLister.SparkApplications(pod.GetNamespace()).Get(appName)
	if err != nil || app.Status.SubmissionID != submissionID {
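For context, the check above compares the pod's submission ID label with the submission ID recorded in the application status. A minimal sketch of how that label can be used to group the pods of a single run with a plain client-go lister; the label key below is a placeholder, the operator has its own constant (config.SubmissionIDLabel):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// Placeholder label key; the operator defines its own constant for this.
const submissionIDLabel = "example.com/submission-id"

// podsOfRun lists only the pods whose submission ID label matches the ID
// recorded for the current run, which is what groups pods per run.
func podsOfRun(lister corelisters.PodLister, namespace, submissionID string) ([]*corev1.Pod, error) {
	selector := labels.SelectorFromSet(labels.Set{submissionIDLabel: submissionID})
	return lister.Pods(namespace).List(selector)
}

func main() {}
```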
Recently we ran into an issue where pods got stuck in Terminating status (several people have reported similar situations on the K8s project). The question is whether the operator should be responsible for cleaning up failed resources in the background. For now we can ignore this and keep submitting, but we should think about it, because the operator at the end of the day manages things for you; if we have to force-delete stuff manually, then it's not that handy. On the other hand, you most likely want graceful shutdowns, especially with Spark and long-running streaming jobs, so that all data is flushed before something is restarted, so probably you want to wait for the shutdown to happen.
I think essentially we want to wait for the driver pod to be completely gone. This is actually handled in validateSparkResourceDeletion under the PendingRerun state. If validateSparkResourceDeletion returns false, the state will remain PendingRerun and the next resync loop will check again.
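To make that flow concrete, here is a minimal, self-contained sketch of the loop described above; validateSparkResourceDeletion is the function name from this conversation, while the types and other names are illustrative rather than the operator's actual code:

```go
package main

// appState is a stand-in for the operator's application state enum.
type appState string

const (
	pendingRerun appState = "PENDING_RERUN"
	submitted    appState = "SUBMITTED"
)

// syncPendingRerun mirrors the behavior described above: while the app is in
// PendingRerun, re-submission only happens once validateSparkResourceDeletion
// reports that the previous run's driver pod (and related resources) are gone.
// Otherwise the state is left unchanged and the next resync loop re-checks.
func syncPendingRerun(state appState, validateSparkResourceDeletion func() bool, submit func()) appState {
	if state != pendingRerun {
		return state
	}
	if !validateSparkResourceDeletion() {
		return pendingRerun // old resources still around; try again on the next resync
	}
	submit()
	return submitted
}

func main() {}
```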
That means validateSparkResourceDeletion will always return a stuck pod. Should we add a configurable policy to handle this, like retrying a force delete after N time passes, or should the user deal with this manually? Maybe it's better to keep the pod around so the problem can be investigated, at the expense of not making progress.
Going to run the PR and report back.
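A rough sketch of what the configurable force-delete policy floated above could look like, using the plain client-go pod client (newer client-go Delete signature; the timeout and how stuckSince would be tracked are assumptions, not anything the operator currently does):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeleteIfStuck force-deletes a pod that has been terminating for longer
// than the configured timeout by requesting deletion with a zero grace period.
func forceDeleteIfStuck(client kubernetes.Interface, namespace, name string, stuckSince time.Time, timeout time.Duration) error {
	if time.Since(stuckSince) < timeout {
		return nil // still within the window; keep waiting for a graceful shutdown
	}
	zero := int64(0)
	return client.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero, // skip graceful termination for the stuck pod
	})
}

func main() {}
```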
Well, a pod has a terminationGracePeriod after which it will be sent a SIGKILL. So once deletion of the pod is requested, the pod should be gone after a certain amount of time.
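For reference, a small sketch of where that grace period lives on a pod spec (the container name and image are placeholders):

```go
package main

import corev1 "k8s.io/api/core/v1"

// driverPodSpec shows the terminationGracePeriodSeconds field: once deletion is
// requested, the kubelet sends SIGTERM, waits this many seconds, then SIGKILLs
// whatever is still running.
func driverPodSpec() corev1.PodSpec {
	grace := int64(30)
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Containers: []corev1.Container{{
			Name:  "spark-kubernetes-driver", // placeholder
			Image: "spark:latest",            // placeholder
		}},
	}
}

func main() {}
```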
@liyinan926 I ran into this (although the app is restarted); at the end it never reaches
while the app is running, the status was never updated:
I just tried 3 updates in a row.
It seems that the Invalidating state clears the driver, thus
The invalidation is not clearing driver info. That
Ok my mistake
That is the critical part of the log (the rest of the log is in the other link above; this is just before the final submit):
The issue, I think, is that
If you check the logs, the pod has started; also
That's weird. It means the pod listing result doesn't include the driver pod.
Might be a transient error; it does not seem to be deterministic. I am trying to reproduce it again.
I just hit another one: same loop pattern, but not the lister issue, so it seems to be a bug. The state gets updated before the final pod is run, and so it gets stuck due to the previous updates, since failed is a final state. Yet the submit works as expected and the app runs and completes. I deleted my CRD and ran:
Whole log here: https://gist.github.com/skonto/6fefa215b236cb83abd5cce68ace627a. It gets stuck, while the pods are running fine at the end:
Again, it's weird. It means the application was in a state that should have an existing driver pod, but for some reason the driver pod didn't show up in the pod listing result.
I believe this is a different case because after the app update there is no
The most interesting part of the logs is as follows. There was an update to the app that caused its state to be changed to
Ok @liyinan926, it is either the stale cache or the fact that for parallel updates there is no "thread" safety in the operator's logic.
@skonto this is a known limitation of the informer and the caching mechanism. There's always some amount of delay because of the async nature. Most of the time this is not an issue, as user updates are not expected to be super frequent.
I am not sure about this, because things run in parallel and the last part about the SUBMITTED state is ambiguous. If you get an update while the submission is happening on another thread, is the code 100% safe (even if the cache was working fine)? I suspect not, according to my description above. Anyway, we can always merge and re-open if something fails.
Actually, update events are put into a queue and processed sequentially. However, because the cache is async in nature, a subsequent update may see a state that is stale. In this particular case, the second update was trying to change the app state from
There are workers at the controller level that dequeue items; these run in parallel, no?
Oh, you are right. I forgot that there are multiple worker threads.
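To show both points at once, here is a sketch of the standard client-go controller pattern under discussion: a single rate-limited work queue fed by the informer event handlers, drained by several worker goroutines. The queue guarantees a given key is never processed by two workers at the same time, but different keys are handled in parallel and each worker reads from the (possibly stale) lister cache. Names here are generic, not the operator's:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// runWorkers starts the given number of worker goroutines, each pulling keys
// off the shared queue and calling sync; failed keys are re-queued with backoff.
func runWorkers(queue workqueue.RateLimitingInterface, workers int, stopCh <-chan struct{}, sync func(key string) error) {
	worker := func() {
		for {
			item, quit := queue.Get()
			if quit {
				return
			}
			key := item.(string)
			if err := sync(key); err != nil {
				queue.AddRateLimited(key) // retry with rate limiting
			} else {
				queue.Forget(key) // clear the rate-limit history on success
			}
			queue.Done(item)
		}
	}
	for i := 0; i < workers; i++ {
		go wait.Until(worker, time.Second, stopCh)
	}
}

func main() {}
```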
@liyinan926 are you going to work on these corner cases in another PR?
By corner cases, do you mean handling reads from the stale cache? If yes, the best possible solution is to always read from the API server. But that risks putting much more pressure on the API server, which is what the caching mechanism tries to alleviate.
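To make the trade-off concrete, a sketch using pods (so the types are the stock client-go ones rather than the operator's generated CRD client, and the Get signature is the newer client-go one): the lister read is served from the informer cache and can lag behind; the clientset read always hits the API server and is fresh, at the cost of extra load:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getPodCached reads from the local informer cache: cheap, possibly stale.
func getPodCached(lister corelisters.PodLister, ns, name string) (*corev1.Pod, error) {
	return lister.Pods(ns).Get(name)
}

// getPodDirect reads straight from the API server: always fresh, extra load.
func getPodDirect(client kubernetes.Interface, ns, name string) (*corev1.Pod, error) {
	return client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
}

func main() {}
```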
@liyinan926 No, I mean the cases where things fail because of the parallel execution, like the ones above.
@skonto