
Added a submissionID to status and use it to group pods of the same run #446

Merged: 1 commit into kubeflow:master on Mar 20, 2019

Conversation

liyinan926 (Collaborator) opened this pull request

skonto (Contributor) commented Mar 19, 2019

@liyinan926 I will give it a try shortly.

@@ -49,7 +55,7 @@ func (s *sparkPodEventHandler) onPodUpdated(old, updated interface{}) {
 	if updatedPod.ResourceVersion == oldPod.ResourceVersion {
 		return
 	}
-	glog.V(2).Infof("Pod %s updated in namespace %s.", updatedPod.GetObjectMeta().GetName(), updatedPod.GetObjectMeta().GetNamespace())
+	glog.V(2).Infof("Pod %s updated in namespace %s.", updatedPod.GetName(), updatedPod.GetNamespace())
skonto (Contributor) commented Mar 19, 2019

In the past these returned empty strings, which is why we used the GetObjectMeta() calls... let's keep an eye on this, although it should be OK.


if submissionID, exists := pod.Labels[config.SubmissionIDLabel]; exists {
	app, err := s.applicationLister.SparkApplications(pod.GetNamespace()).Get(appName)
	if err != nil || app.Status.SubmissionID != submissionID {
skonto (Contributor) commented Mar 19, 2019

Recently we ran into an issue where pods got stuck in Terminating status (several people have reported similar situations on the K8s project). The question is whether the operator should be responsible for cleaning up failed resources in the background. For now we can ignore this and keep submitting, but we should think about it: at the end of the day the operator manages things for you, and if we have to force delete stuff manually then it's not that handy. On the other hand, you most likely want graceful shutdowns, especially with Spark and long-running streaming jobs, so that all data is flushed before something is restarted; in that case you probably want to wait for the shutdown to happen.

liyinan926 (Collaborator, Author)

I think essentially we want to wait for the driver pod to be completely gone. This is actually handled in validateSparkResourceDeletion under the PendingRerun state. If validateSparkResourceDeletion returns false, the state will remain PendingRerun and the next resync loop will check again.
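
For readers following the discussion, here is a minimal, self-contained sketch of the recheck behaviour described here (stand-in types and names, not the operator's actual controller code):

package main

import "fmt"

// Stand-in types for illustration; the real operator works on
// v1beta1.SparkApplicationStatus inside its resync loop.
type appState string

const (
	pendingRerun   appState = "PENDING_RERUN"
	submittedState appState = "SUBMITTED"
)

type sparkApp struct {
	state         appState
	resourcesGone bool // stand-in for what validateSparkResourceDeletion reports
}

// syncPendingRerun mirrors the behaviour described above: stay in
// PendingRerun until the old driver pod and UI service are fully gone,
// then move on to a new submission.
func syncPendingRerun(app *sparkApp) {
	if !app.resourcesGone {
		// Old resources still exist (e.g. a pod stuck in Terminating):
		// keep the state unchanged and let the next resync check again.
		return
	}
	app.state = submittedState // a real controller would run spark-submit here
}

func main() {
	app := &sparkApp{state: pendingRerun}
	syncPendingRerun(app)
	fmt.Println(app.state) // still PENDING_RERUN: driver pod not gone yet
	app.resourcesGone = true
	syncPendingRerun(app)
	fmt.Println(app.state) // SUBMITTED
}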

skonto (Contributor) commented Mar 19, 2019

That means validateSparkResourceDeletion will keep returning false because of the stuck pod. Should we add a configurable policy to handle this, e.g. retry with a force delete after some time has passed, or should the user deal with it manually? Maybe it's better to keep the pod around so the problem can be investigated, at the expense of not making progress.
Going to run the PR and report back.

liyinan926 (Collaborator, Author)

Well, a pod has a terminationGracePeriod after which it will be sent a SIGKILL. So once deletion of the pod is requested, the pod should be gone after a certain amount of time.
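
On the force-delete question raised above, a small sketch of the two options, assuming a recent client-go (the Delete call here uses the context-taking signature of client-go 0.18+; 2019-era versions took a pointer to DeleteOptions instead). This is an illustration, not operator code:

package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteDriverPod requests deletion of a driver pod. With the default options
// the kubelet sends SIGTERM and, after terminationGracePeriodSeconds, SIGKILL,
// so the pod disappears after a bounded amount of time. With a zero grace
// period the pod object is removed immediately (the kubelet may still be
// cleaning up containers), skipping graceful shutdown, which is exactly what
// you may not want for long-running streaming jobs.
func deleteDriverPod(client kubernetes.Interface, ns, name string, force bool) error {
	opts := metav1.DeleteOptions{}
	if force {
		zero := int64(0)
		opts.GracePeriodSeconds = &zero
	}
	return client.CoreV1().Pods(ns).Delete(context.TODO(), name, opts)
}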

skonto (Contributor) commented Mar 19, 2019

@liyinan926 I ran into this (the app is restarted, but at the end it never reaches RunningState):

...
  Spark Version:  2.4.0
  Type:           Scala
Status:
  Application State:
    State:  SUBMITTED
  Driver Info:
    Web UI Port:                 32292
    Web UI Service Name:         spark-pi-ui-svc
  Execution Attempts:            1
  Last Submission Attempt Time:  2019-03-19T20:01:21Z
  Submission Attempts:           1
  Submission ID:                 spark-pi-1553025678548580441
  Termination Time:              <nil>
Events:
  Type     Reason                               Age                From            Message
  ----     ------                               ----               ----            -------
  Normal   SparkApplicationAdded                85s                spark-operator  SparkApplication spark-pi was added, enqueuing it for submission
  Normal   SparkExecutorPending                 68s (x2 over 68s)  spark-operator  Executor spark-pi-1553025617735-exec-1 is pending
  Normal   SparkExecutorPending                 67s (x2 over 67s)  spark-operator  Executor spark-pi-1553025617735-exec-2 is pending
  Normal   SparkExecutorRunning                 63s                spark-operator  Executor spark-pi-1553025617735-exec-1 is running
  Normal   SparkExecutorRunning                 62s                spark-operator  Executor spark-pi-1553025617735-exec-2 is running
  Normal   SparkDriverRunning                   46s (x2 over 76s)  spark-operator  Driver spark-pi-driver is running
  Normal   SparkExecutorPending                 42s (x2 over 43s)  spark-operator  Executor spark-pi-1553025648754-exec-1 is pending
  Normal   SparkExecutorPending                 42s (x2 over 42s)  spark-operator  Executor spark-pi-1553025648754-exec-2 is pending
  Normal   SparkExecutorRunning                 37s                spark-operator  Executor spark-pi-1553025648754-exec-2 is running
  Normal   SparkExecutorRunning                 37s                spark-operator  Executor spark-pi-1553025648754-exec-1 is running
  Normal   SparkApplicationSpecUpdateProcessed  22s (x3 over 54s)  spark-operator  Successfully processed spec update for SparkApplication spark-pi
  Warning  SparkApplicationPendingRerun         20s (x3 over 54s)  spark-operator  SparkApplication spark-pi is pending rerun
  Normal   SparkApplicationSubmitted            17s (x4 over 82s)  spark-operator  SparkApplication spark-pi was submitted successfully

while the app is running, its status was never updated:

kubectl get pods -n spark

spark-pi-1553025682813-exec-1                                     1/1       Running       0          1m
spark-pi-1553025682813-exec-2                                     1/1       Running       0          1m

spark-pi-driver                                                   1/1       Running       0          1m

I just tried 3 updates in a row.
Log here: https://gist.github.com/skonto/ac95bd761535f46a107227a557a3c9f8

skonto (Contributor) commented Mar 19, 2019

It seems that the Invalidating state clears the driver info, which is why the driver pod not found message is printed, and the final pod updates for the new running pods do not trigger app updates for the same reason.

liyinan926 (Collaborator, Author)

It seems that the Invalidating state clears the driver info, which is why the driver pod not found message is printed, and the final pod updates for the new running pods do not trigger app updates for the same reason.

The invalidation is not clearing the driver info. That driver pod not found log is benign and just means the driver pod has not been created yet. This can happen if the sync is triggered after the application state is set to Submitted but before the driver pod has started running.

skonto (Contributor) commented Mar 19, 2019

OK, my mistake. PendingRerun does:

 else if status.AppState.State == v1beta1.PendingRerunState {
		status.SparkApplicationID = ""
		status.DriverInfo = v1beta1.DriverInfo{}
		status.ExecutorState = nil

That is the critical part of the log (the rest of the log is in the link above; this is just before the final submit):

I0319 20:01:21.419152      12 controller.go:699] Deleting pod with name spark-pi-driver in namespace spark
I0319 20:01:21.419877      12 controller.go:205] SparkApplication spark/spark-pi was updated, enqueueing it
I0319 20:01:21.427523      12 spark_pod_eventhandler.go:58] Pod spark-pi-driver updated in namespace spark.
I0319 20:01:21.427552      12 spark_pod_eventhandler.go:95] Enqueuing SparkApplication spark/spark-pi for app update processing.
I0319 20:01:21.433772      12 spark_pod_eventhandler.go:77] Pod spark-pi-driver deleted in namespace spark.
I0319 20:01:21.433804      12 spark_pod_eventhandler.go:95] Enqueuing SparkApplication spark/spark-pi for app update processing.
I0319 20:01:21.433880      12 controller.go:708] Deleting Spark UI Service spark-pi-ui-svc in namespace spark
I0319 20:01:21.466076      12 controller.go:522] Trying to update SparkApplication spark/spark-pi, from: [{ spark-pi-1553025647387553090 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31774    spark-pi-driver} {INVALIDATING } map[] 0 0}] to [{ spark-pi-1553025647387553090 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31774    spark-pi-driver} {PENDING_RERUN } map[] 0 0}]
I0319 20:01:21.483097      12 sparkapp_metrics.go:144] Exporting metrics for spark-pi; old status: { spark-pi-1553025647387553090 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31774    spark-pi-driver} {INVALIDATING } map[] 0 0} new status: { spark-pi-1553025647387553090 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31774    spark-pi-driver} {PENDING_RERUN } map[] 0 0}
I0319 20:01:21.483140      12 controller.go:252] Ending processing key: "spark/spark-pi"
I0319 20:01:21.483175      12 controller.go:245] Starting processing key: "spark/spark-pi"
W0319 20:01:21.483304      12 controller.go:334] driver not found for SparkApplication: spark/spark-pi
I0319 20:01:21.483327      12 controller.go:522] Trying to update SparkApplication spark/spark-pi, from: [{ spark-pi-1553025678548580441 2019-03-19 20:01:21 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 32292    } {SUBMITTED } map[] 1 1}] to [{ spark-pi-1553025678548580441 2019-03-19 20:01:21 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 32292    } {SUBMITTED } map[] 1 1}]
I0319 20:01:21.483824      12 controller.go:205] SparkApplication spark/spark-pi was updated, enqueueing it
I0319 20:01:21.499039      12 sparkapp_metrics.go:144] Exporting metrics for spark-pi; old status: { spark-pi-1553025678548580441 2019-03-19 20:01:21 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 32292    } {SUBMITTED } map[] 1 1} new status: { spark-pi-1553025678548580441 2019-03-19 20:01:21 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 32292    } {SUBMITTED } map[] 1 1}
I0319 20:01:21.499079      12 controller.go:252] Ending processing key: "spark/spark-pi"
I0319 20:01:21.499106      12 controller.go:245] Starting processing key: "spark/spark-pi"
I0319 20:01:21.499543      12 controller.go:205] SparkApplication spark/spark-pi was updated, enqueueing it
I0319 20:01:21.505490      12 monitoring_config.go:54] Overriding the default Prometheus configuration with config file /etc/metrics/conf/prometheus.yaml in the Spark image.
I0319 20:01:21.505584      12 submission.go:63] spark-submit arguments: [/opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master k8s://https://172.30.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=spark --conf spark.app.name=spark-pi --conf spark.kubernetes.driver.pod.name=spark-pi-driver --conf spark.kubernetes.container.image=lightbend/spark:2.1.0-OpenShift-2.4.0-rh --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.metrics.namespace=spark.spark-pi --conf spark.metrics.conf=/etc/metrics/conf/metrics.properties --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=spark-pi-1553025681505520024 --conf spark.kubernetes.driver.pod.name=spark-pi-driver --conf spark.driver.cores=1.000000 --conf spark.driver.memory=1g --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa --conf spark.kubernetes.driver.label.version=2.4.0 --conf spark.kubernetes.driver.annotation.prometheus.io/port=9100 --conf spark.kubernetes.driver.annotation.prometheus.io/path=/metrics --conf spark.kubernetes.driver.annotation.xxxxxxx.lightbend.io/port=9100 --conf spark.kubernetes.driver.annotation.xxxxxxx.lightbend.io/scrape=true --conf spark.kubernetes.driver.annotation.prometheus.io/scrape=true --conf spark.driver.extraJavaOptions=-javaagent:/prometheus/jmx_prometheus_javaagent-0.3.1.jar=9100:/etc/metrics/conf/prometheus.yaml --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-pi --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=spark-pi-1553025681505520024 --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=1g --conf spark.kubernetes.executor.label.version=2.4.0 --conf spark.kubernetes.executor.annotation.prometheus.io/scrape=true --conf spark.kubernetes.executor.annotation.prometheus.io/port=9100 --conf spark.kubernetes.executor.annotation.prometheus.io/path=/metrics --conf spark.kubernetes.executor.annotation.xxxxxxx.lightbend.io/port=9100 --conf spark.kubernetes.executor.annotation.xxxxxxx.lightbend.io/scrape=true --conf spark.executor.extraJavaOptions=-javaagent:/prometheus/jmx_prometheus_javaagent-0.3.1.jar=9100:/etc/metrics/conf/prometheus.yaml local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar 10000]

The issue, I think, is that currentDriverState is always nil after that.
The same thing happens in a loop: updateStatusAndExportMetrics triggers a new app update, but the app never gets a new status in getAndUpdateAppState because currentDriverState is always nil. That means it gets stuck at SUBMITTED. Does the lister work as expected in getAndUpdateAppState?

liyinan926 (Collaborator, Author)

currentDriverState is a local variable in getAndUpdateAppState, and is set based on the pod listing result. If the driver pod has started running, it will show up in the list result, and currentDriverState will be set accordingly.
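
As an illustration of what is being described (not the actual getAndUpdateAppState; the real function's details may differ), a minimal sketch:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// driverState scans the pods returned by the lister for an application and
// reports the driver pod's phase if the driver shows up in the result. If it
// does not (pod not created yet, or the informer cache is lagging behind),
// the caller keeps the previous application state.
func driverState(pods []*corev1.Pod, driverPodName string) (corev1.PodPhase, bool) {
	for _, pod := range pods {
		if pod.Name == driverPodName {
			return pod.Status.Phase, true
		}
	}
	return "", false
}

If this kind of lookup keeps missing the driver even though kubectl shows it Running, the listing (or the cache behind it) is the thing to suspect, which is what the log below points at.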

skonto (Contributor) commented Mar 19, 2019

If you check the logs, the pod has started (kubectl shows that too), but driver not found for SparkApplication is printed all the time, meaning currentDriverState is nil. It never changes and the state remains SUBMITTED. Whole log here: https://gist.github.com/skonto/ae81d4695ff4237293fb42f11fcc42c7

liyinan926 (Collaborator, Author)

That's weird. It means the pod listing result doesn't include the driver pod.

skonto (Contributor) commented Mar 19, 2019

Might be a transient error; it does not seem to be deterministic. I am trying to reproduce it again.
We can always merge and re-open if it comes up. @yuchaoran2011 is planning to add some tests to exercise the lifecycle, so we can hopefully catch things like that.

skonto (Contributor) commented Mar 20, 2019

I just hit another one, same loop pattern but not the lister issue, so it seems to be a bug: the state gets updated before the final pod is running, and it gets stuck because of the previous updates, since FAILED is a final state. Yet the submit works as expected and the app runs and completes.

I deleted my CRD and ran:

kubectl  apply -f pi-metrics.yaml
sleep 10
kubectl  apply -f pi-metrics2.yaml
sleep 10
kubectl  apply -f pi-metrics3.yaml
sleep 10
kubectl  apply -f pi-metrics.yaml

The application ended up with:
  Application State:
    Error Message:  Driver Pod not found
    State:          FAILED
  Driver Info:
    Pod Name:                    spark-pi-driver
    Web UI Port:                 31660
    Web UI Service Name:         spark-pi-ui-svc
  Execution Attempts:            1
  Last Submission Attempt Time:  2019-03-19T23:56:18Z
  Spark Application Id:          spark-f3e575fefe4444939ca9bf66a69bb464
  Submission Attempts:           1
  Submission ID:                 spark-pi-1553039775702310039
  Termination Time:              2019-03-19T23:56:26Z
Events:
  Type     Reason                               Age                 From            Message
  ----     ------                               ----                ----            -------
  Normal   SparkApplicationAdded                2m1s                spark-operator  SparkApplication spark-pi was added, enqueuing it for submission
  Normal   SparkDriverRunning                   90s (x3 over 112s)  spark-operator  Driver spark-pi-driver is running
  Normal   SparkApplicationSpecUpdateProcessed  88s (x3 over 110s)  spark-operator  Successfully processed spec update for SparkApplication spark-pi
  Warning  SparkApplicationPendingRerun         88s (x3 over 110s)  spark-operator  SparkApplication spark-pi is pending rerun
  Normal   SparkApplicationSubmitted            85s (x4 over 118s)  spark-operator  SparkApplication spark-pi was submitted successfully
  Warning  SparkApplicationFailed               85s                 spark-operator  SparkApplication spark-pi failed: Driver Pod not found

Whole log here: https://gist.github.com/skonto/6fefa215b236cb83abd5cce68ace627a

It gets stuck at FAILED.

Pods are running fine at the end:

kubectl get pods -n spark
NAME                                                              READY     STATUS        RESTARTS   AGE

spark-operator-ns-fdp-sparkoperator-695455567f-jhghc              1/1       Running       0          21m
spark-pi-1553039787910-exec-1                                     1/1       Running       0          59s
spark-pi-1553039787910-exec-2                                     1/1       Running       0          59s
spark-pi-driver                                                   1/1       Running       0          1m

liyinan926 (Collaborator, Author)

Again, it's weird. It means the application was in a state that should have an existing driver pod, but for some reason the driver pod didn't show up in the pod listing result.

skonto (Contributor) commented Mar 20, 2019

I believe this is a different case, because after the app update there is no driver not found for SparkApplication printed all over the place. That means the previous currentDriverState is kept around, and a failed pod update caused the state change to FAILING Driver Pod not found and then to the final FAILED Driver Pod not found. This happens while a submission is in progress, and the previous state before that is SUBMITTED.
There are two consecutive updates here that move the state to the permanent FAILED one.
I think there is no pod at that moment because the new pod hasn't started yet, the submission just returned successfully, and only spark-pi-ui-svc 31007 exists.

I0319 23:56:29.454104      11 sparkapp_metrics.go:144] Exporting metrics for spark-pi; old status: { spark-pi-1553039775702310039 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {PENDING_RERUN } map[] 0 0} new status: { spark-pi-1553039786574481132 2019-03-19 23:56:29.396916509 +0000 UTC m=+1186.542531986 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31007    } {SUBMITTED } map[] 1 1}
I0319 23:56:29.454153      11 controller.go:252] Ending processing key: "spark/spark-pi"
I0319 23:56:29.454190      11 controller.go:245] Starting processing key: "spark/spark-pi"
I0319 23:56:29.454232      11 controller.go:522] Trying to update SparkApplication spark/spark-pi, from: [{spark-f3e575fefe4444939ca9bf66a69bb464 spark-pi-1553039775702310039 2019-03-19 23:56:18 +0000 UTC 2019-03-19 23:56:26 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {FAILING Driver Pod not found} map[] 1 1}] to [{spark-f3e575fefe4444939ca9bf66a69bb464 spark-pi-1553039775702310039 2019-03-19 23:56:18 +0000 UTC 2019-03-19 23:56:26 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {FAILED Driver Pod not found} map[] 1 1}]
I0319 23:56:29.454690      11 event.go:221] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"spark", Name:"spark-pi", UID:"87dd2218-4aa2-11e9-b674-0288a0a0c3ec", APIVersion:"sparkoperator.k8s.io/v1beta1", ResourceVersion:"2641608", FieldPath:""}): type: 'Warning' reason: 'SparkApplicationFailed' SparkApplication spark-pi failed: Driver Pod not found
I0319 23:56:29.455003      11 controller.go:205] SparkApplication spark/spark-pi was updated, enqueueing it
I0319 23:56:29.494486      11 controller.go:205] SparkApplication spark/spark-pi was updated, enqueueing it

liyinan926 (Collaborator, Author) commented Mar 20, 2019

The most interesting part of the logs is as follows. There was an update to the app that caused its state to be changed to INVALIDATING, and then to PENDING_RERUN (the line starting with I0319 23:56:26.556801). However, in the interim, there was also an attempt to change the application state from RUNNING to FAILING. This latter state change was likely triggered by a pod update. The state changed to FAILING because the driver pod got deleted when the app was updated and invalidated, and this line https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/controller/sparkapplication/controller.go#L334 was triggered. It is clear that the reason the second update started from a RUNNING state is a stale cache. The operator uses an informer and a lister for getting and listing SparkApplication objects. The informer/lister maintains a cache internally. When the status of an application is updated on the API server, the informer receives the update and updates the cache asynchronously. Because there were several updates in a very short amount of time, there is a good chance that some update read a stale state of the app. This is an inherent limitation of the informer/lister (see the sketch after the log excerpt below).

I0319 23:56:26.542436      11 controller.go:522] Trying to update SparkApplication spark/spark-pi, from: [{spark-f3e575fefe4444939ca9bf66a69bb464 spark-pi-1553039775702310039 2019-03-19 23:56:18 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {INVALIDATING } map[] 1 1}] to [{ spark-pi-1553039775702310039 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {PENDING_RERUN } map[] 0 0}]
I0319 23:56:26.556801      11 sparkapp_metrics.go:144] Exporting metrics for spark-pi; old status: {spark-f3e575fefe4444939ca9bf66a69bb464 spark-pi-1553039775702310039 2019-03-19 23:56:18 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {INVALIDATING } map[] 1 1} new status: { spark-pi-1553039775702310039 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {PENDING_RERUN } map[] 0 0}
I0319 23:56:26.556841      11 controller.go:252] Ending processing key: "spark/spark-pi"
I0319 23:56:26.556871      11 controller.go:245] Starting processing key: "spark/spark-pi"
W0319 23:56:26.556953      11 controller.go:334] driver not found for SparkApplication: spark/spark-pi
I0319 23:56:26.556975      11 controller.go:522] Trying to update SparkApplication spark/spark-pi, from: [{spark-f3e575fefe4444939ca9bf66a69bb464 spark-pi-1553039775702310039 2019-03-19 23:56:18 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {spark-pi-ui-svc 31660    spark-pi-driver} {RUNNING } map[] 1 1}] to [{spark-f3e575fefe4444939ca9bf66a69bb464 spark-pi-1553039775702310039 2019-03-19 23:56:18 +0000 UTC 2019-03-19 23:56:26.556966857 +0000 UTC m=+1183.702582450 {spark-pi-ui-svc 31660    spark-pi-driver} {FAILING Driver Pod not found} map[] 1 1}]
I0319 23:56:26.558776      11 controller.go:205] SparkApplication spark/spark-pi was updated, enqueueing it
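
For context on the stale-cache explanation above, a minimal sketch of the informer/lister pattern (generic client-go pods example, not the operator's SparkApplication informer). The point is that the lister reads a local cache which is updated asynchronously from watch events, so a read issued right after a status update on the API server can still return the previous object:

package sketch

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// newCachedPodLister wires up a shared informer factory and returns a lister.
// The lister never talks to the API server directly; it reads the informer's
// in-memory store, which trails the API server by however long it takes for
// the watch event to arrive and be processed.
func newCachedPodLister(client kubernetes.Interface, stopCh <-chan struct{}) listersv1.PodLister {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	lister := factory.Core().V1().Pods().Lister()

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// lister.Pods("spark").Get("spark-pi-driver") is served from this cache,
	// so two quick status updates in a row can make the second sync observe
	// the state from before the first update has landed in the cache.
	return lister
}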

skonto (Contributor) commented Mar 20, 2019

OK @liyinan926, so it is either the stale cache or the fact that there is no "thread" safety in the operator's logic for parallel updates.

liyinan926 (Collaborator, Author) commented Mar 20, 2019

@skonto this is a known limitation of the informer and the caching mechanism. There's always some amount of delay because of the asynchronous nature of the cache. Most of the time this is not an issue, as user updates are not expected to be very frequent.

skonto (Contributor) commented Mar 20, 2019

It is clear that the reason the second update started from a RUNNING state is a stale cache.

I am not sure about this, because things run in parallel and the last part about the SUBMITTED state is ambiguous. If you get an update on one thread while the submission is happening on another, is the code 100% safe (even if the cache were working fine)? I suspect not, according to my description above. Anyway, we can always merge and re-open if something fails.
One question though: updates will happen at this rate (it is not uncommon at all), so this cache is becoming a problem IMHO.
If you cannot get a reliable app status, then the CRD is a bit useless from that point of view.

liyinan926 (Collaborator, Author)

It is clear that the reason the second update started from a RUNNING state is a stale cache.

I am not sure about this, because things run in parallel and the last part about the SUBMITTED state is ambiguous. If you get an update on one thread while the submission is happening on another, is the code 100% safe (even if the cache were working fine)? Anyway, we can always merge and re-open if something fails.

Actually, update events are put into a queue and processed sequentially. However, because the cache is asynchronous in nature, a subsequent update may see a state that is stale. In this particular case, the second update was trying to change the app state from RUNNING to FAILED, as the log shows.

skonto (Contributor) commented Mar 20, 2019

There are workers at the controller level that dequeue items, and these run in parallel, no?
processNextItem -> syncSparkApplication is called in parallel from what I see.

liyinan926 (Collaborator, Author)

There are workers at the controller level that dequeue items, and these run in parallel, no?
processNextItem -> syncSparkApplication is called in parallel from what I see.

Oh, you are right. I forgot that there are multiple worker threads.
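
For reference, a sketch of the worker pattern being discussed (generic client-go workqueue usage, not the operator's controller code): several goroutines drain one queue, so syncs for different keys run concurrently, and every worker reads application state from the shared, asynchronously updated informer cache:

package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// runWorkers starts the requested number of workers, each dequeuing keys and
// invoking the sync function, until the queue is shut down and drained.
func runWorkers(queue workqueue.RateLimitingInterface, workers int, syncFn func(key string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				item, shutdown := queue.Get()
				if shutdown {
					return
				}
				syncFn(item.(string))
				queue.Done(item)
			}
		}()
	}
	wg.Wait()
}

func main() {
	// In the controller, keys like "spark/spark-pi" are added by both the
	// SparkApplication event handlers and the pod event handlers.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	queue.Add("spark/spark-pi")
	queue.Add("spark/another-app")
	queue.ShutDown() // a real controller keeps the queue open until the stop channel closes
	runWorkers(queue, 3, func(key string) { fmt.Println("syncing", key) })
}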

liyinan926 merged commit 1bbf0d0 into kubeflow:master on Mar 20, 2019
skonto (Contributor) commented Mar 20, 2019

@liyinan926 are you going to work on these corner cases in another PR?

liyinan926 (Collaborator, Author)

@liyinan926 are you going to work on these corner cases in another PR?

By corner cases, do you mean handling reads from the stale cache? If yes, the best possible solution is to always read from the API server. But that risks putting much bigger pressure on the API server, which is what the caching mechanism tries to alleviate.
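
A sketch of that trade-off, using pods as a stand-in for SparkApplication objects and assuming a recent client-go Get signature (illustration only):

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// Cached read: fast and cheap, but may return a state from before the most
// recent update if the informer has not processed the watch event yet.
func getFromCache(lister listersv1.PodLister, ns, name string) (*corev1.Pod, error) {
	return lister.Pods(ns).Get(name)
}

// Direct read: always reflects the API server's current state, at the cost of
// an extra request on every sync, which is the load the cache exists to avoid.
func getFromAPIServer(client kubernetes.Interface, ns, name string) (*corev1.Pod, error) {
	return client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
}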

skonto (Contributor) commented Mar 26, 2019

@liyinan926 No, I mean the cases where things fail because of the parallel execution, like above.
Imagine you have a job that keeps failing and generating events (another case)...
