Added a submissionID to status and use it to group pod of the same run #446
Conversation
@liyinan926 will give it a try shortly.
@@ -49,7 +55,7 @@ func (s *sparkPodEventHandler) onPodUpdated(old, updated interface{}) {
	if updatedPod.ResourceVersion == oldPod.ResourceVersion {
		return
	}
-	glog.V(2).Infof("Pod %s updated in namespace %s.", updatedPod.GetObjectMeta().GetName(), updatedPod.GetObjectMeta().GetNamespace())
+	glog.V(2).Infof("Pod %s updated in namespace %s.", updatedPod.GetName(), updatedPod.GetNamespace())
In the past these returned empty strings, which is why we used the GetObjectMeta calls... let's keep an eye on this, although it should be ok.
if submissionID, exists := pod.Labels[config.SubmissionIDLabel]; exists {
	app, err := s.applicationLister.SparkApplications(pod.GetNamespace()).Get(appName)
	if err != nil || app.Status.SubmissionID != submissionID {
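For context, the check above compares the pod's submission ID label with the submission ID recorded in the application status. A minimal sketch of how that label can be used to group the pods of a single run with a plain client-go lister; the label key below is a placeholder, the operator has its own constant (config.SubmissionIDLabel):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// Placeholder label key; the operator defines its own constant for this.
const submissionIDLabel = "example.com/submission-id"

// podsOfRun lists only the pods whose submission ID label matches the ID
// recorded for the current run, which is what groups pods per run.
func podsOfRun(lister corelisters.PodLister, namespace, submissionID string) ([]*corev1.Pod, error) {
	selector := labels.SelectorFromSet(labels.Set{submissionIDLabel: submissionID})
	return lister.Pods(namespace).List(selector)
}

func main() {}
```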
Recently we ran into an issue where pods got stuck in Terminating status (several people have reported similar situations on the K8s project). The question is whether the operator should be responsible for cleaning up failed resources in the background. For now we can ignore this and keep submitting, but we should think about it, because the operator at the end of the day manages things for you; if we have to force-delete stuff manually, then it's not that handy. On the other hand, you most likely want graceful shutdowns, especially with Spark and long-running streaming jobs, so that all data is flushed before something is restarted, so probably you want to wait for the shutdown to happen.
I think essentially we want to wait for the driver pod to be completely gone. This is actually handled in validateSparkResourceDeletion under the PendingRerun state. If validateSparkResourceDeletion returns false, the state will remain PendingRerun and the next resync loop will check again.
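To make that flow concrete, here is a minimal, self-contained sketch of the loop described above; validateSparkResourceDeletion is the function name from this conversation, while the types and other names are illustrative rather than the operator's actual code:

```go
package main

// appState is a stand-in for the operator's application state enum.
type appState string

const (
	pendingRerun appState = "PENDING_RERUN"
	submitted    appState = "SUBMITTED"
)

// syncPendingRerun mirrors the behavior described above: while the app is in
// PendingRerun, re-submission only happens once validateSparkResourceDeletion
// reports that the previous run's driver pod (and related resources) are gone.
// Otherwise the state is left unchanged and the next resync loop re-checks.
func syncPendingRerun(state appState, validateSparkResourceDeletion func() bool, submit func()) appState {
	if state != pendingRerun {
		return state
	}
	if !validateSparkResourceDeletion() {
		return pendingRerun // old resources still around; try again on the next resync
	}
	submit()
	return submitted
}

func main() {}
```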
That means validateSparkResourceDeletion will always return a stuck pod. Should we add a configurable policy to handle this, like retrying a force delete after N time passes, or should the user deal with this manually? Maybe it's better to keep the pod around so the problem can be investigated, at the expense of not making progress.
Going to run the PR and report back.
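A rough sketch of what the configurable force-delete policy floated above could look like, using the plain client-go pod client (newer client-go Delete signature; the timeout and how stuckSince would be tracked are assumptions, not anything the operator currently does):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeleteIfStuck force-deletes a pod that has been terminating for longer
// than the configured timeout by requesting deletion with a zero grace period.
func forceDeleteIfStuck(client kubernetes.Interface, namespace, name string, stuckSince time.Time, timeout time.Duration) error {
	if time.Since(stuckSince) < timeout {
		return nil // still within the window; keep waiting for a graceful shutdown
	}
	zero := int64(0)
	return client.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero, // skip graceful termination for the stuck pod
	})
}

func main() {}
```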
Well, a pod has a terminationGracePeriod after which it will be sent a SIGKILL. So once deletion of the pod is requested, the pod should be gone after a certain amount of time.
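For reference, a small sketch of where that grace period lives on a pod spec (the container name and image are placeholders):

```go
package main

import corev1 "k8s.io/api/core/v1"

// driverPodSpec shows the terminationGracePeriodSeconds field: once deletion is
// requested, the kubelet sends SIGTERM, waits this many seconds, then SIGKILLs
// whatever is still running.
func driverPodSpec() corev1.PodSpec {
	grace := int64(30)
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Containers: []corev1.Container{{
			Name:  "spark-kubernetes-driver", // placeholder
			Image: "spark:latest",            // placeholder
		}},
	}
}

func main() {}
```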
@liyinan926 I ran into this (although the app is restarted); at the end it never reaches
while the app is running, the status was never updated:
I just tried 3 updates in a row.
It seems that the Invalidating state clears the driver, thus
The invalidation is not clearing driver info. That
Ok my mistake
That is the critical part of the log (the rest of the log is in the other link above; this is just before the final submit):
The issue, I think, is that
If you check the logs, the pod has started; also
That's weird. It means the pod listing result doesn't include the driver pod.
Might be a transient error; it does not seem to be deterministic. I am trying to reproduce it again.
I just hit another one: same loop pattern, but not the lister issue, so it seems to be a bug. The state gets updated before the final pod is run, and so it gets stuck due to the previous updates, since failed is a final state. Yet the submit works as expected and the app runs and completes. I deleted my CRD and ran:
Whole log here: https://gist.github.com/skonto/6fefa215b236cb83abd5cce68ace627a. It gets stuck, while the pods are running fine at the end:
Again, it's weird. It means the application was in a state that should have an existing driver pod, but for some reason the driver pod didn't show up in the pod listing result.
I believe this is a different case because after the app update there is no
The most interesting part of the logs is as follows. There was an update to the app that caused its state to be changed to
Ok @liyinan926, it is either the stale cache or the fact that for parallel updates there is no "thread" safety in the operator's logic.
@skonto this is a known limitation of the informer and the caching mechanism. There's always some amount of delay because of the async nature. Most of the time this is not an issue, as user updates are not expected to be super frequent.
I am not sure about this, because things run in parallel and the last part about the SUBMITTED state is ambiguous. If you get an update while the submission is happening on another thread, is the code 100% safe (even if the cache was working fine)? I suspect not, according to my description above. Anyway, we can always merge and re-open if something fails.
Actually, update events are put into a queue and processed sequentially. However, because the cache is async in nature, a subsequent update may see a state that is stale. In this particular case, the second update was trying to change the app state from
There are workers at the controller level that dequeue items; these run in parallel, no?
Oh, you are right. I forgot that there are multiple worker threads.
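To show both points at once, here is a sketch of the standard client-go controller pattern under discussion: a single rate-limited work queue fed by the informer event handlers, drained by several worker goroutines. The queue guarantees a given key is never processed by two workers at the same time, but different keys are handled in parallel and each worker reads from the (possibly stale) lister cache. Names here are generic, not the operator's:

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// runWorkers starts the given number of worker goroutines, each pulling keys
// off the shared queue and calling sync; failed keys are re-queued with backoff.
func runWorkers(queue workqueue.RateLimitingInterface, workers int, stopCh <-chan struct{}, sync func(key string) error) {
	worker := func() {
		for {
			item, quit := queue.Get()
			if quit {
				return
			}
			key := item.(string)
			if err := sync(key); err != nil {
				queue.AddRateLimited(key) // retry with rate limiting
			} else {
				queue.Forget(key) // clear the rate-limit history on success
			}
			queue.Done(item)
		}
	}
	for i := 0; i < workers; i++ {
		go wait.Until(worker, time.Second, stopCh)
	}
}

func main() {}
```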
@liyinan926 are you going to work on these corner cases in another PR?
By corner cases, do you mean handling reads from the stale cache? If yes, the best possible solution is to always read from the API server. But that risks putting much more pressure on the API server, which is what the caching mechanism tries to alleviate.
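To make the trade-off concrete, a sketch using pods (so the types are the stock client-go ones rather than the operator's generated CRD client, and the Get signature is the newer client-go one): the lister read is served from the informer cache and can lag behind; the clientset read always hits the API server and is fresh, at the cost of extra load:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getPodCached reads from the local informer cache: cheap, possibly stale.
func getPodCached(lister corelisters.PodLister, ns, name string) (*corev1.Pod, error) {
	return lister.Pods(ns).Get(name)
}

// getPodDirect reads straight from the API server: always fresh, extra load.
func getPodDirect(client kubernetes.Interface, ns, name string) (*corev1.Pod, error) {
	return client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
}

func main() {}
```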
@liyinan926 No, I mean the cases where things fail because of the parallel execution, like the ones above.
@skonto