[SPARK-2298] Encode stage attempt in SparkListener & UI. #1545

Closed

rxin wants to merge 7 commits into master from rxin/stage-attempt

Conversation


@rxin (Contributor) commented Jul 23, 2014

Simple way to reproduce this in the UI:

```scala
val f = new java.io.File("/tmp/test")
f.delete()
sc.parallelize(1 to 2, 2).map(x => (x,x )).repartition(3).mapPartitionsWithContext { case (context, iter) =>
  if (context.partitionId == 0) {
    val f = new java.io.File("/tmp/test")
    if (!f.exists) {
      f.mkdir()
      System.exit(0);
    }
  }
  iter
}.count()
```

@lianhuiwang (Contributor)

I think we can also add the jobId to the stage table, because the jobId is very useful when an application has many jobs: it makes it possible to tell each job's stages apart.
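
A rough sketch of that idea (not part of this PR), using stub event types rather than the real SparkListener API, so `JobStart` and its fields here are illustrative only: a listener can record which job each stage belongs to when the job starts, and the stage table can then look the jobId up per row.

```scala
import scala.collection.mutable.HashMap

// Stub standing in for the real job-start listener event; illustrative only.
case class JobStart(jobId: Int, stageIds: Seq[Int])

class StageToJobIndex {
  // stageId -> jobId, filled in as jobs are submitted
  val stageIdToJobId = new HashMap[Int, Int]

  def onJobStart(event: JobStart): Unit =
    event.stageIds.foreach(stageId => stageIdToJobId(stageId) = event.jobId)
}
```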

```diff
@@ -43,13 +43,16 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {
   // How many stages to remember
   val retainedStages = conf.getInt("spark.ui.retainedStages", DEFAULT_RETAINED_STAGES)

-  val activeStages = HashMap[Int, StageInfo]()
+  // Map from stageId to StageInfo
+  val activeStages = new HashMap[Int, StageInfo]
```
Why isn't this also indexed by stageId+attemptID?

Oh, because only one attempt will be active at once? If so, maybe add a comment describing that?
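
A minimal sketch of the two keying options under discussion, with a stub `StageInfo` standing in for the real Spark class:

```scala
import scala.collection.mutable.HashMap

// Stub standing in for org.apache.spark.scheduler.StageInfo.
case class StageInfo(stageId: Int, attemptId: Int, name: String)

// Keying by (stageId, attemptId) would distinguish attempts explicitly:
val activeStagesByAttempt = new HashMap[(Int, Int), StageInfo]

// The PR relies on the invariant that only one attempt of a stage is active
// at a time, so stageId alone suffices -- but that invariant is worth a comment:
val activeStages = new HashMap[Int, StageInfo]
```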

SparkQA commented Jul 29, 2014

QA tests have started for PR 1545. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17392/consoleFull


rxin commented Aug 18, 2014

I pushed a new version that merges cleanly with master.

SparkQA commented Aug 18, 2014

QA tests have started for PR 1545 at commit 4e5faa2.

  • This patch merges cleanly.

```diff
@@ -1029,6 +1033,7 @@ class DAGScheduler(
     case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
       // Mark the stage that the reducer was in as unrunnable
       val failedStage = stageIdToStage(task.stageId)
+      listenerBus.post(SparkListenerStageCompleted(failedStage.info))
```

Does it make sense to just call markStageAsFinished here (instead of the two lines above)? I just wonder if doing that will help avoid future bugs along this code path.
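
A hedged sketch of the kind of consolidation being suggested, with stub types in place of the DAGScheduler internals (the real helper's name and signature may differ):

```scala
import scala.collection.mutable

// Minimal stubs; not the real Spark classes.
case class StageInfo(stageId: Int)
case class Stage(id: Int, info: StageInfo)
case class SparkListenerStageCompleted(info: StageInfo)

object SchedulerSketch {
  val runningStages = mutable.Set[Stage]()
  def post(event: Any): Unit = println(event)

  // Funneling every "this stage is done" path through one helper keeps the
  // listener event and the scheduler's bookkeeping from drifting apart on
  // failure paths such as FetchFailed.
  def markStageAsFinished(stage: Stage): Unit = {
    post(SparkListenerStageCompleted(stage.info))
    runningStages -= stage
  }
}
```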

@kayousterhout (Contributor)

It looks like now, one stage can represent multiple stage attempts (in which case Stage.numTasks is wrong for the later attempts), but there's one StageInfo per attempt, and Stage.info is reset based on which attempt is currently running? This seems a bit ugly / error prone, and it also seems problematic in the case we discussed offline where a stage can have multiple active attempts (if this case really does happen).

Did you consider changing the resubmitFailedStages() method in the DAGScheduler to create a new Stage for the failed one (and then adding a copy() method or something to Stage that creates a new one based on the current one)?
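
A minimal sketch of that alternative, with a heavily trimmed `Stage` (fields and names here are illustrative, not the real class):

```scala
// One immutable Stage object per attempt, instead of one mutable Stage
// whose info is rewritten across attempts.
class Stage(val id: Int, val attempt: Int, val numTasks: Int) {
  // resubmitFailedStages would call this to get a fresh Stage for the retry.
  def copyForNewAttempt(numTasks: Int): Stage =
    new Stage(id, attempt + 1, numTasks)
}
```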

@kayousterhout (Contributor)

Also, what are the semantics of accumulables for resubmitted stages? I ask because right now, the way you copy StageInfo, the values of the accumulables get wiped when a stage gets resubmitted... just wondering if that's the desired behavior.
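
An illustration of the concern, with stub classes rather than the real `StageInfo`: a freshly created `StageInfo` starts with an empty accumulables map, so values from the previous attempt vanish from the UI on resubmit.

```scala
import scala.collection.mutable.HashMap

// Stubs trimmed to the relevant fields; illustrative only.
case class AccumulableInfo(id: Long, name: String, value: String)
class StageInfo(val stageId: Int, val attemptId: Int) {
  val accumulables = new HashMap[Long, AccumulableInfo]
}

val attempt0 = new StageInfo(1, 0)
attempt0.accumulables(7L) = AccumulableInfo(7L, "records", "100")

val attempt1 = new StageInfo(1, 1) // fresh StageInfo for the resubmitted stage
assert(attempt1.accumulables.isEmpty) // previous attempt's values are gone
```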

@kayousterhout (Contributor)

Ok, so if you're anxious to get this in, how about this simpler fix to make it a little less ugly (sketched below):
(1) Change the numTasks parameter of Stage so it's not a val and isn't saved as part of the class, since it's incorrect for later attempts. Then change StageInfo.fromStage to always accept a number of tasks. Also update the docstring for Stage to specify that a Stage object is used across multiple stage attempts.
(2) Change the comment above Stage.info to say it's a pointer to the most recent StageInfo, which the DAGScheduler will update for new stage attempts. Maybe also rename it to latestInfo so it's abundantly clear that it can be updated.
(3) Reset the info in resubmitFailedStages rather than where you currently have it. I think that makes it clearer what's going on and why Stage.info needs to be set.
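
A condensed sketch of the suggested shape, with `StageInfo` trimmed to the fields that matter here (so these are not the real Spark classes):

```scala
// Illustrative stub; the real StageInfo carries much more.
case class StageInfo(stageId: Int, attemptId: Int, numTasks: Int)

class Stage(val id: Int) {
  private var nextAttemptId = 0

  // (2) Pointer to the most recent StageInfo, updated by the DAGScheduler
  // each time a new attempt of this stage starts.
  var latestInfo: StageInfo = _

  // (1) numTasks is supplied per attempt rather than stored on the Stage,
  // since the Stage object is reused across attempts.
  def startNewAttempt(numTasks: Int): Unit = {
    latestInfo = StageInfo(id, nextAttemptId, numTasks)
    nextAttemptId += 1
  }
}

val stage = new Stage(3)
stage.startNewAttempt(numTasks = 10) // attempt 0
stage.startNewAttempt(numTasks = 12) // attempt 1 after a retry, possibly with a different task count
```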

SparkQA commented Aug 19, 2014

Tests timed out after a configured wait of 120m.

SparkQA commented Aug 19, 2014

QA tests have started for PR 1545 at commit 6c08b07.

  • This patch merges cleanly.

SparkQA commented Aug 19, 2014

Tests timed out after a configured wait of 120m.

@pwendell (Contributor)

Jenkins, retest this please.

SparkQA commented Aug 19, 2014

QA tests have started for PR 1545 at commit 6c08b07.

  • This patch merges cleanly.

@pwendell (Contributor)

Pulled this from the Jenkins log:

```
14/08/18 22:52:57.452 INFO BlockManager: Found block broadcast_13 locally
14/08/18 22:52:57.453 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 36)
org.apache.spark.TaskKilledException
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/08/18 22:52:57.453 WARN TaskSetManager: Lost task 1.0 in stage 13.0 (TID 36, localhost): org.apache.spark.TaskKilledException:
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
14/08/18 22:52:57.454 INFO TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool
14/08/18 22:52:57.456 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 13
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:900)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1378)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/08/18 22:52:57.472 INFO SparkContext: Starting job: first at ChiSqTest.scala:81
```
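
The crash in the log is an unconditional lookup into `stageIdToStage` for a stage that has already been cleaned up. As a sketch of the failure mode only (not necessarily this PR's actual fix), a guarded `get` tolerates such straggler events:

```scala
import scala.collection.mutable.HashMap

// Stubbed; not DAGScheduler code. An apply() on a mutable HashMap throws
// NoSuchElementException for a missing key, which is what killed the
// SparkContext above ("key not found: 13").
val stageIdToStage = new HashMap[Int, String]
val stageId = 13

stageIdToStage.get(stageId) match {
  case Some(stage) => println(s"handle completion for $stage")
  case None        => println(s"stage $stageId already removed; ignoring straggler event")
}
```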

```scala
if (isSuccessful) {
logInfo("%s (%s) finished in %s s".format(stage, stage.name, serviceTime))
} else {
```


spacing seems off here

SparkQA commented Aug 19, 2014

QA tests have started for PR 1545 at commit 0f36075.

  • This patch merges cleanly.

SparkQA commented Aug 19, 2014

QA tests have finished for PR 1545 at commit 0f36075.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

SparkQA commented Aug 19, 2014

QA tests have started for PR 1545 at commit b3e2eed.

  • This patch merges cleanly.

```diff
@@ -56,9 +57,15 @@ private[spark] object StageInfo {
  * shuffle dependencies. Therefore, all ancestor RDDs related to this Stage's RDD through a
  * sequence of narrow dependencies should also be associated with this Stage.
  */
-  def fromStage(stage: Stage): StageInfo = {
+  def fromStage(stage: Stage, numTasks: Option[Int] = None): StageInfo = {
```

This is a nit, but I think this method might be better as an updateStageInfo(numTasks: Int) method in Stage, which would create an appropriate StageInfo and then set latestInfo accordingly (I think that would make the intended usage a little clearer to a reader). Fine if you think it's better this way, though...


rxin commented Aug 20, 2014

Ok, I pushed a new version that should address the hanging JobCancellationSuite test. I also went through all the changes to make sure similar problems can't happen due to races.

SparkQA commented Aug 20, 2014

QA tests have started for PR 1545 at commit c414c36.

  • This patch merges cleanly.

SparkQA commented Aug 20, 2014

QA tests have finished for PR 1545 at commit c414c36.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 20, 2014

Tests timed out after a configured wait of 120m.

@pwendell (Contributor)

Jenkins, test this please.

SparkQA commented Aug 20, 2014

QA tests have started for PR 1545 at commit 40a6bd5.

  • This patch merges cleanly.

SparkQA commented Aug 20, 2014

QA tests have finished for PR 1545 at commit 40a6bd5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)


rxin commented Aug 20, 2014

Jenkins, retest this please.

SparkQA commented Aug 20, 2014

QA tests have started for PR 1545 at commit 40a6bd5.

  • This patch merges cleanly.

SparkQA commented Aug 20, 2014

QA tests have finished for PR 1545 at commit 40a6bd5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)

@pwendell (Contributor)

Jenkins retest this please.

@pwendell (Contributor)

Jenkins, test this please.

SparkQA commented Aug 20, 2014

QA tests have started for PR 1545 at commit 40a6bd5.

  • This patch merges cleanly.

SparkQA commented Aug 20, 2014

QA tests have finished for PR 1545 at commit 40a6bd5.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- Properly report stage failure in FetchFailed.

rxin commented Aug 20, 2014

Jenkins, test this please.

SparkQA commented Aug 20, 2014

QA tests have started for PR 1545 at commit 3ee1d2a.

  • This patch merges cleanly.

SparkQA commented Aug 20, 2014

QA tests have finished for PR 1545 at commit 3ee1d2a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)

@pwendell (Contributor)

Okay, I did another pass on this. Thanks @kayousterhout and @rxin for taking a lot of time on it; this will be a major usability improvement for complex jobs that hit failures.

asfgit closed this in fb60bec Aug 20, 2014
asfgit pushed a commit that referenced this pull request Aug 20, 2014
Simple way to reproduce this in the UI:

```scala
val f = new java.io.File("/tmp/test")
f.delete()
sc.parallelize(1 to 2, 2).map(x => (x,x )).repartition(3).mapPartitionsWithContext { case (context, iter) =>
  if (context.partitionId == 0) {
    val f = new java.io.File("/tmp/test")
    if (!f.exists) {
      f.mkdir()
      System.exit(0);
    }
  }
  iter
}.count()
```

Author: Reynold Xin <[email protected]>

Closes #1545 from rxin/stage-attempt and squashes the following commits:

3ee1d2a [Reynold Xin] - Rename attempt to retry in UI. - Properly report stage failure in FetchFailed.
40a6bd5 [Reynold Xin] Updated test suites.
c414c36 [Reynold Xin] Fixed the hanging in JobCancellationSuite.
b3e2eed [Reynold Xin] Oops previous code didn't compile.
0f36075 [Reynold Xin] Mark unknown stage attempt with id -1 and drop that in JobProgressListener.
6c08b07 [Reynold Xin] Addressed code review feedback.
4e5faa2 [Reynold Xin] [SPARK-2298] Encode stage attempt in SparkListener & UI.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014