
[SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets #16542

Closed
wants to merge 5 commits into from

Conversation

CodingCat
Contributor

What changes were proposed in this pull request?

The current implementation of Spark Streaming considers a batch completed regardless of whether its jobs succeeded (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
Let's consider the following case:
A micro batch contains two jobs that read from two different Kafka topics. One of the jobs fails due to a problem in the user-defined logic, after the other has finished successfully.

  1. The main thread of the Spark Streaming application executes the line mentioned above,
  2. another thread (the checkpoint writer) writes a checkpoint file immediately after that line is executed,
  3. then, due to the current error handling mechanism in Spark Streaming, the StreamingContext is closed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).
    When the user recovers from the checkpoint file, the data being processed by the failed job is never reprocessed, because the JobSet containing the failed job was removed (treated as completed) before the checkpoint was constructed.

This PR fixes the issue by removing a JobSet from JobScheduler.jobSets only when all jobs in the JobSet have finished successfully.

How was this patch tested?

existing tests
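The core idea of the fix can be sketched with a simplified model (hypothetical class names, not the actual Spark internals): a JobSet stays in the scheduler's map until every job in it has completed, and it is only removed when none of them failed.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark Streaming's Job/JobSet, for illustration only.
case class Job(id: Int, var done: Boolean = false, var failed: Boolean = false)

class JobSet(val time: Long, val jobs: Seq[Job]) {
  def hasCompleted: Boolean = jobs.forall(_.done)
  def hasFailed: Boolean = jobs.exists(j => j.done && j.failed)
}

class Scheduler {
  val jobSets = mutable.Map[Long, JobSet]()

  def submit(js: JobSet): Unit = jobSets(js.time) = js

  // Called whenever an individual job finishes, successfully or not.
  def handleJobCompletion(js: JobSet, job: Job, failed: Boolean): Unit = {
    job.done = true
    job.failed = failed
    if (js.hasCompleted && !js.hasFailed) {
      // Only a fully successful batch is treated as completed and removed.
      // A JobSet with a failed job is kept, so a checkpoint written at this
      // point still contains it and the batch can be reprocessed on recovery.
      jobSets.remove(js.time)
    }
  }
}
```

In the buggy behavior described above, the removal happened unconditionally when the last job of the batch finished, which is what allowed the checkpoint writer to persist a state that had already forgotten the failed batch.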

@CodingCat
Contributor Author

@zsxwing

@CodingCat CodingCat changed the title [SPARK-18905] Fix the issue of removing a failed jobset from JobScheduler.jobSets [SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from JobScheduler.jobSets Jan 11, 2017
@SparkQA

SparkQA commented Jan 11, 2017

Test build #71172 has finished for PR 16542 at commit a8646ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jobSet.totalDelay / 1000.0, jobSet.time.toString,
jobSet.processingDelay / 1000.0
))
listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
Member

Could you also post this event for a failed jobSet? Otherwise, the web UI cannot show it.

Contributor Author

sure
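The reviewer's point can be illustrated with a minimal sketch (hypothetical stand-in types, not Spark's actual listener API): the batch-completed event is posted whether or not the batch contained a failed job, so the web UI still learns about the failed batch.

```scala
// Simplified stand-ins for Spark Streaming's listener types, for illustration only.
case class BatchInfo(time: Long, failed: Boolean)
case class StreamingListenerBatchCompleted(info: BatchInfo)

class ListenerBus {
  private var events = List.empty[StreamingListenerBatchCompleted]
  def post(e: StreamingListenerBatchCompleted): Unit = events :+= e
  def posted: List[StreamingListenerBatchCompleted] = events
}

def onBatchFinished(bus: ListenerBus, info: BatchInfo): Unit = {
  // Post the event unconditionally: skipping it for a failed batch
  // would leave the web UI with no record of that batch at all.
  bus.post(StreamingListenerBatchCompleted(info))
}
```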

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71288 has finished for PR 16542 at commit 465ccc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jan 17, 2017

LGTM. Thanks! Merging to master and 2.1.

@asfgit asfgit closed this in f8db894 Jan 17, 2017
asfgit pushed a commit that referenced this pull request Jan 17, 2017
…om JobScheduler.jobSets

Author: CodingCat <[email protected]>
Author: Nan Zhu <[email protected]>

Closes #16542 from CodingCat/SPARK-18905.

(cherry picked from commit f8db894)
Signed-off-by: Shixiong Zhu <[email protected]>
@CodingCat
Contributor Author

Thanks

zzcclp added a commit to zzcclp/spark that referenced this pull request Jan 20, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017