[SPARK-24552][core][SQL] Use task ID instead of attempt number for writes. #21606
Conversation
Credit here should go to @rdblue when merging.
      committer: FileCommitProtocol,
      iterator: Iterator[(K, V)]): TaskCommitMessage = {
    // Set up a task.
    val taskContext = config.createTaskAttemptContext(
-     jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
+     jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
is it safe?
The task ID is unique across the entire Spark application life cycle, which means we may have very large task IDs in a long-running micro-batch streaming application.
If we do need an int here, I'd suggest we combine stageAttemptNumber and taskAttemptNumber into an int, which is much less risky. (Spark won't have a lot of stage/task attempts.)
Streaming still generates separate jobs / stages for each batch, right?
In that case this should be fine; this would only be a problem if a single stage has enough tasks to cover all the integer space (4 billion tasks). That shouldn't even be possible, since I doubt you'd be able to have more than Integer.MAX_VALUE tasks (and even that is unlikely to ever happen).
I could use abs here (and in the SQL code) to avoid a negative value (potentially avoiding weird file names).
I don't follow; the task IDs increment across jobs, so if you have a very long-running application that continues to start new jobs, you could potentially run out.
But what does "run out" mean?
If your task ID goes past Int.MaxValue, you'll start getting negative values here. Eventually you'll get to a long value that wraps back around to 0 when cast to an integer:
(2L + Int.MaxValue + Int.MaxValue).toInt
res2: Int = 0
So for this to "not work", meaning you'd have a conflict where two tasks generate the same output file name based on all these values (stage, task, partition, etc.), you'd need about 4 billion tasks in the same stage.
In other situations you may get weird values because of the cast, but it should still work.
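For reference, a quick REPL sketch (not from the PR) of what the cast does once the TID grows past Int.MaxValue; only the low 32 bits of the long survive:

scala> (Int.MaxValue.toLong + 1).toInt   // first value past Int.MaxValue wraps to a negative
res0: Int = -2147483648

scala> ((1L << 32) | 42L).toInt          // after 2^32 task IDs, the low bits repeat
res1: Int = 42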
Ah, I see what you are saying; we just need to make sure that it going negative doesn't cause any side effects or anything unexpected.
I commented before I saw this thread, but I think it is better to use the TID because that is already exposed in the UI, so it is easier to match UI tasks with logs. The combined attempt number isn't used anywhere, so this would introduce another number to identify a task. And shifting by 16 means that these grow huge anyway.
To backport this, can we use the .toInt version? I think that should be safe.
   * @param epochId A monotonically increasing id for streaming queries that are split in to
   *                discrete periods of execution. For non-streaming queries,
   *                this ID will always be 0.
   */
-  DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
+  DataWriter<T> createDataWriter(int partitionId, int taskId, long epochId);
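Purely as an illustration (not code from this PR), a minimal sketch of how an implementation of createDataWriter might fold the task ID into its staging file name so that retried attempts can never collide; the helper name and file layout below are made up:

// Hypothetical naming helper inside a DataWriterFactory implementation.
// partitionId, taskId and epochId are the arguments passed to createDataWriter();
// since the TID is unique per task attempt, a retried stage cannot produce a
// name that clashes with (and later aborts) another attempt's output file.
def stagingFileName(partitionId: Int, taskId: Int, epochId: Long): String =
  f"part-$partitionId%05d-epoch-$epochId-tid-$taskId.parquet"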
In SparkHadoopWriter we must have an int, but here why not just change the type to long? Data source v2 is still evolving, and we have already made a lot of changes in the master branch.
I'm fine with that if you're ok with it, but wouldn't that make backporting this to 2.3 a little fishy? Yeah it's evolving, but it's still a little sub-optimal to break things in a maintenance release.
If this patch targets 2.3, I'd say we should not change any API or documentation; just pass taskId.toInt as attemptNumber and add comments to explain this hacky workaround.
Just so I understand, what's the reason for not changing the parameter name and API docs? The name is not a public API in Java, so it doesn't break anything.
And regardless of the parameter name, the API documentation is wrong (since it says you can have multiple tasks with the same ID, but different attempts, which does not happen).
The V2 commit stuff is not in 2.3
Hmm, interesting. But there is an API in 2.3:
DataWriter<T> createDataWriter(int partitionId, int attemptNumber);
which I guess would still suffer from the problem Ryan describes in the bug. In any case, that means this can't be cleanly backported, so we can make the type change here.
+1 for the type change.
-      logInfo(s"Writer for stage $stageId / $stageAttempt, " +
-        s"task $partId.$attemptId is authorized to commit.")
+      logInfo(s"Writer for stage $stageId.$stageAttempt, " +
+        s"task $partId.$taskId is authorized to commit.")
this we want to leave as attemptNumber
       logInfo(message)
       // throwing CommitDeniedException will trigger the catch block for abort
-      throw new CommitDeniedException(message, stageId, partId, attemptId)
+      throw new CommitDeniedException(message, stageId, partId, taskId)
I think these and the messages above should use the attempt number to match the output committer.
I guess it depends on how picky we want to be; there are other places that use attemptNumber that we could update to task ID: InternalRowDataWriterFactory, memoryV2, and SimpleWritableDataSource, i.e. the places that implement createDataWriter.
    // The first two are currently the case in Spark, while the last one is very unlikely to
    // occur. If it does, two task IDs on a single stage could have a clashing integer value,
    // which could lead to code that generates clashing file names for different tasks. Still,
    // if the commit coordinator is enabled, only one task would be allowed to commit.
Since it's not a simple toInt anymore, how about we combine the stage and task attempt numbers?

val stageAttemptNumber = ...
val taskAttemptNumber = ...
assert(stageAttemptNumber <= Short.MaxValue)
assert(taskAttemptNumber <= Short.MaxValue)
val sparkAttemptNumber = (stageAttemptNumber << 16) | taskAttemptNumber
We can also remove the asserts and assume that, even if we have that many attempts, they are not all active.
Ok, I'll use that. I think Spark might fail everything before you even go that high in attempt numbers anyway...
Test build #92181 has finished for PR 21606 at commit
Test build #92187 has finished for PR 21606 at commit
Test build #92189 has finished for PR 21606 at commit
Test build #92190 has finished for PR 21606 at commit
Test build #92192 has finished for PR 21606 at commit
test this please
@@ -76,13 +76,17 @@ object SparkHadoopWriter extends Logging {
     // Try to write all RDD partitions as a Hadoop OutputFormat.
     try {
       val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
+        // SPARK-24552: Generate a unique "task ID" based on the stage and task attempt numbers.
+        // Assumes that there won't be more than Short.MaxValue attempts, at least not concurrently.
+        val taskId = (context.stageAttemptNumber << 16) | context.attemptNumber
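A side note, not part of the patch: the packed value above is easy to take apart again if you ever need to map a Hadoop task/attempt ID from the logs back to Spark's attempt counters. A minimal sketch, assuming the (stageAttemptNumber << 16) | attemptNumber layout shown in the diff:

def unpack(composite: Int): (Int, Int) = {
  val stageAttempt = composite >>> 16    // high 16 bits: stage attempt number
  val taskAttempt  = composite & 0xFFFF  // low 16 bits: task attempt number
  (stageAttempt, taskAttempt)
}
// unpack((3 << 16) | 7) == (3, 7)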
Perhaps we should rename taskId to something more distinctive so we don't confuse it with the real Spark task ID.
Maybe just something like uniqueTaskId or specialTaskId, but it's not a big deal.
Test build #92214 has finished for PR 21606 at commit
+1
@@ -76,13 +76,17 @@ object SparkHadoopWriter extends Logging {
     // Try to write all RDD partitions as a Hadoop OutputFormat.
     try {
       val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
+        // SPARK-24552: Generate a unique "attempt ID" based on the stage and task attempt numbers.
+        // Assumes that there won't be more than Short.MaxValue attempts, at least not concurrently.
+        val attemptId = (context.stageAttemptNumber << 16) | context.attemptNumber
I don't think we should generate an ID this way. We already have a unique ID that is exposed in the Spark UI. I'd much rather make it clear that the TID passed to committers as an attempt ID is the same as the TID in the stage view. That makes debugging easier. Going with this approach just introduces yet another number to track an attempt.
The problem is that the task ID is a long, and we can't change the Hadoop API to accept that; to me it's quite possible to have a valid task ID > 2^32. It might not be ideal to do it this way, but I think it's a good bug fix, especially for now; we can file a follow-on to improve it if we have ideas or want to change the interface.
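To make the constraint concrete, here is a small illustrative sketch (not from the patch) of the Hadoop IDs a writer ultimately has to construct; the constructors only accept ints, so whatever Spark passes down has to fit in 32 bits. The concrete values are made up:

import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

val jobId     = new JobID("20180622162153", 0)            // job tracker ID string is arbitrary
val taskId    = new TaskID(jobId, TaskType.MAP, 3)         // partition id: must be an Int
val attemptId = new TaskAttemptID(taskId, (1 << 16) | 0)   // attempt number: must be an Int too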
Okay, that makes sense if this is just for Hadoop attempt IDs. Maybe that's a good thing to put in the comment as well?
         s"task $partId.$attemptId is authorized to commit.")
       dataWriter.commit()
     } else {
-      val message = s"Stage $stageId / $stageAttempt, " +
+      val message = s"Stage $stageId.$stageAttempt, " +
Should these logs use TID instead of attempt number? The format used in other log messages is s"Task $taskId (TID $tid)", I think.
(This is for the next line, sorry for the confusion)
I'll change these log messages a bit. I think the attempt is still helpful while we haven't changed the coordinator API (SPARK-24611), and it doesn't hurt to have it there even after we do.
+1 Thanks!
+1
A general comment about the log messages: it seems pretty noisy to have logInfo messages for every task (logging only in the failure paths would be better, in my opinion), but I'm keeping the current log level.
Also, since this patch won't backport cleanly, I'll go ahead and send versions of it for branch-2.3 and branch-2.2 (which I think will be enough to also backport to 2.1).
Just an FYI: I was looking at backporting to 2.2, and it looks like at least some write calls don't have the issue. Looks like that was lost when things were refactored. In fact there's a test for it; not sure why that was added.
Interesting. But I found the same code in a different place.
Test build #92218 has finished for PR 21606 at commit
Yeah, so things like saveAsTextFile in 2.2 are OK, but other functions like saveAsNewAPIHadoopFile and the DataFrame writers have the issue, so we do need to backport.
Test build #92225 has finished for PR 21606 at commit
Test build #92235 has finished for PR 21606 at commit
@@ -125,12 +124,12 @@ object DataWritingSparkTask extends Logging {
       val coordinator = SparkEnv.get.outputCommitCoordinator
       val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
A note for the follow-up: since we decided to use taskId as a unique identifier for write tasks, the output commit coordinator can also use taskId instead of stage and task attempts.
Given the deafening silence, I'll merge the PRs myself, since there are a bunch of +1s from others.
…ites.

This passes the unique task attempt id instead of attempt number to v2 data sources because attempt number is reused when stages are retried. When attempt numbers are reused, sources that track data by partition id and attempt number may incorrectly clean up data because the same attempt number can be both committed and aborted.

For v1 / Hadoop writes, generate a unique ID based on available attempt numbers to avoid a similar problem.

Closes apache#21558

Author: Marcelo Vanzin <[email protected]>
Author: Ryan Blue <[email protected]>

Closes apache#21606 from vanzin/SPARK-24552.2.

Ref: LIHADOOP-48531