Javadoc updates.
Marcelo Vanzin committed Jun 21, 2018
1 parent fdcd39c commit 7233a5f
Showing 3 changed files with 10 additions and 13 deletions.
DataSourceWriter.java
@@ -64,8 +64,8 @@ public interface DataSourceWriter {
DataWriterFactory<Row> createWriterFactory();

/**
- * Returns whether Spark should use the commit coordinator to ensure that at most one attempt for
- * each task commits.
+ * Returns whether Spark should use the commit coordinator to ensure that at most one task for
+ * each partition commits.
*
* @return true if commit coordinator should be used, false otherwise.
*/
@@ -90,9 +90,9 @@ default void onDataWriterCommit(WriterCommitMessage message) {}
* is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
*
* Note that speculative execution may cause multiple tasks to run for a partition. By default,
- * Spark uses the commit coordinator to allow at most one attempt to commit. Implementations can
+ * Spark uses the commit coordinator to allow at most one task to commit. Implementations can
* disable this behavior by overriding {@link #useCommitCoordinator()}. If disabled, multiple
- * attempts may have committed successfully and one successful commit message per task will be
+ * tasks may have committed successfully and one successful commit message per task will be
* passed to this commit method. The remaining commit messages are ignored by Spark.
*/
void commit(WriterCommitMessage[] messages);
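For context, a minimal sketch of how a DataSourceWriter might rely on the coordinator semantics described in the updated Javadoc. The class FileSinkWriter, its factory, and the staging behavior in the comments are hypothetical illustrations, not part of this commit; only the DataSourceWriter interface itself comes from Spark.

import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

public class FileSinkWriter implements DataSourceWriter {

  @Override
  public DataWriterFactory<Row> createWriterFactory() {
    // Hypothetical factory, sketched after the DataWriterFactory diff below.
    return new FileSinkWriterFactory();
  }

  @Override
  public boolean useCommitCoordinator() {
    // Keep the default: the coordinator ensures at most one task per
    // partition is allowed to commit.
    return true;
  }

  @Override
  public void commit(WriterCommitMessage[] messages) {
    // With the coordinator enabled, at most one commit message per
    // partition arrives here; promote the staged output to its final
    // location in one job-level commit.
  }

  @Override
  public void abort(WriterCommitMessage[] messages) {
    // Clean up output staged by any task attempt, committed or not.
  }
}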
DataWriter.java
@@ -40,13 +40,13 @@
* writers. If this data writer fails(one record fails to write or {@link #commit()} fails), an
* exception will be sent to the driver side, and Spark may retry this writing task a few times.
* In each retry, {@link DataWriterFactory#createDataWriter(int, int, long)} will receive a
- * different `attemptNumber`. Spark will call {@link DataSourceWriter#abort(WriterCommitMessage[])}
+ * different `taskId`. Spark will call {@link DataSourceWriter#abort(WriterCommitMessage[])}
* when the configured number of retries is exhausted.
*
* Besides the retry mechanism, Spark may launch speculative tasks if the existing writing task
* takes too long to finish. Different from retried tasks, which are launched one by one after the
* previous one fails, speculative tasks are running simultaneously. It's possible that one input
- * RDD partition has multiple data writers with different `attemptNumber` running at the same time,
+ * RDD partition has multiple data writers with different `taskId` running at the same time,
* and data sources should guarantee that these data writers don't conflict and can work together.
* Implementations can coordinate with driver during {@link #commit()} to make sure only one of
* these data writers can commit successfully. Or implementations can allow all of them to commit
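The conflict-avoidance pattern this Javadoc describes, sketched as a hypothetical DataWriter that stages output under a task-unique path. FileSinkDataWriter, StagedPathMessage, and the path scheme are illustrative assumptions; only the DataWriter and WriterCommitMessage interfaces come from Spark.

import java.io.IOException;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

public class FileSinkDataWriter implements DataWriter<Row> {
  private final int partitionId;
  private final int taskId;
  private final StringBuilder buffer = new StringBuilder();

  public FileSinkDataWriter(int partitionId, int taskId) {
    this.partitionId = partitionId;
    this.taskId = taskId;
  }

  @Override
  public void write(Row record) throws IOException {
    // Stage records under a location unique to this task, so a speculative
    // duplicate writing the same partition can never collide with it.
    buffer.append(record.toString()).append('\n');
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    // Report where this task staged its data; the driver (with the commit
    // coordinator) promotes at most one staged path per partition.
    String staged = "staging/part-" + partitionId + "/task-" + taskId;
    return new StagedPathMessage(staged);
  }

  @Override
  public void abort() throws IOException {
    buffer.setLength(0);  // discard this attempt's staged output
  }

  // Hypothetical commit message carrying the staged location to the driver.
  static class StagedPathMessage implements WriterCommitMessage {
    final String path;
    StagedPathMessage(String path) { this.path = path; }
  }
}

Because each writer's staged path embeds the taskId, concurrent attempts for the same partition stay isolated, and deduplication happens once, on the driver, at commit time.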
DataWriterFactory.java
@@ -42,15 +42,12 @@ public interface DataWriterFactory<T> extends Serializable {
* Usually Spark processes many RDD partitions at the same time,
* implementations should use the partition id to distinguish writers for
* different partitions.
- * @param attemptNumber Spark may launch multiple tasks with the same task id. For example, a task
- * failed, Spark launches a new task wth the same task id but different
- * attempt number. Or a task is too slow, Spark launches new tasks wth the
- * same task id but different attempt number, which means there are multiple
- * tasks with the same task id running at the same time. Implementations can
- * use this attempt number to distinguish writers of different task attempts.
+ * @param taskId A unique identifier for a task that is performing the write of the partition
+ * data. Spark may run multiple tasks for the same partition (due to speculation
+ * or task failures, for example).
* @param epochId A monotonically increasing id for streaming queries that are split in to
* discrete periods of execution. For non-streaming queries,
* this ID will always be 0.
*/
- DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
+ DataWriter<T> createDataWriter(int partitionId, int taskId, long epochId);
}
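Finally, a sketch of a factory matching the renamed parameter. FileSinkWriterFactory is a hypothetical name tying the two sketches above together; the comments restate how the three ids are meant to be used.

import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;

public class FileSinkWriterFactory implements DataWriterFactory<Row> {

  @Override
  public DataWriter<Row> createDataWriter(int partitionId, int taskId, long epochId) {
    // partitionId separates writers for different input partitions;
    // taskId separates concurrent attempts for the same partition
    // (retries and speculation); epochId separates the discrete periods
    // of a streaming query and is always 0 for batch queries.
    return new FileSinkDataWriter(partitionId, taskId);
  }
}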
