Spark 3.5: Parallelize reading files in snapshot and migrate procedures #10037

manuzhang · 2024-03-25T16:40:27Z

Similar to #9274 for snapshot and migrate procedures.

...extensions/src/test/java/org/apache/iceberg/spark/extensions/TestSnapshotTableProcedure.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/SnapshotTableSparkAction.java

api/src/main/java/org/apache/iceberg/actions/MigrateTable.java

api/src/main/java/org/apache/iceberg/actions/SnapshotTable.java

docs/docs/spark-procedures.md

...-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java

data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java

...k/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ProcedureUtil.java

...-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java

RussellSpitzer

IMHO the code here should check the number of threads passed to the executor service constructor. That's sufficient as a test for me.

I also think that we shouldn't allow 0 or -1. Seems like those should have been forbidden before and I don't see why we would allow them now.

I'm also approving this on the understanding that once Eduard is satisfied, I'm good to go as well.
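The two requests in this review (check the thread count handed to the executor, and reject 0 or negative values) can be sketched with plain `java.util.concurrent`. The class and method names below are hypothetical and are not taken from the PR, which builds its pools through Iceberg's own utilities; this is only an illustration of the kind of check being asked for.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the PR: validates the size, then builds a fixed pool.
final class ParallelismCheck {
  private ParallelismCheck() {}

  static ThreadPoolExecutor newPool(int parallelism) {
    // Reject 0 and negative values up front, as requested in the review.
    if (parallelism <= 0) {
      throw new IllegalArgumentException("Parallelism should be larger than 0");
    }
    return new ThreadPoolExecutor(
        parallelism, parallelism, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
  }

  public static void main(String[] args) {
    ThreadPoolExecutor pool = newPool(4);
    try {
      // A test could assert on this value instead of observing actual concurrency.
      System.out.println("configured threads: " + pool.getCorePoolSize());
    } finally {
      pool.shutdown();
    }
  }
}
```

Asserting on the configured pool size keeps the test deterministic, since it does not depend on how the scheduler actually interleaves the tasks.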

...-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java

aokolnychyi

LGTM. Nothing to add except what was already noted by others.

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/BaseProcedure.java

aokolnychyi · 2024-04-15T18:51:47Z

data/src/main/java/org/apache/iceberg/data/MigrationService.java

+import org.apache.iceberg.relocated.com.google.common.util.concurrent.MoreExecutors;
+import org.apache.iceberg.relocated.com.google.common.util.concurrent.ThreadFactoryBuilder;
+
+/** Have a separate class for getting ExecutorService to make it testable with static mock */


Are we sure it is a good idea to add another public class for this? How do we see it being used in the future? Why not use ThreadPools instead?

When we use a static mock, all invocations need to be mocked as well. That includes the ThreadPools calls used by SnapshotProducer and other places. Having a separate class and mocking it is the cleanest way.

I am not convinced it justifies adding another public class. What if we overload listPartition or whatever method we need in TableMigrationUtil to accept ExecutorService?

It is an open question whether we should accept ExecutorService or int in the actions (not procedures). I think we have both approaches in our APIs.

@aokolnychyi I finally got time to refactor a bit: the actions, SparkTableUtil and TableMigrationUtil now accept an ExecutorService. Please take another look.
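The overload discussed in this thread can be illustrated roughly as follows. This is a minimal sketch under assumptions: `PartitionLister`, `listPartitions`, and the placeholder "listing" work are hypothetical stand-ins, not the actual signatures in TableMigrationUtil or SparkTableUtil. The point is only that an overload taking an `ExecutorService` lets a test pass in its own executor rather than statically mocking a factory class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a migration utility; names are illustrative only.
final class PartitionLister {
  private PartitionLister() {}

  // Convenience overload: creates its own pool and shuts it down when done.
  static List<String> listPartitions(List<String> partitions, int parallelism) {
    ExecutorService service = Executors.newFixedThreadPool(parallelism);
    try {
      return listPartitions(partitions, service);
    } finally {
      service.shutdown();
    }
  }

  // Overload with a caller-supplied executor; the caller keeps ownership of it.
  static List<String> listPartitions(List<String> partitions, ExecutorService service) {
    List<Future<String>> futures = new ArrayList<>();
    for (String partition : partitions) {
      // Placeholder for the real per-partition file listing.
      futures.add(service.submit(() -> partition + "/data.parquet"));
    }
    List<String> files = new ArrayList<>();
    try {
      for (Future<String> future : futures) {
        files.add(future.get()); // results come back in submission order
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while listing partitions", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Failed to list a partition", e.getCause());
    }
    return files;
  }

  public static void main(String[] args) {
    System.out.println(listPartitions(Arrays.asList("p=1", "p=2", "p=3"), 2));
  }
}
```

Whether the public API should take an `int` or an `ExecutorService` remains the open design question raised above; the sketch simply shows that both can coexist, with the `int` variant delegating.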

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/BaseProcedure.java

manuzhang · 2024-06-12T05:29:59Z

@nastra @RussellSpitzer @aokolnychyi Could you please take another look?

nastra · 2024-06-25T15:52:46Z

@manuzhang Could you please rebase the PR? There were some changes that might affect it.

manuzhang · 2024-06-25T23:37:31Z

@nastra Done!

api/src/main/java/org/apache/iceberg/actions/SnapshotTable.java

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java

nastra · 2024-06-26T08:10:26Z

data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java

@@ -215,11 +250,7 @@ private static DataFile buildDataFile(
        .build();
  }

-  private static ExecutorService migrationService(int parallelism) {
-    return MoreExecutors.getExitingExecutorService(


why are we switching here from an exiting executor service?

ThreadPools.newWorkerPool also returns an exiting executor service.
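For readers unfamiliar with the term: Guava's `MoreExecutors.getExitingExecutorService` makes the pool use daemon threads and registers a shutdown hook that waits for outstanding work, so the pool does not keep the JVM alive. The snippet below is a rough standard-library approximation of that idea, not the Guava or Iceberg implementation; the class name, method names, and the 120-second wait are assumptions chosen for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical approximation of an "exiting" executor using only java.util.concurrent.
final class ExitingPools {
  private ExitingPools() {}

  // Daemon threads do not prevent the JVM from exiting.
  static ThreadFactory daemonFactory(String prefix) {
    AtomicInteger counter = new AtomicInteger();
    return runnable -> {
      Thread thread = new Thread(runnable, prefix + "-" + counter.getAndIncrement());
      thread.setDaemon(true);
      return thread;
    };
  }

  static ExecutorService newExitingPool(String prefix, int size) {
    ExecutorService service = Executors.newFixedThreadPool(size, daemonFactory(prefix));
    // On JVM shutdown, stop accepting work and give running tasks a bounded time to finish.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      service.shutdown();
      try {
        service.awaitTermination(120, TimeUnit.SECONDS); // arbitrary bound for this sketch
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }));
    return service;
  }

  public static void main(String[] args) {
    ExecutorService pool = newExitingPool("table-migration", 2);
    pool.submit(() -> System.out.println("running on " + Thread.currentThread().getName()));
    pool.shutdown();
  }
}
```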

nastra · 2024-06-26T08:19:34Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/SnapshotTableProcedure.java

+    if (!args.isNullAt(4)) {
+      int parallelism = args.getInt(4);
+      Preconditions.checkArgument(parallelism > 0, "Parallelism should be larger than 0");
+      action = action.executeSnapshotWith(executorService(parallelism, "table-snapshot"));


shutdown of the executor should have been handled in executorService, but we need to do the same for the executor that is used for migrate. See also my other comment in https://github.com/apache/iceberg/pull/10037/files#r1654342618

nastra

LGTM, thanks for getting this done @manuzhang

puchengy · 2024-07-10T22:54:53Z

Leaving an idea for a further speed-up for tables with partition-level data skew: we could further divide the files of a given partition into X buckets.
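One way to picture this idea, which is not implemented in this PR, is a round-robin split of a partition's file list into a fixed number of buckets so that each bucket can be handled as a separate task. The `FileBuckets` class and the round-robin policy below are assumptions made for illustration; a size-aware policy might balance skewed partitions better.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: split one partition's files into buckets for separate tasks.
final class FileBuckets {
  private FileBuckets() {}

  static <T> List<List<T>> split(List<T> files, int numBuckets) {
    if (numBuckets <= 0) {
      throw new IllegalArgumentException("Number of buckets should be larger than 0");
    }
    List<List<T>> buckets = new ArrayList<>(numBuckets);
    for (int i = 0; i < numBuckets; i++) {
      buckets.add(new ArrayList<>());
    }
    // Round-robin assignment: file i goes to bucket i mod numBuckets.
    for (int i = 0; i < files.size(); i++) {
      buckets.get(i % numBuckets).add(files.get(i));
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("f1", "f2", "f3", "f4", "f5");
    System.out.println(split(files, 2)); // prints [[f1, f3, f5], [f2, f4]]
  }
}
```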

manuzhang · 2024-07-11T02:36:28Z

@puchengy Feel free to open a new issue to track.

…es (apache#10037)

…te procedures Back-port of apache#10037

…cedures Back-port of apache#10037

Back-port of #9274 Back-port of #10037

…he#11043) Back-port of apache#9274 Back-port of apache#10037

github-actions bot added API spark docs labels Mar 25, 2024