
[HUDI-8800] Introduce SingleSparkConsistentBucketClusteringExecutionStrategy to improve performance #12537

Open · wants to merge 7 commits into master from feat_local_clustering_execution_strategy
Conversation

TheR1sing3un
Member

@TheR1sing3un TheR1sing3un commented Dec 24, 2024

For consistent bucket resizing, the current execution strategy creates a Spark job for each clustering group, uses Spark's datasource path inside each job to read the file slices that need to be merged or split, and then writes the data to the new file groups with a datasource bulk insert. That flow is fine for other kinds of clustering, but for bucket resizing it causes performance problems.

For bucket resizing, the clustering plan already lists the mapping explicitly: we know up front which file slices each clustering group reads and which file groups it writes to. Therefore:

  1. we can avoid the unnecessary ser/deser between Avro and InternalRow
  2. we can eliminate the unnecessary shuffle between reading and writing
  3. we can avoid heavy job management (see the sketch right after this list)
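
To make the idea concrete, here is a rough, hypothetical sketch (the class and helper names below are illustrative placeholders, not the code in this PR): because the plan already fixes which file slices feed which target file group, a single Spark job can treat every resizing group as an independent, compaction-style task.

// Hypothetical sketch, not the exact code in this PR: one Spark job, one task
// per resizing group, compaction-style, with no shuffle in between.
import java.io.Serializable;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;

public class SingleJobResizingSketch {

  /** One entry of the clustering plan: which slices to read, which file group to write. */
  public static class ResizingGroup implements Serializable {
    public List<String> sourceFileSliceIds;
    public String targetFileGroupId;
  }

  public static void run(JavaSparkContext jsc, List<ResizingGroup> groups, int parallelism) {
    // The plan already pins the read -> write mapping, so each group is an
    // independent task: no repartition/shuffle and no Avro <-> InternalRow hop.
    jsc.parallelize(groups, Math.min(parallelism, groups.size()))
        .foreach(group -> {
          Iterator<Object> records = readMergedRecords(group.sourceFileSliceIds); // placeholder reader
          writeToFileGroup(group.targetFileGroupId, records);                     // placeholder writer
        });
  }

  // Placeholders standing in for Hudi's record-merging reader and file-group writer.
  static Iterator<Object> readMergedRecords(List<String> fileSliceIds) {
    throw new UnsupportedOperationException("sketch only");
  }

  static void writeToFileGroup(String fileGroupId, Iterator<Object> records) {
    throw new UnsupportedOperationException("sketch only");
  }
}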

Validation in our production environment:

table

  • MOR
  • consistent bucket index
  • initial bucket number = 64
  • 64 buckets per partition
  • 20MB per bucket
  • 15 partitions

operation

  • try to resize to 128 buckets per partition
  • executor memory = 10G
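
For reference, here is a hedged sketch of how a table like this and the new execution strategy might be wired together through write options. The option keys are written from memory and the strategy package paths are assumed, so they should be verified against the Hudi configuration reference rather than taken as-is.

// Assumed option keys and package paths; verify before use.
import java.util.HashMap;
import java.util.Map;

public class ConsistentBucketResizeOptionsSketch {
  public static Map<String, String> options() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.index.type", "BUCKET");
    opts.put("hoodie.index.bucket.engine", "CONSISTENT_HASHING");
    opts.put("hoodie.bucket.index.num.buckets", "64");      // initial buckets per partition
    opts.put("hoodie.bucket.index.max.num.buckets", "128"); // resize target per partition
    opts.put("hoodie.clustering.plan.strategy.class",
        "org.apache.hudi.client.clustering.plan.strategy.SparkConsistentBucketClusteringPlanStrategy");
    // Execution strategy introduced by this PR; package path assumed.
    opts.put("hoodie.clustering.execution.strategy.class",
        "org.apache.hudi.client.clustering.run.strategy.SingleSparkConsistentBucketClusteringExecutionStrategy");
    return opts;
  }
}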

original clustering

[image]

optimized clustering

[image]

Change Logs

  1. introduce SingleSparkConsistentBucketClusteringExecutionStrategy to avoid shuffle

Impact

none

Risk level (write none, low medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) Dec 24, 2024
@TheR1sing3un TheR1sing3un changed the title [DNM] Introduce SingleSparkConsistentBucketClusteringExecutionStrategy to avoid shuffle [HUDI-8800] Introduce SingleSparkConsistentBucketClusteringExecutionStrategy to improve performance Dec 30, 2024
@TheR1sing3un
Member Author

@hudi-bot run azure

@TheR1sing3un
Member Author

TheR1sing3un commented Dec 30, 2024

One subtask will be finished in another PR:

  • Separate the resizing and sort logic of the consistent bucket index to keep the clustering semantics clear

…y to avoid shuffle

1. introduce SingleSparkConsistentBucketClusteringExecutionStrategy to avoid shuffle

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the feat_local_clustering_execution_strategy branch from 2cf056a to 665d201 Compare January 2, 2025 03:47
int readParallelism = Math.min(writeConfig.getClusteringGroupReadParallelism(), clusteringOps.size());

return HoodieJavaRDD.of(jsc.parallelize(clusteringOps, readParallelism).mapPartitions(clusteringOpsPartition -> {
  List<Supplier<ClosableIterator<HoodieRecord<T>>>> suppliers = new ArrayList<>();
  clusteringOpsPartition.forEachRemaining(clusteringOp -> {
    Supplier<ClosableIterator<HoodieRecord<T>>> iteratorSupplier = () -> {
      long maxMemoryPerCompaction = IOUtils.getMaxMemoryPerCompaction(new SparkTaskContextSupplier(), config);
Member Author

This deleted code was simply moved to SparkJobExecutionStrategy, which now provides a common reading method.

int readParallelism = Math.min(writeConfig.getClusteringGroupReadParallelism(), clusteringOps.size());

return HoodieJavaRDD.of(jsc.parallelize(clusteringOps, readParallelism)
    .mapPartitions(clusteringOpsPartition -> {
      List<Supplier<ClosableIterator<HoodieRecord<T>>>> iteratorGettersForPartition = new ArrayList<>();
      clusteringOpsPartition.forEachRemaining(clusteringOp -> {
        Supplier<ClosableIterator<HoodieRecord<T>>> recordIteratorGetter = () -> {
Member Author

This deleted code was simply moved to SparkJobExecutionStrategy, which now provides a common reading method.

        iteratorGettersForPartition.add(recordIteratorGetter);
      });

      return new LazyConcatenatingIterator<>(iteratorGettersForPartition);
    }));
}
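
The lazy-supplier pattern in the hunk above opens each clustering operation's reader only when the previous one has been fully consumed, so a task holds at most one file-slice reader at a time. A self-contained illustration of that pattern follows; it is illustrative only, not Hudi's actual LazyConcatenatingIterator.

// Illustrative sketch of lazy concatenation over iterator suppliers.
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

public class LazyConcatIteratorSketch<T> implements Iterator<T> {
  private final Iterator<Supplier<Iterator<T>>> suppliers;
  private Iterator<T> current;

  public LazyConcatIteratorSketch(List<Supplier<Iterator<T>>> suppliers) {
    this.suppliers = suppliers.iterator();
  }

  @Override
  public boolean hasNext() {
    // Open the next underlying iterator only when the current one is exhausted.
    while ((current == null || !current.hasNext()) && suppliers.hasNext()) {
      current = suppliers.next().get();
    }
    return current != null && current.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}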

private HoodieFileReader getBaseOrBootstrapFileReader(StorageConfiguration<?> storageConf, String bootstrapBasePath, Option<String[]> partitionFields, ClusteringOperation clusteringOp)
Member Author

This deleted code was simply moved to SparkJobExecutionStrategy, which now provides a common reading method.

final TaskContextSupplier taskContextSupplier = getEngineContext().getTaskContextSupplier();
final SerializableSchema serializableSchema = new SerializableSchema(schema);
final List<ClusteringGroupInfo> clusteringGroupInfos = clusteringPlan.getInputGroups().stream().map(ClusteringGroupInfo::create).collect(Collectors.toList());

String umask = engineContext.hadoopConfiguration().get("fs.permissions.umask-mode");
Member Author

None of this logic is valid anymore, and the class hasn't been used anywhere before, so I refactored this part of the code.

@TheR1sing3un
Member Author

@danny0405 Hi, Danny. In my opinion, we don't need to keep the current logic for consistent-bucket-resizing clustering. We only need to process it like compaction: no InternalRow conversion and no Spark shuffle. We should also separate the resizing and sort logic of the consistent bucket index to keep the clustering semantics clear.

@TheR1sing3un
Member Author

Two subtasks:

  • Allow sorting tables that use the consistent hashing index with an existing sort strategy.
  • Deprecate the previous hash-resizing execution strategy and use the single-job mode by default.

1. refactor clustering related code for better readability

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un requested a review from danny0405 January 9, 2025 03:53
1. fix ut

Signed-off-by: TheR1sing3un <[email protected]>
@@ -110,7 +115,7 @@ public void setup(int maxFileSize, Map<String, String> options) throws IOExcepti
.withStorageConfig(HoodieStorageConfig.newBuilder().parquetMaxFileSize(maxFileSize).build())
.withClusteringConfig(HoodieClusteringConfig.newBuilder()
.withClusteringPlanStrategyClass(SparkConsistentBucketClusteringPlanStrategy.class.getName())
.withClusteringExecutionStrategyClass(SparkConsistentBucketClusteringExecutionStrategy.class.getName())
.withClusteringExecutionStrategyClass(singleJob ? SINGLE_SPARK_JOB_CONSISTENT_HASHING_EXECUTION_STRATEGY : SPARK_CONSISTENT_BUCKET_EXECUTION_STRATEGY)
Contributor

Is SINGLE_SPARK_JOB_CONSISTENT_HASHING_EXECUTION_STRATEGY always better than SPARK_CONSISTENT_BUCKET_EXECUTION_STRATEGY? Why do we need two execution strategies?

Member Author

Is SINGLE_SPARK_JOB_CONSISTENT_HASHING_EXECUTION_STRATEGY always better than SPARK_CONSISTENT_BUCKET_EXECUTION_STRATEGY? Why do we need two execution strategies?

I'm not sure whether any users are already relying on SPARK_CONSISTENT_BUCKET_EXECUTION_STRATEGY; if so, should we keep it for compatibility? If not, we can deprecate it.

1. remove unused SparkJobExecutionStrategy

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un requested a review from danny0405 January 9, 2025 08:19
1. fix ut

Signed-off-by: TheR1sing3un <[email protected]>
@hudi-bot

hudi-bot commented Jan 9, 2025

CI report:

@hudi-bot supports the following bot commands:
  • @hudi-bot run azure: re-run the last Azure build
