[HUDI-8841] Fix schema validating exception during flink async clustering #12598

Open
cshuo wants to merge 4 commits into master
Conversation

cshuo (Contributor) commented Jan 8, 2025

Change Logs

In Flink SQL, a primary key constraint can be defined, which makes the type of the primary key field non-null, while Spark has no primary key constraint. As a result, the two engines have discrepancies in the schema of the underlying files, e.g., Parquet.

We can reconcile the schemas during the async clustering read to tolerate this difference.
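
For illustration, here is a minimal sketch of that discrepancy using Avro's SchemaBuilder; the record and field names are made up, and "id" stands in for the record key:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class NullabilityDiscrepancy {
  public static void main(String[] args) {
    // Schema written by Flink when "id" is declared as PRIMARY KEY (hence NOT NULL):
    Schema flinkWritten = SchemaBuilder.record("hudi_record").fields()
        .requiredString("id")    // "id": "string"            -> non-nullable
        .optionalLong("ts")
        .endRecord();

    // Schema written by Spark for the same table (no primary key constraint):
    Schema sparkWritten = SchemaBuilder.record("hudi_record").fields()
        .optionalString("id")    // "id": ["null", "string"]  -> nullable
        .optionalLong("ts")
        .endRecord();

    // The record key field differs only in nullability, which is enough to fail schema validation.
    System.out.println(flinkWritten.toString(true));
    System.out.println(sparkWritten.toString(true));
  }
}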

Impact

Hardens Flink clustering by reconciling schemas to tolerate different nullabilities of the record key fields.

Risk level (write none, low medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

 */
private Schema reconcileSchemaWithNullability(ClusteringOperation clusteringOperation) {
  String instantTs = StringUtils.isNullOrEmpty(clusteringOperation.getDataFilePath())
      ? FSUtils.getCommitTime(clusteringOperation.getDeltaFilePaths().get(0))
Contributor:

Currently, Flink clustering only works on append-only tables, so all the data files should be in Parquet format. We can fetch the record key fields and the file schema from the Parquet footer, and reconcile the record key fields if a nullability discrepancy exists.

Contributor Author:

As for where to get the file schema: files with the same commit time share the same write schema, and getTableAvroSchema in TableSchemaResolver is more efficient since it has a cache. What do you think?

Contributor:

No, the TableSchemaResolver would trigger file listing for the commit metadata, which puts pressure on the filesystem. Fetching the schema from a file is straightforward:

private Schema fetchSchemaFromFiles(Iterator<String> filePaths) throws IOException {
    Schema schema = null;
    while (filePaths.hasNext() && schema == null) {
      StoragePath filePath = new StoragePath(filePaths.next());
      if (FSUtils.isLogFile(filePath)) {
        // log file: read the writer schema from the log blocks
        schema = readSchemaFromLogFile(filePath);
      } else {
        // base file: read the Avro schema from the file footer via the format utils
        schema = HoodieIOFactory.getIOFactory(metaClient.getStorage())
            .getFileFormatUtils(filePath).readAvroSchema(metaClient.getStorage(), filePath);
      }
    }
    return schema;
  }

Another way is to force the record key fields to be nullable if they are not, assuming a nullable schema can be used to read a data file written with a non-nullable schema.
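
For reference, here is a rough sketch of that second idea using plain Avro APIs; the class and method names are illustrative, not the exact helper added in this PR:

import org.apache.avro.Schema;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class ForceNullableSketch {

  /** Rebuilds the schema so that every field named in columns allows null. */
  public static Schema forceNullable(Schema schema, Set<String> columns) {
    List<Schema.Field> newFields = new ArrayList<>();
    boolean changed = false;
    for (Schema.Field field : schema.getFields()) {
      Schema fieldSchema = field.schema();
      Object defaultVal = field.defaultVal();
      if (columns.contains(field.name()) && !isNullable(fieldSchema)) {
        // Wrap the original type in a ["null", type] union so null is tolerated on read.
        fieldSchema = Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), fieldSchema));
        // A union starting with "null" only accepts a null default, so drop any existing default.
        defaultVal = null;
        changed = true;
      }
      // Note: aliases and custom props are not copied in this sketch.
      newFields.add(new Schema.Field(field.name(), fieldSchema, field.doc(), defaultVal));
    }
    // Avoid recreating the schema when every requested field is already nullable.
    if (!changed) {
      return schema;
    }
    return Schema.createRecord(
        schema.getName(), schema.getDoc(), schema.getNamespace(), schema.isError(), newFields);
  }

  private static boolean isNullable(Schema s) {
    return s.getType() == Schema.Type.UNION
        && s.getTypes().stream().anyMatch(t -> t.getType() == Schema.Type.NULL);
  }
}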

Contributor Author:

The second way sounds more efficient; I'll verify it.

Contributor Author:

cc @danny0405 Updated using the second way, and added a test to verify that a nullable schema can be used to read a data file written with a non-nullable schema.
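
For what it is worth, the claim can be sanity-checked with Avro's built-in compatibility checker; this is only an illustration with made-up record and field names, not the test added in the PR:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class NullableReaderCheck {
  public static void main(String[] args) {
    // Writer schema: the record key was written as non-nullable (Flink primary key).
    Schema writer = SchemaBuilder.record("hudi_record").fields()
        .requiredString("id")
        .endRecord();

    // Reader schema: the same field forced to ["null", "string"].
    Schema reader = SchemaBuilder.record("hudi_record").fields()
        .optionalString("id")
        .endRecord();

    // A union branch in the reader that matches the writer type is resolvable, so this prints COMPATIBLE.
    System.out.println(
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType());
  }
}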

@github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Jan 8, 2025
 * @param nullable nullability of column type
 * @return a new schema with the nullabilities of the given columns updated
 */
public static Schema createSchemaWithNullabilityUpdate(
Contributor:

Rename to forceNullableColumns; we should also avoid recreating the schema if the fields are already nullable.

 * @return schema that has nullability constraints reconciled
 */
private Schema reconcileSchemaWithRecordKeyNullability(Schema schema) {
  Option<String[]> recordKeyOp = table.getMetaClient().getTableConfig().getRecordKeyFields();
Contributor:

The default record key field is uuid, so we should check the validity of the configured record key fields first.
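
For illustration, one possible shape of that validity check, assuming the Option and record-key APIs shown in the diff above; the class and helper names are hypothetical:

import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.hudi.common.util.Option;

public class RecordKeyValidity {
  // Only keep configured record key fields that actually exist in the schema,
  // so that a defaulted key such as "uuid" does not trigger reconciliation.
  static List<String> validRecordKeys(Option<String[]> recordKeyOp, Schema schema) {
    List<String> keys = new ArrayList<>();
    if (recordKeyOp.isPresent()) {
      for (String key : recordKeyOp.get()) {
        if (schema.getField(key) != null) {
          keys.add(key);
        }
      }
    }
    return keys; // an empty list means there is nothing to reconcile
  }
}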

@@ -605,14 +605,16 @@ public static String createSchemaErrorString(String errorMessage, Schema writerS
  * @param nullable nullability of column type
  * @return a new schema with the nullabilities of the given columns updated
  */
-public static Schema createSchemaWithNullabilityUpdate(
+public static Schema forceNullableColumns(
     Schema schema, List<String> nullableUpdateCols, boolean nullable) {
Contributor:

nullableUpdateCols -> columns. We can eliminate the nullable flag because it is always true.

@hudi-bot commented Jan 9, 2025

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build
