
Core: add a new task-type field to task JSON serialization. add data task JSON serialization implementation. #9728

Merged
merged 1 commit into apache:main
Jun 26, 2024

Conversation

Contributor
@stevenzwu commented Feb 14, 2024

Issue: #9597

The Java implementation only adds serialization for StaticDataTask. Other task types will be added in a follow-up PR, which is ready in my fork (stevenzwu@59445b7). With the additional parsers, TestFlinkMetaDataTable can pass with the FLIP-27 source using JSON serializers.

github-actions bot added the Specification label (Issues that may introduce spec changes.) (Feb 14, 2024)
@stevenzwu changed the title from "Spec: add task-type field to JSON serialization of file scan task. add JSON serialization for StaticDataTask." to "Spec: add a new task-type field to task JSON serialization. add data task JSON serialization spec." (Feb 16, 2024)
github-actions bot added the core label (Feb 16, 2024)
@stevenzwu changed the title from "Spec: add a new task-type field to task JSON serialization. add data task JSON serialization spec." to "Spec, Core: add a new task-type field to task JSON serialization. add data task JSON serialization spec and imp." (Feb 16, 2024)
private static final String RESIDUAL = "residual-filter";
private static final String TASK_TYPE = "task-type";

private enum TaskType {
Contributor

I feel the Java class names are starting to get confusing, even at the interface level. Although DataTask extends FileScanTask, there is really no file that the DataTask is scanning. So although StaticDataTask and BaseFileScanTask both implement FileScanTask, they are not really conceptually similar. The actual relationship is more like StaticDataTask implements DataTask and BaseFileScanTask implements FileScanTask, but DataTask and FileScanTask have many things in common, so we made DataTask extend FileScanTask.

To resolve this confusing situation and potentially accommodate other types of scan task, I am wondering whether we should go one more layer up: keep the file scan task spec as is and, on top of that, have a serialization spec for scan tasks with types like file-scan-task and data-task. We would then have a ScanTaskParser that delegates to the existing FileScanTaskParser or the DataTaskParser. What do you think?
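A minimal sketch of that facade idea, assuming illustrative names (ScanTaskParser, DataTaskParser, and a task-type discriminator); the actual classes and signatures in this PR may differ:

import java.io.IOException;

import com.fasterxml.jackson.core.JsonGenerator;

class ScanTaskParser {
  private static final String TASK_TYPE = "task-type";

  private ScanTaskParser() {}

  // assumes the caller has already opened the JSON object;
  // DataTask must be checked first because DataTask extends FileScanTask
  static void toJson(FileScanTask task, JsonGenerator generator) throws IOException {
    if (task instanceof DataTask) {
      generator.writeStringField(TASK_TYPE, "data-task");
      DataTaskParser.toJson((StaticDataTask) task, generator);
    } else {
      generator.writeStringField(TASK_TYPE, "file-scan-task");
      FileScanTaskParser.toJson(task, generator);
    }
  }
}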

Contributor Author
@stevenzwu Feb 16, 2024

@jackye1995 I agree that ScanTaskParser is more accurate for the facade/dispatcher class here. The only problem is that FileScanTaskParser is a public class in the core module. Technically, we would be breaking compatibility.

Maybe we can keep FileScanTaskParser and mark it as deprecated; it can just extend from the new ScanTaskParser to avoid code duplication. But that means we would need to name the file task parser BaseFileScanTaskParser, which is package-private anyway.

Contributor Author

Actually, we can just deprecate the public methods on FileScanTaskParser and still keep the file scan task implementation here.
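A rough sketch of how that could look, assuming the new ScanTaskParser entry points described above (signatures are illustrative, not the final API):

public class FileScanTaskParser {
  private FileScanTaskParser() {}

  /** @deprecated use ScanTaskParser#toJson instead; kept only for API compatibility */
  @Deprecated
  public static String toJson(FileScanTask fileScanTask) {
    return ScanTaskParser.toJson(fileScanTask);
  }

  /** @deprecated use ScanTaskParser#fromJson instead; kept only for API compatibility */
  @Deprecated
  public static FileScanTask fromJson(String json, boolean caseSensitive) {
    return ScanTaskParser.fromJson(json, caseSensitive);
  }

  // the package-private helpers that actually read/write file scan task fields stay here
}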

Contributor Author

class name refactoring is done.

format/spec.md Outdated
| **`data-file`** |`JSON object`|`See above, read content file instead`|
| **`delete-files`** |`JSON list of objects`|`See above, read content file instead`|
| **`residual-filter`** |`JSON object: residual filter expression`|`{"type":"eq","term":"id","value":1}`|
### Task Serialization
Contributor

It's really not clear to me why the file scan task is added to the spec here. I don't think it is referenced anywhere in the main body of the specification.

Contributor Author
@stevenzwu Feb 16, 2024

@emkornfield that is a fair question. @rdblue also asked whether this should be added to the REST OpenAPI spec. My reservation is that this is also used by Flink for state serialization, so it is not just a REST thing.

What do others think? @rdblue @jackye1995 @aokolnychyi

Contributor

What state is being serialized? How does it relate to reading tables?

Does the Flink use case require standardization here, or is it an implementation detail of Flink?

Contributor Author

The Flink source checkpoints pending splits of FileScanTask. The standardized JSON serialization of scan tasks is used by both the REST OpenAPI and Flink (checkpointing and job manager -> task manager split assignment). I would imagine that if Spark streaming were to checkpoint pending splits (scan tasks), it would probably also prefer JSON serialization over Java serialization.
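As a rough illustration of that checkpointing use case, here is a hypothetical split-state serializer built on the JSON parser (the actual Flink split serializer in Iceberg differs in detail):

import java.nio.charset.StandardCharsets;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.ScanTaskParser;

class JsonSplitStateSerializer {
  // pending splits wrap FileScanTasks; store them as JSON in checkpoint state
  byte[] serialize(FileScanTask task) {
    return ScanTaskParser.toJson(task).getBytes(StandardCharsets.UTF_8);
  }

  // restore the task on recovery, or when assigning splits to task managers
  FileScanTask deserialize(byte[] bytes, boolean caseSensitive) {
    return ScanTaskParser.fromJson(new String(bytes, StandardCharsets.UTF_8), caseSensitive);
  }
}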

Contributor

IMO, unless we expect Spark to read the REST checkpoints from Flink (I would guess this isn't the case), this is really an implementation detail of both engines and doesn't belong in the specification.

Contributor
@rdblue Feb 19, 2024

I would definitely prefer it if this were not part of the spec. I don't think there is any reason to require a specific JSON serialization and I was surprised to see it in the spec. It's great to have documentation on exactly what the parsers produce, but we have many parsers that are not covered by the table spec and are instead in other documents like the Puffin spec, View spec, or REST spec.

To me, state serialization is a concern internal to Flink. It's harder to adhere to a spec for that, plus make guarantees about forward and backward compatibility. And without context for how this is used and why it is here, we can't make decisions about how to evolve this. For example, if you wanted to remove a field that Flink doesn't use, how do you know whether that is safe in the table spec? What does it mean for this to evolve "safely"?

Contributor Author

Got it. @rdblue, should I just remove this section from the table spec? I am not sure what the policy is for removing an invalid section from the spec.

Contributor Author

For now, I have reverted the spec change. We still need to decide if/how we can remove the existing spec section on file scan task JSON serialization.

github-actions bot added the flink label (Feb 16, 2024)
@stevenzwu force-pushed the issue-9597 branch 2 times, most recently from c27b5d5 to 4c7c907 (February 20, 2024 01:16)
@stevenzwu changed the title from "Spec, Core: add a new task-type field to task JSON serialization. add data task JSON serialization spec and imp." to "Core: add a new task-type field to task JSON serialization. add data task JSON serialization spec and imp." (Feb 20, 2024)
@stevenzwu changed the title from "Core: add a new task-type field to task JSON serialization. add data task JSON serialization spec and imp." to "Core: add a new task-type field to task JSON serialization. add data task JSON serialization imp." (Mar 13, 2024)
Contributor Author
@stevenzwu

@nastra @pvary, can you help review? The spec change/revert has been moved to a separate PR: https://github.com/apache/iceberg/pull/9771/files.

This is blocking Flink from moving to the FLIP-27 source as the default. Another PR will be needed to complete metadata query support.

Contributor
@nastra commented Mar 14, 2024

@stevenzwu I'll take a look either tomorrow or early next week

this.value = value;
}

public static TaskType fromValue(String value) {
Contributor
@nastra Mar 14, 2024

the codebase typically names this fromString() or fromName(). Given that this looks specifically for the task type name, maybe this should be named fromTypeName()?

}
}

public String value() {
Contributor

nit: maybe typeName()?

private static final String TASK_TYPE = "task-type";

private enum TaskType {
FILE_SCAN_TASK("file-scan-task"),
Contributor

I think this should have an UNKNOWN for forward/backward compatibility. Imagine a client and a server that use different Iceberg versions while new task types are added over time.

Contributor Author

see my reply in the comment below

} else if (DATA_TASK.value().equalsIgnoreCase(value)) {
return DATA_TASK;
} else {
throw new IllegalArgumentException("Unknown task type: " + value);
Contributor

This probably shouldn't fail but rather return UNKNOWN. See also #7145, where a similar issue was addressed; I think we need to do the same thing here.

Contributor Author

I checked out PR #7145. I am not sure it applies here. Metrics reporting may be considered optional, so returning an unknown report metrics object might be desirable there. But here, if a scan task is unknown and can't be parsed properly, we should fail explicitly.
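For reference, a sketch of the enum under discussion, pieced together from the diff fragments in this review, with the fromTypeName/typeName naming suggestions applied and explicit failure on unknown values (details may differ from the final code):

private enum TaskType {
  FILE_SCAN_TASK("file-scan-task"),
  DATA_TASK("data-task");

  private final String value;

  TaskType(String value) {
    this.value = value;
  }

  // fail loudly: a task whose type is unrecognized cannot be restored safely
  public static TaskType fromTypeName(String value) {
    if (FILE_SCAN_TASK.typeName().equalsIgnoreCase(value)) {
      return FILE_SCAN_TASK;
    } else if (DATA_TASK.typeName().equalsIgnoreCase(value)) {
      return DATA_TASK;
    } else {
      throw new IllegalArgumentException("Unknown task type: " + value);
    }
  }

  public String typeName() {
    return value;
  }
}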

generator.writeStringField(TASK_TYPE, TaskType.FILE_SCAN_TASK.value());
FileScanTaskParser.toJson(fileScanTask, generator);
} else {
throw new UnsupportedOperationException(
Contributor

I don't think this should fail, for the same reason I mentioned further above. If you take a look at ReportMetricsRequestParser, for example, it doesn't fail if it sees an unknown type, and we would need to do the same handling here.

Contributor Author

see my reply in the comment above

.as("Schema should match")
.isTrue();

Assertions.assertThat(expected.projectedSchema().sameSchema(actual.projectedSchema()))
Contributor

same as above

@@ -27,29 +27,41 @@
import org.junit.jupiter.params.provider.ValueSource;

public class TestFileScanTaskParser {

Contributor

nit: unnecessary change

@Test
public void testNullArguments() {
Assertions.assertThatThrownBy(() -> FileScanTaskParser.toJson(null))
Assertions.assertThatThrownBy(() -> ScanTaskParser.toJson(null))
Contributor

I don't think this test class should actually change at all. Since you added a new ScanTaskParser, it's best to also add a TestScanTaskParser. This test should stay the same, since someone could still use the parser and its existing behavior shouldn't change, IMO.

Contributor Author
@stevenzwu Mar 15, 2024

I see your point here. I will add a TestScanTaskParser to test the facade part, such as the task-type field and unsupported task types.

You pointed out another issue with this test class: we should test both the old/deprecated public API (via FileScanTaskParser.toJson/fromJson) and the new ScanTaskParser public API.

But this class still needs to change, as we are not going to add FileScanTask testing to the TestScanTaskParser facade test class.
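A sketch of what the facade-level tests in a new TestScanTaskParser could cover; the exact error messages are assumptions, hence the hasMessageContaining checks:

import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.junit.jupiter.api.Test;

public class TestScanTaskParser {
  @Test
  public void nullArguments() {
    // the facade should reject null input for both directions
    assertThatThrownBy(() -> ScanTaskParser.toJson(null))
        .isInstanceOf(IllegalArgumentException.class)
        .hasMessageContaining("Invalid scan task");

    assertThatThrownBy(() -> ScanTaskParser.fromJson(null, true))
        .isInstanceOf(IllegalArgumentException.class)
        .hasMessageContaining("Invalid JSON string");
  }

  @Test
  public void unknownTaskType() {
    // an unrecognized task-type value should fail explicitly
    String json = "{\"task-type\": \"unknown-task\"}";
    assertThatThrownBy(() -> ScanTaskParser.fromJson(json, true))
        .isInstanceOf(IllegalArgumentException.class)
        .hasMessageContaining("Unknown task type");
  }
}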

@@ -27,7 +28,8 @@
* <p>This does not include snapshots that have been expired using {@link ExpireSnapshots}.
*/
public class SnapshotsTable extends BaseMetadataTable {
private static final Schema SNAPSHOT_SCHEMA =
@VisibleForTesting
static final Schema SNAPSHOT_SCHEMA =
Contributor

could we avoid making this visible?

Contributor Author

I was testing a real StaticDataTask for the snapshots metadata table/rows, hence exposing this as package-private. Note that it is not public, though.

We can avoid making it visible by defining a custom test schema and a custom static data task in TestDataTaskParser.

Let me know your thoughts/preference.

Contributor

I think having a custom schema for the test makes sense here

Contributor Author

copied the SnapshotsTable schema to the test class.

@stevenzwu force-pushed the issue-9597 branch 2 times, most recently from e4ba92a to 557ff4e (March 15, 2024 18:10)
Contributor Author
@stevenzwu

@nastra can you help take another look?

private void assertDataTaskEquals(StaticDataTask expected, StaticDataTask actual) {
Assertions.assertThat(expected.schema().asStruct())
.isEqualTo(actual.schema().asStruct())
.as("Schema should match");
Contributor

.as() needs to come before .isEqualTo(). Same for all the other places in this PR
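For example (AssertJ only picks up the description if it is registered before the assertion runs, and a failing assertion throws before a trailing .as() is ever reached):

// wrong: .isEqualTo() runs (and may throw) before .as() is evaluated, so the description is lost
assertThat(expected.schema().asStruct())
    .isEqualTo(actual.schema().asStruct())
    .as("Schema should match");

// right: description first, and assert on the actual value against the expected one
assertThat(actual.schema().asStruct())
    .as("Schema should match")
    .isEqualTo(expected.schema().asStruct());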

ObjectMapper mapper = new ObjectMapper();
JsonNode rootNode = mapper.reader().readTree(jsonStr);

Assertions.assertThatThrownBy(() -> DataTaskParser.fromJson(rootNode.get("str")))
Contributor

nit: for new tests it's ok to statically import assertThatThrownBy() / assertThat()


@ParameterizedTest
@ValueSource(booleans = {true, false})
public void testScanTaskParser(boolean caseSensitive) {
Contributor

IMHO this test should live in TestScanTaskParser, as it focuses on the ScanTaskParser. The same goes for testScanTaskParserWithoutTaskTypeField.

Contributor Author
@stevenzwu Apr 23, 2024

ScanTaskParser is the facade/dispatch class. I don't want to add testing for all task types into a single gigantic TestScanTaskParser class. I think it is a bit cleaner to have one test class per task type (file, data, manifest, etc.).

.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Invalid JSON string for file scan task: null");

Assertions.assertThatThrownBy(() -> ScanTaskParser.toJson(null))
Contributor

this should be in TestScanTaskParser as it's using ScanTaskParser

}

private void assertDataTaskEquals(StaticDataTask expected, StaticDataTask actual) {
assertThat(expected.schema().asStruct())
Contributor

actual/expected are the wrong way around. It should be assertThat(actual...).isEqualTo(expected...). Same for the other assertions.

Contributor Author

thanks for catching the mistake

.isTrue();
Assertions.assertThat(actual.spec()).isEqualTo(expected.spec());
Assertions.assertThat(
assertThat(expected.schema().sameSchema(actual.schema())).as("Schema should match").isTrue();
Contributor

Suggested change
assertThat(expected.schema().sameSchema(actual.schema())).as("Schema should match").isTrue();
assertThat(actual.schema().asStruct()).isEqualTo(expected.schema().asStruct());

if the assertion fails, then this will show where the schema mismatch is


List<StructLike> expectedRows = Lists.newArrayList(expected.rows());
List<StructLike> actualRows = Lists.newArrayList(actual.rows());
assertThat(actualRows).hasSize(expectedRows.size());
Contributor

Suggested change
assertThat(actualRows).hasSize(expectedRows.size());
assertThat(actualRows).hasSameSizeAs(expectedRows);

@stevenzwu requested a review from nastra (May 7, 2024 16:35)
@stevenzwu changed the title from "Core: add a new task-type field to task JSON serialization. add data task JSON serialization imp." to "Core: add a new task-type field to task JSON serialization. add data task JSON serialization implementation." (Jun 18, 2024)
@stevenzwu force-pushed the issue-9597 branch 2 times, most recently from 4ef2fb9 to be97ade (June 19, 2024 00:20)
private static final String METADATA_FILE = "metadata-file";
private static final String ROWS = "rows";

private DataTaskParser() {}
Contributor

nit: Why is this not StaticDataTaskParser? For me, the natural thing would be a 1-to-1 correspondence between the two.

Contributor Author

please see the earlier comment from Jack: #9728 (comment)

@nastra merged commit 9ed3383 into apache:main on Jun 26, 2024
41 checks passed
jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Labels
core, flink, Specification (Issues that may introduce spec changes.)
6 participants