[Bug] [Iceberg] Iceberg Source use multiple parallelism encountering lost data #5661

SamealD · 2023-10-19T07:28:20Z

Search before asking

I searched the issues and found no similar issues.

What happened

The Iceberg source loses data when it runs with a parallelism greater than 1.
With parallelism = 1, no data is lost. With parallelism = 2 or more, data is lost.

SeaTunnel Version

SeaTunnel 2.3.3

SeaTunnel Config

env {
  parallelism = 2
  job.mode = "BATCH"
  checkpoint.interval = 50000
}

source {
  Iceberg {
    catalog_name = "hadoop_prod"
    catalog_type = "hadoop"
    warehouse="hdfs://***:8020/warehouse/hive/test-iceberg"
    namespace = "test01"
    table = "test_table01"
  }
}

sink {
  Console {
  }
}

Running Command

bin/seatunnel.sh --config jobconf/iceberg_to_local.conf

Error Exception

no Error Exception

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

This is my Iceberg table data count:

If I set parallelism = 1, I get the following statistics:

       Job Statistic Information

Start Time : 2023-10-19 14:41:10
End Time : 2023-10-19 14:41:16
Total Time(s) : 5
Total Read Count : 2000002
Total Write Count : 2000002
Total Failed Count : 0

If I set parallelism = 2, I get the following statistics:

       Job Statistic Information

Start Time : 2023-10-19 14:48:58
End Time : 2023-10-19 14:49:01
Total Time(s) : 3
Total Read Count : 1000001
Total Write Count : 1000001
Total Failed Count : 0

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct


SamealD · 2023-10-19T07:34:56Z

The screenshot failed to upload. It showed that the original Iceberg table data count is 2000002.

kangdw0x80 · 2023-10-27T01:49:02Z

It is a bug.
The Iceberg connector assigns files to the readers based on each file's path (in the addPendingSplits function):

(https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/source/enumerator/AbstractSplitEnumerator.java#L110C16-L111C16)

int ownerReader = newSplit.splitId().hashCode() % numReaders;

The splitId in the Iceberg source is the file path:

    public String splitId() {
        return task.file().path().toString();
    }

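However, hashCode() can return a negative value, which happens easily with long path strings because the int hash overflows. In Java, the % operator keeps the sign of the left operand, so the result can be negative too.

This value is used as the reader ID.
So when the value is negative, the connector cannot assign the Iceberg file to any reader, and the data in that file is never read.
The fix is to change the code: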

-  int ownerReader = newSplit.splitId().hashCode() % numReaders;
+  int ownerReader = (newSplit.splitId().hashCode() & Integer.MAX_VALUE) % numReaders;
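The difference between the two expressions can be checked in isolation. The following is a minimal sketch, not the connector's code: the class name OwnerReaderDemo, the helper methods naiveOwner and maskedOwner, and the sample path are all hypothetical and only mirror the two one-line expressions quoted above.

```java
public class OwnerReaderDemo {

    // Expression before the change: the result keeps the sign of the hash,
    // so a negative hash can yield a negative reader index.
    public static int naiveOwner(int hash, int numReaders) {
        return hash % numReaders;
    }

    // Expression after the change: clearing the sign bit first makes the
    // value non-negative, so the result is always in [0, numReaders).
    public static int maskedOwner(int hash, int numReaders) {
        return (hash & Integer.MAX_VALUE) % numReaders;
    }

    public static void main(String[] args) {
        int numReaders = 2;

        // Hypothetical split ID; in the connector this is a data file path.
        String splitId = "hdfs://example:8020/warehouse/test01/test_table01/data/part-00000.parquet";

        int[] hashes = {7, -7, Integer.MIN_VALUE, splitId.hashCode()};
        for (int hash : hashes) {
            System.out.printf("hash=%d naive=%d masked=%d%n",
                    hash, naiveOwner(hash, numReaders), maskedOwner(hash, numReaders));
        }
    }
}
```

For example, with numReaders = 2, a hash of -7 gives -7 % 2 = -1 with the original expression, which matches no reader, while the masked expression gives 1. Math.floorMod(hash, numReaders) would also produce a non-negative index, but the change proposed in this issue uses the mask.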

SamealD added the bug label Oct 19, 2023

This was referenced Oct 27, 2023

[BUG][Connector-V2][Iceberg] Iceberg source connector lost data with … #5729

Closed

[BUG][Connector-V2] Iceberg source lost data with parallelism option #5732

Merged

Hisoka-X closed this as completed in #5732 Oct 28, 2023
