
[Improve][Connector-V2][Iceberg] Add hadoop s3 catalog e2e testcase #5745

Merged: 2 commits merged into apache:dev on Jun 15, 2024

Conversation

@4chicat (Contributor) commented Oct 27, 2023

Purpose of this pull request

We use S3 as the Iceberg warehouse storage. When catalog_type is hadoop, only HDFS is supported out of the box, so we customized the connector to support S3 as well. I want to contribute that customized code, so I am opening this pull request.

Does this PR introduce any user-facing change?

Yes. Iceberg tables stored in S3 can now be read.

Config example:

source {
  iceberg {
    catalog_name = "seatunnel"
    iceberg.catalog.config={
      "type"="hadoop"
      "warehouse"="s3a://your_bucket/spark/warehouse/"
    }
    hadoop.config={
      "fs.s3a.path.style.access" = "true"
      "fs.s3a.connection.ssl.enabled" = "false"
      "fs.s3a.signing.algorithm" = "S3SignerType"
      "fs.s3a.encryption.algorithm" = "AES256"
      "fs.s3a.connection.timeout" = "3000"
      "fs.s3a.aws.credentials.provider" = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
      "fs.s3a.endpoint" = "http://minio:9000"
      "fs.s3a.access.key" = "xxx"
      "fs.s3a.secret.key" = "xxx"
      "fs.defaultFS" = "s3a://xxx"
    }
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"
    result_table_name = "iceberg_test"
  }
}
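For context, here is a minimal standalone sketch of how these settings fit together conceptually: the hadoop.config entries behave like Hadoop Configuration properties, and catalog type hadoop with an s3a:// warehouse corresponds to an Iceberg HadoopCatalog rooted at that path. The values below are the placeholders from the example above, hadoop-aws and the AWS SDK must be on the classpath, and the connector's actual internal wiring may differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class IcebergS3CatalogSketch {
    public static void main(String[] args) {
        // Mirror the hadoop.config block from the example above.
        Configuration conf = new Configuration();
        conf.set("fs.s3a.endpoint", "http://minio:9000");
        conf.set("fs.s3a.access.key", "xxx");
        conf.set("fs.s3a.secret.key", "xxx");
        conf.set("fs.s3a.path.style.access", "true");
        conf.set("fs.s3a.connection.ssl.enabled", "false");
        conf.set(
                "fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");

        // catalog type "hadoop" with an s3a:// warehouse resolves to a Hadoop catalog.
        HadoopCatalog catalog = new HadoopCatalog(conf, "s3a://your_bucket/spark/warehouse/");
        Table table =
                catalog.loadTable(
                        TableIdentifier.of("your_iceberg_database", "your_iceberg_table"));
        System.out.println(table.schema());
    }
}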

How was this patch tested?

Added an e2e test case: connector-iceberg-s3-e2e.

Check list

@4chicat force-pushed the support-s3-warehouse branch 2 times, most recently from 931af66 to b1c6e40 on October 27, 2023 14:13
@4chicat changed the title from "[Improve][Connector-V2][Iceberg] Support for S3File in hadoop catalog" to "[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog" on Oct 27, 2023
@4chicat force-pushed the support-s3-warehouse branch from b1c6e40 to 7ceedd0 on October 27, 2023 14:18
@Hisoka-X (Member) commented:

cc @ic4y

@Hisoka-X added the First-time contributor label on Oct 30, 2023
@EricJoy2048 (Member) left a comment:

LGTM, cc @ic4y

EricJoy2048 previously approved these changes on Oct 30, 2023
@Hisoka-X (Member) left a comment:

Shall we add a test case for this feature?

@4chicat (Contributor, Author) commented Oct 31, 2023

Shall we add a test case for this feature?

I don't think it's going to be easy because it requires an S3 bucket.

@Hisoka-X (Member) commented:

I think we can create a minio container in e2e test case for s3a file read.
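For illustration only, a minimal sketch of what such a MinIO container could look like with Testcontainers; the image tag, credentials, and network alias below are assumptions for this sketch, not necessarily what the final e2e test uses:

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;

public class MinioContainerSketch {
    public static void main(String[] args) {
        Network network = Network.newNetwork();
        // Placeholder image tag and credentials; adjust to whatever the e2e suite pins.
        GenericContainer<?> minio =
                new GenericContainer<>("minio/minio:latest")
                        .withNetwork(network)
                        .withNetworkAliases("minio") // matches the fs.s3a.endpoint host in the config
                        .withEnv("MINIO_ROOT_USER", "minioadmin")
                        .withEnv("MINIO_ROOT_PASSWORD", "minioadmin")
                        .withExposedPorts(9000)
                        .withCommand("server", "/data");
        minio.start();
        // The s3a endpoint a job config would point at from the host side.
        System.out.println(
                "MinIO endpoint: http://" + minio.getHost() + ":" + minio.getMappedPort(9000));
    }
}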

@4chicat (Contributor, Author) commented Nov 4, 2023

@EricJoy2048
Thank you so much for the review. 

I made an additional commit today adding the e2e test case.
However, your previous review was automatically dismissed by the system.
I apologize for that.

@4chicat force-pushed the support-s3-warehouse branch 4 times, most recently from 21f2ce6 to 2dfb53a on November 10, 2023 15:16
@4chicat (Contributor, Author) commented Nov 11, 2023

I think we can create a minio container in e2e test case for s3a file read.

I created a MinIO test container and added the e2e test case.
It passes on Spark 3, but fails on Flink/SeaTunnel/Spark 2.

I think I need to add the hadoop-aws/aws-java-sdk jar files to the test container Docker image, and I need your help.

  • spark2 : hadoop-aws-2.7.5, aws-java-sdk-1.7.4

I'm also looking to see if there are other workarounds.

@Hisoka-X (Member) commented:

Thanks @4chicat, I will take a look in a few days.

@4chicat force-pushed the support-s3-warehouse branch 2 times, most recently from 98537cc to 1e195be on December 8, 2023 00:28
@4chicat force-pushed the support-s3-warehouse branch from 1e195be to 4cdebc3 on February 5, 2024 08:14
Member commented:

Why add this file?

@4chicat (Contributor, Author) replied:

connector-iceberg-s3-e2e e2e case was written based on connector-iceberg-hadoop3-e2e.
This file was created by copying the file below.
https://github.com/apache/seatunnel/blob/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-iceberg-hadoop3-e2e/src/test/resources/log4j.properties

Should I delete it?

Member replied:

connector-iceberg-s3-e2e e2e case was written based on connector-iceberg-hadoop3-e2e. This file was created by copying the file below. https://github.com/apache/seatunnel/blob/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-iceberg-hadoop3-e2e/src/test/resources/log4j.properties

Should I delete it?

It is already included in the engine and can be removed

@4chicat (Contributor, Author) replied:

connector-iceberg-s3-e2e e2e case was written based on connector-iceberg-hadoop3-e2e. This file was created by copying the file below. https://github.com/apache/seatunnel/blob/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-iceberg-hadoop3-e2e/src/test/resources/log4j.properties
Should I delete it?

It is already included in the engine and can be removed

Removed the log4j.properties file.

@4chicat requested a review from liugddx on February 5, 2024 08:37
@hailin0 (Member) left a comment:

LGTM

cc @Hisoka-X
[screenshot attached]

@Hisoka-X (Member) commented:

Looks like this PR introduces new leaked threads in the SeaTunnel server. Can you try closing these leaking threads? If not, please add them to the known issues list.

@4chicat force-pushed the support-s3-warehouse branch from d6536b1 to 2fb59ec on February 19, 2024 07:55
Comment on lines 255 to 259
public static List<TestContainer> discoverTestSpark3Containers() {
return discoverTestContainers().stream()
.filter(testContainer -> testContainer instanceof Spark3Container)
.collect(Collectors.toList());
}
Member commented:

please remove

@4chicat (Contributor, Author) replied:

It has been removed.

Comment on lines 231 to 233
|| s.contains("java-sdk-http-connection-reaper")
|| s.contains("Timer for 's3a-file-system' metrics system")
|| s.startsWith("MutableQuantiles-")
Member left a comment.
@4chicat (Contributor, Author) commented Feb 19, 2024

Moved to isIssueWeAlreadyKnow method.
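For context, a rough sketch of what such a known-issues whitelist can look like; the method name comes from the comment above, but its exact signature and location in the SeaTunnel e2e base classes are assumptions here:

// Sketch only: thread names come from the review thread above; the real method's
// signature and surrounding class may differ in the merged code.
private static boolean isIssueWeAlreadyKnow(String threadName) {
    // Threads left behind by the Iceberg S3 Hadoop catalog / s3a filesystem.
    return threadName.contains("java-sdk-http-connection-reaper")
            || threadName.contains("Timer for 's3a-file-system' metrics system")
            || threadName.startsWith("MutableQuantiles-");
}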

@4chicat force-pushed the support-s3-warehouse branch 2 times, most recently from 2078dff to cedd449 on February 20, 2024 13:45
@4chicat force-pushed the support-s3-warehouse branch from cedd449 to 97222fc on March 8, 2024 01:32
@4chicat requested review from Hisoka-X and hailin0 on March 8, 2024 04:10
@EricJoy2048 (Member) commented:

This PR is very valuable, and I think we need to look at it again and review it.

@4chicat force-pushed the support-s3-warehouse branch from 97222fc to 861a07c on April 8, 2024 00:45
@DisabledOnContainer(
value = {TestContainerId.SPARK_2_4},
type = {EngineType.FLINK},
disabledReason = "Needs hadoop-aws,aws-java-sdk jar for flink and spark2.4.")
Member commented:

Could you please help me annotate this section in the usage document and provide an example for AWS S3?

@4chicat (Contributor, Author) replied:

Thanks for your comment. Added to the document.
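For illustration, a possible AWS S3 (non-MinIO) variant of the hadoop.config block; the endpoint, credential provider, and keys below are placeholder assumptions, so check the updated Iceberg source documentation for the exact supported options:

hadoop.config={
  "fs.s3a.endpoint" = "s3.amazonaws.com"
  "fs.s3a.aws.credentials.provider" = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
  "fs.s3a.access.key" = "your_access_key"
  "fs.s3a.secret.key" = "your_secret_key"
  "fs.defaultFS" = "s3a://your_bucket"
}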

@4chicat requested a review from Carl-Zhou-CN on April 8, 2024 02:10
// Iceberg S3 Hadoop catalog
|| threadName.contains("java-sdk-http-connection-reaper")
|| threadName.contains("Timer for 's3a-file-system' metrics system")
|| threadName.startsWith("MutableQuantiles-");
Member commented:

LGTM

Comment on lines 55 to 59
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop3.version}</version>
</dependency>
Member commented:

Why add hadoop-aws to hadoop shade jar?

@4chicat (Contributor, Author) replied:

Changed to not add hadoop-aws.

@EricJoy2048 (Member) commented:

I think we can create a minio container in e2e test case for s3a file read.

I created a MinIO test container and added the e2e test case. It passes on Spark 3, but fails on Flink/SeaTunnel/Spark 2.

I think I need to add the hadoop-aws/aws-java-sdk jar files to the test container Docker image, and I need your help.

  • spark2 : hadoop-aws-2.7.5, aws-java-sdk-1.7.4

I'm also looking to see if there are other workarounds.

Like the driver jars, you can add hadoop-aws and aws-java-sdk to the test container Docker image, as is done in seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-oss-e2e/src/test/java/org/apache/seatunnel/e2e/connector/file/oss/OssFileIT.java:

@TestContainerExtension
    private final ContainerExtendedFactory extendedFactory =
            container -> {
                Container.ExecResult extraCommands =
                        container.execInContainer(
                                "bash",
                                "-c",
                                "mkdir -p /tmp/seatunnel/plugins/oss/lib && cd /tmp/seatunnel/plugins/oss/lib && curl -O "
                                        + OSS_SDK_DOWNLOAD);
                Assertions.assertEquals(0, extraCommands.getExitCode());

                extraCommands =
                        container.execInContainer(
                                "bash",
                                "-c",
                                "cd /tmp/seatunnel/plugins/oss/lib && curl -O " + JDOM_DOWNLOAD);
                Assertions.assertEquals(0, extraCommands.getExitCode());

                extraCommands =
                        container.execInContainer(
                                "bash",
                                "-c",
                                "cd /tmp/seatunnel/plugins/oss/lib && curl -O "
                                        + HADOOP_ALIYUN_DOWNLOAD);
                Assertions.assertEquals(0, extraCommands.getExitCode());

                extraCommands =
                        container.execInContainer(
                                "bash",
                                "-c",
                                "cd /tmp/seatunnel/lib && curl -O " + OSS_SDK_DOWNLOAD);
                Assertions.assertEquals(0, extraCommands.getExitCode());

                extraCommands =
                        container.execInContainer(
                                "bash", "-c", "cd /tmp/seatunnel/lib && curl -O " + JDOM_DOWNLOAD);
                Assertions.assertEquals(0, extraCommands.getExitCode());

                extraCommands =
                        container.execInContainer(
                                "bash",
                                "-c",
                                "cd /tmp/seatunnel/lib && curl -O " + HADOOP_ALIYUN_DOWNLOAD);
                Assertions.assertEquals(0, extraCommands.getExitCode());
            };
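By analogy, a minimal sketch of what the same pattern might look like for this PR's S3 dependencies; the artifact versions, download URLs, and target path below are assumptions for illustration, not necessarily the values used in the merged e2e test:

// Sketch: placeholder URLs for the hadoop-aws / aws-java-sdk-bundle artifacts.
private static final String HADOOP_AWS_DOWNLOAD =
        "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.4/hadoop-aws-3.1.4.jar";
private static final String AWS_SDK_DOWNLOAD =
        "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar";

@TestContainerExtension
private final ContainerExtendedFactory extendedFactory =
        container -> {
            // Download both jars into the engine's lib directory inside the test container.
            Container.ExecResult extraCommands =
                    container.execInContainer(
                            "bash",
                            "-c",
                            "cd /tmp/seatunnel/lib && curl -O " + HADOOP_AWS_DOWNLOAD
                                    + " && curl -O " + AWS_SDK_DOWNLOAD);
            Assertions.assertEquals(0, extraCommands.getExitCode());
        };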

@4chicat changed the title from "[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog" to "[Improve][Connector-V2][Iceberg] Add hadoop s3 catalog e2e testcase" on May 1, 2024
[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # bugfix

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # add e2e test case

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # add @DisabledOnContainer

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # remove log4j.properties

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # add removeSystemThread

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # move and remove code

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # resolve conflict

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # mvn spotless:apply

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # modify IcebergSourceIT

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # modify doc iceberg source

[Improve][Connector-V2][Iceberg] Support for S3 in hadoop catalog # copy jars to test container docker
@4chicat force-pushed the support-s3-warehouse branch from d321ae0 to 1a89e8e on May 1, 2024 08:39
@4chicat force-pushed the support-s3-warehouse branch from 6624f28 to fe14b3a on May 8, 2024 10:35
@hailin0 (Member) left a comment:

LGTM

@hailin0 merged commit 00a7014 into apache:dev on Jun 15, 2024
6 checks passed
chaorongzhi pushed a commit to chaorongzhi/seatunnel that referenced this pull request Aug 21, 2024
Projects: None yet
6 participants