HADOOP-19229. S3A/ABFS: Vector IO on cloud storage: increase threshold for range merging #7281

Merged

@steveloughran steveloughran commented Jan 9, 2025

HADOOP-19229

  • default min/max are now 16K and 1M
  • s3a and abfs use 128K as the minimum size, 2M for max.

Based on the values reported in Facebook's Velox paper (20K for SSD, 500K for cloud storage).
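
To make the two thresholds concrete, here is a minimal Python sketch of range coalescing under the values listed above. It is an illustration only, not the Hadoop implementation: `merge_ranges`, its parameter names, and the sample offsets are hypothetical, and the exact comparison rules (strict vs. inclusive) are an assumption.

```python
KB = 1024
MB = 1024 * KB


def merge_ranges(ranges, min_seek, max_merged):
    """Coalesce (offset, length) read ranges.

    Assumed rule: two neighbouring ranges are combined when the gap
    between them is smaller than min_seek and the combined span does
    not exceed max_merged. Returns a list of (offset, length) tuples.
    """
    merged = []  # holds (start, end) pairs while merging
    for offset, length in sorted(ranges):
        if merged:
            start, end = merged[-1]
            gap = offset - end
            new_end = max(end, offset + length)
            if gap < min_seek and new_end - start <= max_merged:
                merged[-1] = (start, new_end)
                continue
        merged.append((offset, offset + length))
    return [(start, end - start) for start, end in merged]


# Four small reads, as a columnar file reader might issue them (made-up offsets).
ranges = [(0, 4 * KB), (16 * KB, 4 * KB), (100 * KB, 4 * KB), (3 * MB, 4 * KB)]

generic = merge_ranges(ranges, min_seek=16 * KB, max_merged=1 * MB)
cloud = merge_ranges(ranges, min_seek=128 * KB, max_merged=2 * MB)

print(len(generic), "requests with the generic thresholds")
print(len(cloud), "requests with the s3a/abfs thresholds")
```

With the generic thresholds the four sample reads become three requests; with the s3a/abfs thresholds they become two, at the cost of fetching some bytes that were never asked for.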

Also adds a new file, org.apache.hadoop.io.Sizes, which provides constants for
various binary sizes, based on a file in the hadoop-azure test source. This
is NOT used anywhere else in the source other than in the new vector
ranges, though it MAY/SHOULD be used in future.
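
As a rough picture of what a shared file of binary size constants buys, here is a small Python sketch. The constant names are hypothetical and are not taken from org.apache.hadoop.io.Sizes; only the byte values and the threshold pairs come from this PR.

```python
# Hypothetical constant names; values are powers of two, in bytes.
S_1K = 1024
S_16K = 16 * S_1K
S_128K = 128 * S_1K
S_1M = 1024 * S_1K
S_2M = 2 * S_1M

# (minimum seek size, maximum merged size) pairs described in this PR.
DEFAULT_THRESHOLDS = (S_16K, S_1M)    # local and other filesystems
CLOUD_THRESHOLDS = (S_128K, S_2M)     # s3a and abfs

for name, (min_seek, max_merged) in [("default", DEFAULT_THRESHOLDS),
                                     ("cloud", CLOUD_THRESHOLDS)]:
    print(f"{name}: min seek {min_seek} bytes, max merged {max_merged} bytes")
```

Naming the sizes once avoids scattering literals such as `131072` through the code, where a dropped digit is easy to miss.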

We should be aware that with a larger range, the possibility of failure of merged reads
may increase; #7105 (HADOOP-19105. Improve resilience in vector reads.) is intended to address this. This change was on that commit chain, but has now been pulled out for independent review.

How was this patch tested?

Existing tests rerun with assertions modified to cope with the changed defaults. Did find one unrelated regression in abfs test suites, filed as HADOOP-19382.

No performance tests were done; any numbers here would be very insightful.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

… size?

* default min/max are now 16K and 1M
* s3a and abfs use 128K as the minimum size, 2M for max.

Based on Velox min values (20K for SSD, 500K for cloud).

Change-Id: Ia4e876d9cd0ad238621faec844e34b83b0a0bcb8
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 5m 19s Maven dependency ordering for branch
+1 💚 mvninstall 22m 8s trunk passed
+1 💚 compile 10m 18s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 9m 7s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 2m 15s trunk passed
+1 💚 mvnsite 2m 2s trunk passed
+1 💚 javadoc 1m 46s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 1m 15s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 2m 52s trunk passed
+1 💚 shadedclient 24m 25s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 24s Maven dependency ordering for patch
+1 💚 mvninstall 1m 17s the patch passed
+1 💚 compile 9m 6s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 9m 6s the patch passed
+1 💚 compile 7m 47s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 7m 47s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 56s /results-checkstyle-root.txt root: The patch generated 2 new + 3 unchanged - 0 fixed = 5 total (was 3)
+1 💚 mvnsite 2m 8s the patch passed
+1 💚 javadoc 1m 47s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 1m 36s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 3m 25s the patch passed
+1 💚 shadedclient 19m 34s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 16m 24s hadoop-common in the patch passed.
+1 💚 unit 2m 21s hadoop-aws in the patch passed.
+1 💚 unit 2m 15s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 41s The patch does not generate ASF License warnings.
155m 38s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7281/1/artifact/out/Dockerfile
GITHUB PR #7281
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 5ab4dc5dfe6e 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / ab0ebe2
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7281/1/testReport/
Max. process+thread count 1269 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7281/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@mukund-thakur mukund-thakur left a comment

one minor change and rest looks good.

@@ -76,7 +76,7 @@ on the client requirements.
</property>
<property>
<name>fs.s3a.vectored.read.max.merged.size</name>
-<value>1M</value>
+<value>4M</value>

should be 2M right?
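
For reference, the byte values behind the suffixed settings discussed here can be worked out with a small Python sketch. The `parse_size` helper is hypothetical and only assumes binary (1024-based) suffixes; it is not Hadoop's own parser.

```python
# Binary multipliers for the suffixes used in this thread.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a value such as '128K' or '2M' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # plain number of bytes, no suffix


for value in ("1M", "2M", "4M"):
    print(value, "=", parse_size(value), "bytes")
```

The documented value and the value set in code should resolve to the same byte count, which is the mismatch the comment above points at.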

Change-Id: If4f31d369848deea674c65b644ce530a6481c833
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 20s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 5m 59s Maven dependency ordering for branch
+1 💚 mvninstall 19m 12s trunk passed
+1 💚 compile 10m 21s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 9m 16s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 2m 21s trunk passed
+1 💚 mvnsite 2m 9s trunk passed
+1 💚 javadoc 1m 49s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 1m 21s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 3m 24s trunk passed
+1 💚 shadedclient 23m 27s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 20s Maven dependency ordering for patch
+1 💚 mvninstall 1m 9s the patch passed
+1 💚 compile 9m 58s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 9m 58s the patch passed
+1 💚 compile 8m 21s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 8m 21s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 2m 19s /results-checkstyle-root.txt root: The patch generated 2 new + 3 unchanged - 0 fixed = 5 total (was 3)
+1 💚 mvnsite 1m 53s the patch passed
+1 💚 javadoc 1m 23s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 1m 19s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 3m 27s the patch passed
+1 💚 shadedclient 21m 22s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 16m 43s hadoop-common in the patch passed.
+1 💚 unit 2m 3s hadoop-aws in the patch passed.
+1 💚 unit 2m 7s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
154m 58s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7281/2/artifact/out/Dockerfile
GITHUB PR #7281
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 28ccbe5d0963 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 212dc9d
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7281/2/testReport/
Max. process+thread count 2152 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7281/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@mukund-thakur mukund-thakur left a comment

LGTM +1

@steveloughran steveloughran merged commit c3e3228 into apache:trunk Jan 15, 2025
4 checks passed
asfgit pushed a commit that referenced this pull request Jan 15, 2025
…d for range merging (#7281)

The thresholds at which adjacent vector IO read ranges are coalesced into a
single range have been increased, as has the limit at which they are
considered large enough that parallel reads are faster.

* The min/max for local filesystems and any other FS without custom support are
now 16K and 1M
* s3a and abfs use 128K as the minimum size, 2M for max.

These values are based on the Facebook Velox paper, which stated that
its thresholds for merging were 20K for local SSD and 500K for cloud storage.

Contributed by Steve Loughran
dongjoon-hyun added a commit to apache/spark that referenced this pull request Jan 31, 2025
### What changes were proposed in this pull request?

This PR aims to increase S3A Vector IO threshold for range merge.

### Why are the changes needed?

Apache Spark 4.0.0 supported Hadoop Vectored IO via ORC and Parquet.

As a part of [HADOOP-18855 VectorIO API tuning/stabilization](https://issues.apache.org/jira/browse/HADOOP-18855), Apache Hadoop 3.4.2 will have new threshold default values. We had better adopt these updates in advance, until Apache Hadoop 3.4.2 is released.

- apache/hadoop#7281
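
As a sketch of what adopting the values early could look like from the Spark side, the Python below builds the corresponding `spark.hadoop.`-prefixed settings. It assumes the s3a values from apache/hadoop#7281 (128K and 2M) are the ones wanted; `as_spark_conf` is a hypothetical helper, not part of Spark.

```python
# S3A vectored-read options with the values described in apache/hadoop#7281.
S3A_VECTOR_OPTIONS = {
    "fs.s3a.vectored.read.min.seek.size": "128K",
    "fs.s3a.vectored.read.max.merged.size": "2M",
}


def as_spark_conf(hadoop_options):
    """Prefix Hadoop option names so Spark passes them to the Hadoop configuration."""
    return {"spark.hadoop." + key: value for key, value in hadoop_options.items()}


spark_conf = as_spark_conf(S3A_VECTOR_OPTIONS)
for key, value in sorted(spark_conf.items()):
    print(f"--conf {key}={value}")
```

The printed lines are in the form accepted by `spark-submit`; the same pairs could equally be placed in `spark-defaults.conf`.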

### Does this PR introduce _any_ user-facing change?

No, Hadoop Vectored IO features are new in Apache Spark 4.0.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49748 from dongjoon-hyun/SPARK-51049.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit to apache/spark that referenced this pull request Jan 31, 2025
(cherry picked from commit b62c3f4)
Signed-off-by: Dongjoon Hyun <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Jan 31, 2025