HADOOP-19229. S3A/ABFS: Vector IO on cloud storage: increase threshold for range merging #7281
Conversation
… size?

* default min/max are now 16K and 1M
* s3a and abfs use 128K as the minimum size, 2M for max. Based on Velox min values (20K for SSD, 500K for cloud).

Change-Id: Ia4e876d9cd0ad238621faec844e34b83b0a0bcb8
🎊 +1 overall
This message was automatically generated.
one minor change and rest looks good.
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Sizes.java
@@ -76,7 +76,7 @@ on the client requirements.
 </property>
 <property>
   <name>fs.s3a.vectored.read.max.merged.size</name>
-  <value>1M</value>
+  <value>4M</value>
should be 2M right?
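For reference, here is a hedged sketch of what the updated s3a settings could look like in `core-site.xml`. Only `fs.s3a.vectored.read.max.merged.size` appears in the diff above; the companion `fs.s3a.vectored.read.min.seek.size` key is an assumption, as are the exact values, taken from the 128K/2M thresholds described in this PR:

```xml
<!-- Sketch of the s3a vector IO tuning options discussed in this PR.
     The min.seek.size key and both values are assumptions based on the
     PR description (128K minimum merge gap, 2M maximum merged range). -->
<property>
  <name>fs.s3a.vectored.read.min.seek.size</name>
  <value>128K</value>
</property>
<property>
  <name>fs.s3a.vectored.read.max.merged.size</name>
  <value>2M</value>
</property>
```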
Change-Id: If4f31d369848deea674c65b644ce530a6481c833
🎊 +1 overall
This message was automatically generated.
LGTM +1
…d for range merging (#7281)

The thresholds at which adjacent vector IO read ranges are coalesced into a single range have been increased, as has the limit beyond which ranges are considered large enough that parallel reads are faster.

* The min/max for local filesystems, and any other FS without custom support, are now 16K and 1M.
* s3a and abfs use 128K as the minimum size, 2M for max.

These values are based on the Facebook Velox paper, which stated their thresholds for merging were 20K for local SSD and 500K for cloud storage.

Contributed by Steve Loughran
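The coalescing behaviour described above can be illustrated with a minimal Java sketch. This is a hypothetical illustration, not Hadoop's actual implementation: adjacent sorted ranges are merged while the gap between them is at most the minimum seek size and the combined range stays within the maximum merged size.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of vector IO range coalescing (not Hadoop's actual code). */
public class RangeMergeSketch {
  // Thresholds from this PR for s3a/abfs: 128K minimum merge gap, 2M max merged size.
  static final int MIN_SEEK = 128 * 1024;
  static final int MAX_MERGED = 2 * 1024 * 1024;

  /** A read range covering [offset, offset + length). */
  record Range(long offset, int length) {
    long end() { return offset + length; }
  }

  /**
   * Merge adjacent ranges (assumed sorted by offset) when the gap between
   * them is within minSeek and the merged range stays within maxMerged.
   */
  static List<Range> merge(List<Range> sorted, int minSeek, int maxMerged) {
    List<Range> out = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current == null) {
        current = r;
        continue;
      }
      long gap = r.offset() - current.end();
      long mergedLen = r.end() - current.offset();
      if (gap <= minSeek && mergedLen <= maxMerged) {
        current = new Range(current.offset(), (int) mergedLen);  // coalesce
      } else {
        out.add(current);  // too far apart or too big: keep separate
        current = r;
      }
    }
    if (current != null) {
      out.add(current);
    }
    return out;
  }

  public static void main(String[] args) {
    // Two 64K reads 34K apart coalesce; a distant third range stays separate.
    List<Range> in = List.of(
        new Range(0, 65_536),
        new Range(100_000, 65_536),
        new Range(5_000_000, 1_024));
    System.out.println(merge(in, MIN_SEEK, MAX_MERGED).size());  // prints 2
  }
}
```

With a larger minimum seek size, more small adjacent reads are turned into a single GET, trading some wasted bytes in the gaps for fewer round trips to the store.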
### What changes were proposed in this pull request?

This PR aims to increase the S3A Vector IO threshold for range merging.

### Why are the changes needed?

Apache Spark 4.0.0 supports Hadoop Vectored IO via ORC and Parquet. As part of [HADOOP-18855 VectorIO API tuning/stabilization](https://issues.apache.org/jira/browse/HADOOP-18855), Apache Hadoop 3.4.2 will have new threshold default values. We had better follow these updates in advance until Apache Hadoop 3.4.2 is released.

- apache/hadoop#7281

### Does this PR introduce _any_ user-facing change?

No, Hadoop Vectored IO features are new in Apache Spark 4.0.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49748 from dongjoon-hyun/SPARK-51049.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b62c3f4)
Signed-off-by: Dongjoon Hyun <[email protected]>
HADOOP-19229
Based on the values reported in Facebook's Velox paper: 20K for SSD, 500K for cloud storage.
Also adds a new file,
org.apache.hadoop.io.Sizes,
which provides constants for various binary sizes, based on a file in the hadoop-azure test source. This
was NOT used anywhere else in the source other than in the new vector
ranges, though it MAY/SHOULD be used in future.
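A constants file like the one described is typically just a set of power-of-two byte sizes. The sketch below is illustrative; the actual constant names in `org.apache.hadoop.io.Sizes` may differ.

```java
/**
 * Illustrative sketch of binary size constants in the style of the new
 * org.apache.hadoop.io.Sizes file; the real constant names may differ.
 */
public final class SizesSketch {
  public static final int S_1K = 1024;
  public static final int S_16K = 16 * S_1K;    // default min merge gap
  public static final int S_128K = 128 * S_1K;  // s3a/abfs min merge gap
  public static final int S_1M = 1024 * S_1K;   // default max merged size
  public static final int S_2M = 2 * S_1M;      // s3a/abfs max merged size

  private SizesSketch() {
    // constants only, no instances
  }
}
```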
We should be aware that with a larger range, the possibility of failure of merged reads
may increase; #7105 HADOOP-19105 (Improve resilience in vector reads) is intended to address this. This change was on that commit chain, but has now been pulled out for independent review.
How was this patch tested?
Existing tests were rerun with assertions modified to cope with the changed defaults. This did find one unrelated regression in the abfs test suites, filed as HADOOP-19382.
No performance tests were done; any benchmark numbers here would be very insightful.
For code changes: have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?