HADOOP-19229. S3A/ABFS: Vector IO on cloud storage: increase threshold for range merging #7281
Conversation
… size?

* default min/max are now 16K and 1M
* s3a and abfs use 128K as the minimum size, 2M for max. Based on Velox min values (20K for SSD, 500K for cloud).

Change-Id: Ia4e876d9cd0ad238621faec844e34b83b0a0bcb8
🎊 +1 overall
This message was automatically generated.
one minor change and rest looks good.
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Sizes.java
@@ -76,7 +76,7 @@ on the client requirements.
 </property>
 <property>
   <name>fs.s3a.vectored.read.max.merged.size</name>
-  <value>1M</value>
+  <value>4M</value>
should be 2M right?
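For reference, here is a hedged sketch of what the updated s3a settings could look like in `core-site.xml`. Only `fs.s3a.vectored.read.max.merged.size` appears in the diff above; the companion `fs.s3a.vectored.read.min.seek.size` key is an assumption, as are the exact values, taken from the 128K/2M thresholds described in this PR:

```xml
<!-- Sketch of the s3a vector IO tuning options discussed in this PR.
     The min.seek.size key and both values are assumptions based on the
     PR description (128K minimum merge gap, 2M maximum merged range). -->
<property>
  <name>fs.s3a.vectored.read.min.seek.size</name>
  <value>128K</value>
</property>
<property>
  <name>fs.s3a.vectored.read.max.merged.size</name>
  <value>2M</value>
</property>
```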
Change-Id: If4f31d369848deea674c65b644ce530a6481c833
🎊 +1 overall
This message was automatically generated.
LGTM +1
…d for range merging (#7281)

The thresholds at which adjacent vector IO read ranges are coalesced into a single range have been increased, as has the limit beyond which ranges are considered large enough that parallel reads are faster.

* The min/max for local filesystems, and any other FS without custom support, are now 16K and 1M.
* s3a and abfs use 128K as the minimum size, 2M for max.

These values are based on the Facebook Velox paper, which stated their thresholds for merging were 20K for local SSD and 500K for cloud storage.

Contributed by Steve Loughran
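The coalescing behaviour described above can be illustrated with a minimal Java sketch. This is a hypothetical illustration, not Hadoop's actual implementation: adjacent sorted ranges are merged while the gap between them is at most the minimum seek size and the combined range stays within the maximum merged size.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of vector IO range coalescing (not Hadoop's actual code). */
public class RangeMergeSketch {
  // Thresholds from this PR for s3a/abfs: 128K minimum merge gap, 2M max merged size.
  static final int MIN_SEEK = 128 * 1024;
  static final int MAX_MERGED = 2 * 1024 * 1024;

  /** A read range covering [offset, offset + length). */
  record Range(long offset, int length) {
    long end() { return offset + length; }
  }

  /**
   * Merge adjacent ranges (assumed sorted by offset) when the gap between
   * them is within minSeek and the merged range stays within maxMerged.
   */
  static List<Range> merge(List<Range> sorted, int minSeek, int maxMerged) {
    List<Range> out = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current == null) {
        current = r;
        continue;
      }
      long gap = r.offset() - current.end();
      long mergedLen = r.end() - current.offset();
      if (gap <= minSeek && mergedLen <= maxMerged) {
        current = new Range(current.offset(), (int) mergedLen);  // coalesce
      } else {
        out.add(current);  // too far apart or too big: keep separate
        current = r;
      }
    }
    if (current != null) {
      out.add(current);
    }
    return out;
  }

  public static void main(String[] args) {
    // Two 64K reads 34K apart coalesce; a distant third range stays separate.
    List<Range> in = List.of(
        new Range(0, 65_536),
        new Range(100_000, 65_536),
        new Range(5_000_000, 1_024));
    System.out.println(merge(in, MIN_SEEK, MAX_MERGED).size());  // prints 2
  }
}
```

With a larger minimum seek size, more small adjacent reads are turned into a single GET, trading some wasted bytes in the gaps for fewer round trips to the store.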
### What changes were proposed in this pull request?

This PR aims to increase the S3A Vector IO threshold for range merging.

### Why are the changes needed?

Apache Spark 4.0.0 supports Hadoop Vectored IO via ORC and Parquet. As part of [HADOOP-18855 VectorIO API tuning/stabilization](https://issues.apache.org/jira/browse/HADOOP-18855), Apache Hadoop 3.4.2 will have new threshold default values. We had better follow these updates in advance until Apache Hadoop 3.4.2 is released.

- apache/hadoop#7281

### Does this PR introduce _any_ user-facing change?

No, Hadoop Vectored IO features are new in Apache Spark 4.0.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49748 from dongjoon-hyun/SPARK-51049.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b62c3f4)
Signed-off-by: Dongjoon Hyun <[email protected]>
HADOOP-19229
Based on the values reported in Facebook's Velox paper: 20K for SSD, 500K for cloud storage.
Also adds a new file,
org.apache.hadoop.io.Sizes,
which provides constants for various binary sizes, based on a file in the hadoop-azure test source. This
was NOT used anywhere else in the source other than in the new vector
ranges, though it MAY/SHOULD be used in future.
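A constants file like the one described is typically just a set of power-of-two byte sizes. The sketch below is illustrative; the actual constant names in `org.apache.hadoop.io.Sizes` may differ.

```java
/**
 * Illustrative sketch of binary size constants in the style of the new
 * org.apache.hadoop.io.Sizes file; the real constant names may differ.
 */
public final class SizesSketch {
  public static final int S_1K = 1024;
  public static final int S_16K = 16 * S_1K;    // default min merge gap
  public static final int S_128K = 128 * S_1K;  // s3a/abfs min merge gap
  public static final int S_1M = 1024 * S_1K;   // default max merged size
  public static final int S_2M = 2 * S_1M;      // s3a/abfs max merged size

  private SizesSketch() {
    // constants only, no instances
  }
}
```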
We should be aware that with a larger range, the possibility of failure of merged reads
may increase; #7105 HADOOP-19105 (Improve resilience in vector reads) is intended to address this. This change was on that commit chain, but has now been pulled out for independent review.
How was this patch tested?
Existing tests were rerun with assertions modified to cope with the changed defaults. This did find one unrelated regression in the abfs test suites, filed as HADOOP-19382.
No performance tests were done; any benchmark numbers here would be very insightful.
For code changes: have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?