HADOOP-11867. Add a high-performance vectored read API. #3904

mukund-thakur · 2022-01-19T08:00:04Z

Description of PR

Adds support for an asynchronous API for reading multiple ranges in PositionedReadable. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses, especially object store implementations, can provide more efficient readers.
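To make the default behaviour concrete, here is a minimal, self-contained sketch of the idea: each requested range carries a future, and the fallback simply serves the ranges one by one with a blocking positioned read. The names `VectoredReadSketch`, `Range`, `PositionedSource` and `inMemory` are hypothetical stand-ins chosen for illustration; they are not the classes or signatures in this patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Illustration only: hypothetical types, not the Hadoop API. */
final class VectoredReadSketch {

  /** One requested range plus the future that will hold its bytes. */
  static final class Range {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Minimal positioned-read source (assumed shape, for the sketch). */
  interface PositionedSource {
    void readFully(long position, byte[] buffer, int offset, int length);

    /** Fallback: serve each range in turn with a blocking read. */
    default void readVectored(List<Range> ranges) {
      for (Range r : ranges) {
        try {
          byte[] buf = new byte[r.length];
          readFully(r.offset, buf, 0, r.length);
          r.data.complete(ByteBuffer.wrap(buf));
        } catch (RuntimeException e) {
          // A failed range fails only its own future.
          r.data.completeExceptionally(e);
        }
      }
    }
  }

  /** In-memory source so the sketch runs without a file system. */
  static PositionedSource inMemory(byte[] file) {
    return (position, buffer, offset, length) -> {
      if (position < 0 || position + length > file.length) {
        throw new IndexOutOfBoundsException("range past end of data");
      }
      System.arraycopy(file, (int) position, buffer, offset, length);
    };
  }

  public static void main(String[] args) {
    byte[] file = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
    List<Range> ranges = Arrays.asList(new Range(2, 3), new Range(10, 4));
    inMemory(file).readVectored(ranges);
    for (Range r : ranges) {
      // Prints "234" and then "abcd".
      System.out.println(new String(r.data.join().array(), StandardCharsets.US_ASCII));
    }
  }
}
```

A store-specific subclass would override the fallback to issue the reads in parallel or to coalesce them, while callers keep waiting on the same per-range futures.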

How was this patch tested?

Added benchmarks.
Added unit tests.
Added new contract tests for new API spec.

For code changes:

Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Conflicts:
    hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
    hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
    hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java
    hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
    pom.xml

mukund-thakur · 2022-01-19T08:03:11Z

This is based on #3499. The plan is to merge this as a base commit in the feature branch and then split the work on the final changes (other optimisations and features) into new PRs, each with its own JIRA.


steveloughran

apart from minor details, this is ready to go into the branch for testing/tuning.

...oop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractVectoredReadTest.java

hadoop-tools/hadoop-benchmark/src/main/java/org/apache/hadoop/benchmark/package-info.java

hadoop-yetus · 2022-01-28T03:34:13Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 44s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 1s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	shelldocs	0m 1s		Shelldocs was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 8 new or modified test files.
			_ feature-vectored-io Compile Tests _
+0 🆗	mvndep	13m 7s		Maven dependency ordering for branch
+1 💚	mvninstall	24m 37s		feature-vectored-io passed
+1 💚	compile	26m 10s		feature-vectored-io passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚	compile	21m 16s		feature-vectored-io passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	3m 41s		feature-vectored-io passed
+1 💚	mvnsite	27m 27s		feature-vectored-io passed
+1 💚	javadoc	8m 34s		feature-vectored-io passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚	javadoc	8m 32s		feature-vectored-io passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+0 🆗	spotbugs	0m 21s		branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 💚	shadedclient	55m 25s		branch has no errors when building and testing our client artifacts.
-0 ⚠️	patch	55m 47s		Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 48s		Maven dependency ordering for patch
+1 💚	mvninstall	29m 48s		the patch passed
+1 💚	compile	23m 59s		the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚	javac	23m 59s		the patch passed
+1 💚	compile	21m 11s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	javac	21m 11s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
-0 ⚠️	checkstyle	3m 37s	/results-checkstyle-root.txt	root: The patch generated 1 new + 78 unchanged - 3 fixed = 79 total (was 81)
+1 💚	mvnsite	22m 54s		the patch passed
+1 💚	shellcheck	0m 0s		No new issues.
+1 💚	xml	0m 10s		The patch has no ill-formed XML file.
+1 💚	javadoc	8m 36s		the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚	javadoc	8m 32s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+0 🆗	spotbugs	0m 21s		hadoop-project has no data from spotbugs
+1 💚	shadedclient	56m 3s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	794m 56s	/patch-unit-root.txt	root in the patch passed.
+1 💚	asflicense	1m 38s		The patch does not generate ASF License warnings.
		1196m 55s

Reason	Tests
Failed junit tests	hadoop.yarn.csi.client.TestCsiClient

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3904/5/artifact/out/Dockerfile
GITHUB PR	#3904
Optional Tests	dupname asflicense codespell shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xml
uname	Linux f5cb218f3898 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-vectored-io / `326557d`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3904/5/testReport/
Max. process+thread count	3075 (vs. ulimit of 5500)
modules	C: hadoop-project hadoop-common-project/hadoop-common hadoop-common-project hadoop-tools/hadoop-aws hadoop-tools/hadoop-benchmark hadoop-tools . U: .
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3904/5/console
versions	git=2.25.1 maven=3.6.3 shellcheck=0.7.0 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

mukund-thakur · 2022-01-28T06:28:58Z

Yetus is failing because of one known test failure: https://issues.apache.org/jira/browse/YARN-10788

./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:551: ChecksumFSOutputSummer(ChecksumFileSystem fs,:5: More than 7 parameters (found 8). [ParameterNumber]

Also, this is the only checkstyle error, and it was not introduced by my patch; it was already there.

I think it is time to merge this now.

steveloughran · 2022-02-01T10:51:23Z

+1 to merge into this branch as feature complete; now we follow up with the final integration work.

Remove the extended timeout and let's see if things are good. If not, the change can go in as a single patch into trunk.
This PR can/should still be rebased as relevant changes go in; it just needs to be co-ordinated with anyone else with the branch checked out.
But I do want the patches to go in as independent commits for better tracing, with "part of HADOOP-11867" in each patch so a git log --grep will find them all.

part of HADOOP-18103. Add support for a multiple-range vectored read API in PositionedReadable. The default iterates through the ranges to read each synchronously, but the intent is that FSDataInputStream subclasses can make more efficient readers, especially in object store implementations. Also added an implementation in S3A where smaller ranges are merged and sliced byte buffers are returned to the readers. All the merged ranges are fetched from S3 asynchronously. Contributed by Owen O'Malley and Mukund Thakur
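The merge-and-slice approach described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the S3A code: the class `RangeMergeSketch`, the `maxGap` and `maxSize` thresholds, and the in-memory `fetch` that stands in for an asynchronous S3 GET are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Illustration only: hypothetical types and thresholds, not the S3A code. */
final class RangeMergeSketch {

  static final class Range {
    final long offset;
    final int length;

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }

    long end() {
      return offset + length;
    }
  }

  /** A merged span covering one or more of the caller's ranges. */
  static final class Combined {
    long offset;
    long end;
    final List<Range> members = new ArrayList<>();

    Combined(Range first) {
      offset = first.offset;
      end = first.end();
      members.add(first);
    }
  }

  /** Sort by offset, then merge neighbours whose gap and total size are small. */
  static List<Combined> merge(List<Range> ranges, long maxGap, long maxSize) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong((Range r) -> r.offset));
    List<Combined> out = new ArrayList<>();
    Combined cur = null;
    for (Range r : sorted) {
      long newEnd = cur == null ? 0 : Math.max(cur.end, r.end());
      if (cur != null && r.offset - cur.end <= maxGap && newEnd - cur.offset <= maxSize) {
        cur.end = newEnd;
        cur.members.add(r);
      } else {
        cur = new Combined(r);
        out.add(cur);
      }
    }
    return out;
  }

  /** Stand-in for one asynchronous ranged GET of a merged span. */
  static CompletableFuture<ByteBuffer> fetch(byte[] object, Combined c) {
    return CompletableFuture.supplyAsync(() -> {
      int len = (int) (c.end - c.offset);
      byte[] buf = new byte[len];
      System.arraycopy(object, (int) c.offset, buf, 0, len);
      return ByteBuffer.wrap(buf);
    });
  }

  /** Return a view of the merged buffer covering just one original range. */
  static ByteBuffer slice(ByteBuffer merged, Combined c, Range r) {
    ByteBuffer view = merged.duplicate();
    int start = (int) (r.offset - c.offset);
    view.limit(start + r.length);
    view.position(start);
    return view.slice();
  }

  public static void main(String[] args) {
    byte[] object = new byte[128];
    for (int i = 0; i < object.length; i++) {
      object[i] = (byte) i;
    }
    List<Range> ranges = Arrays.asList(new Range(100, 8), new Range(0, 4), new Range(6, 4));
    for (Combined c : merge(ranges, 4, 64)) {
      ByteBuffer whole = fetch(object, c).join();
      for (Range r : c.members) {
        ByteBuffer part = slice(whole, c, r);
        System.out.println("range@" + r.offset + " -> " + part.remaining() + " bytes");
      }
    }
  }
}
```

Slicing returns views over the merged buffer rather than copies, so the bytes in the gap between two merged ranges are fetched but never handed to the caller.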

part of HADOOP-18103. Add support for a multiple-range vectored read API in PositionedReadable. The default iterates through the ranges to read each synchronously, but the intent is that FSDataInputStream subclasses can make more efficient readers, especially in object store implementations. Also added an implementation in S3A where smaller ranges are merged and sliced byte buffers are returned to the readers. All the merged ranges are fetched from S3 asynchronously. Contributed by Owen O'Malley and Mukund Thakur. Conflicts: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java, pom.xml


omalley and others added 14 commits January 19, 2022 10:39

async api to throw IOE and basic S3A implementation

31f1ad7

Vectored Read API spec

7068e36

Adding contract tests for vectored read API

acca6e4

Move benchmark test to hadoop-tools module

2505058

Review comments

268bf0e

Merging of ranges in S3A vectored read implementation

3b653ff

Implementing change detection for vectored reads in S3a

f32aa4f

More tests and javadoc

0a3882f

Cleaning the s3objets else connections were getting exhausted

e9f1282

Adding retries in getS3Object

09e0cba

Fix vectored read for bigger size files

dc70420

Test for bigger file size

bdf4006

Review comments by Steve and Mehakmeet

d8c3950

mukund-thakur requested a review from steveloughran January 19, 2022 08:00

mukund-thakur mentioned this pull request Jan 19, 2022

HADOOP-11867. Add a high performance vectored read API to file system. #3499

Closed


Fix in slice of byte buffers.

642a6c1

More tests and checkstyle.

steveloughran reviewed Jan 21, 2022


mukund-thakur added 4 commits January 24, 2022 15:44

Minor fixes to trigger yetus again.

5840ee9

This should fix Yetus findbugs

bcbfbbb

Checkstyle and javadoc

7908d6f

Increasing yetus timeout

326557d

steveloughran changed the title ~~HADOOP-11867. Add a high performance vectored read API to file system.~~ HADOOP-11867. Add a high performance vectored read API. Jan 28, 2022

steveloughran changed the title ~~HADOOP-11867. Add a high performance vectored read API.~~ HADOOP-11867. Add a high-performance vectored read API. Jan 28, 2022

mukund-thakur self-assigned this Feb 1, 2022

mukund-thakur merged commit ac08a25 into apache:feature-vectored-io Feb 1, 2022

xkrogen mentioned this pull request Jun 29, 2022

HADOOP-18315. Fix 3.3 build problems caused by backport of HADOOP-11867. #4511

Open
