HADOOP-11867. Add a high-performance vectored read API. #3904

mukund-thakur
Contributor

@mukund-thakur mukund-thakur commented Jan 19, 2022

Description of PR

Adds support for an asynchronous API in PositionedReadable that reads multiple ranges in one call. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses can provide more efficient readers, especially in object store implementations.
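
A minimal sketch of the fallback behaviour described above, for illustration only. The names `Range`, `InMemoryStream` and this `readVectored` signature are hypothetical stand-ins and are not taken from the patch; the stream is backed by an in-memory array so the example is self-contained. Each range carries a future, and the fallback simply completes the futures one after another with a positioned read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; these names are illustrative, not the Hadoop API. */
final class DefaultVectoredReadSketch {

  /** A requested byte range plus a future that completes with its data. */
  static final class Range {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Stand-in for a positioned-read stream, backed by an in-memory array. */
  static final class InMemoryStream {
    private final byte[] contents;

    InMemoryStream(byte[] contents) {
      this.contents = contents;
    }

    /** Positioned read: fills the whole buffer or throws. */
    void readFully(long position, ByteBuffer buffer) {
      int length = buffer.remaining();
      if (position < 0 || position + length > contents.length) {
        throw new IndexOutOfBoundsException("range past end of data");
      }
      buffer.put(contents, (int) position, length);
      buffer.flip();
    }

    /** Fallback: read each range in turn, completing its future before moving on. */
    void readVectored(List<Range> ranges) {
      for (Range range : ranges) {
        try {
          ByteBuffer buffer = ByteBuffer.allocate(range.length);
          readFully(range.offset, buffer);
          range.data.complete(buffer);
        } catch (RuntimeException e) {
          // A failed range fails only its own future.
          range.data.completeExceptionally(e);
        }
      }
    }
  }

  public static void main(String[] args) {
    byte[] bytes = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    InMemoryStream in = new InMemoryStream(bytes);
    List<Range> ranges = Arrays.asList(new Range(2, 3), new Range(10, 4));
    in.readVectored(ranges);
    for (Range range : ranges) {
      ByteBuffer buf = range.data.join();
      System.out.println(
          new String(buf.array(), buf.position(), buf.remaining(), StandardCharsets.US_ASCII));
    }
  }
}
```

A subclass with a cheaper way to fetch ranges (for example, issuing the reads in parallel) could override `readVectored` in this sketch while callers keep waiting on the same futures.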

How was this patch tested?

Added benchmarks.
Added unit tests.
Added new contract tests for the new API spec.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@mukund-thakur
Contributor Author

This is based on #3499. The plan is to merge this as a base commit in the feature branch and then split the remaining work (other optimisations and features) into new PRs, each with its own JIRA.

More tests and checkstyle.
Contributor

@steveloughran steveloughran left a comment

Apart from minor details, this is ready to go into the branch for testing/tuning.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|:--------|:--------|:--------|
| +0 🆗 | reexec | 0m 44s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 1s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 1s | | codespell was not available. |
| +0 🆗 | shelldocs | 0m 1s | | Shelldocs was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 8 new or modified test files. |
| | | | | _ feature-vectored-io Compile Tests _ |
| +0 🆗 | mvndep | 13m 7s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 24m 37s | | feature-vectored-io passed |
| +1 💚 | compile | 26m 10s | | feature-vectored-io passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | compile | 21m 16s | | feature-vectored-io passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 3m 41s | | feature-vectored-io passed |
| +1 💚 | mvnsite | 27m 27s | | feature-vectored-io passed |
| +1 💚 | javadoc | 8m 34s | | feature-vectored-io passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | javadoc | 8m 32s | | feature-vectored-io passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +0 🆗 | spotbugs | 0m 21s | | branch/hadoop-project no spotbugs output file (spotbugsXml.xml) |
| +1 💚 | shadedclient | 55m 25s | | branch has no errors when building and testing our client artifacts. |
| -0 ⚠️ | patch | 55m 47s | | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 48s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 29m 48s | | the patch passed |
| +1 💚 | compile | 23m 59s | | the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | javac | 23m 59s | | the patch passed |
| +1 💚 | compile | 21m 11s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | javac | 21m 11s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 3m 37s | /results-checkstyle-root.txt | root: The patch generated 1 new + 78 unchanged - 3 fixed = 79 total (was 81) |
| +1 💚 | mvnsite | 22m 54s | | the patch passed |
| +1 💚 | shellcheck | 0m 0s | | No new issues. |
| +1 💚 | xml | 0m 10s | | The patch has no ill-formed XML file. |
| +1 💚 | javadoc | 8m 36s | | the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | javadoc | 8m 32s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +0 🆗 | spotbugs | 0m 21s | | hadoop-project has no data from spotbugs |
| +1 💚 | shadedclient | 56m 3s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 ❌ | unit | 794m 56s | /patch-unit-root.txt | root in the patch passed. |
| +1 💚 | asflicense | 1m 38s | | The patch does not generate ASF License warnings. |
| | | 1196m 55s | | |

| Reason | Tests |
|:-------|:------|
| Failed junit tests | hadoop.yarn.csi.client.TestCsiClient |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3904/5/artifact/out/Dockerfile |
| GITHUB PR | #3904 |
| Optional Tests | dupname asflicense codespell shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xml |
| uname | Linux f5cb218f3898 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | feature-vectored-io / 326557d |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3904/5/testReport/ |
| Max. process+thread count | 3075 (vs. ulimit of 5500) |
| modules | C: hadoop-project hadoop-common-project/hadoop-common hadoop-common-project hadoop-tools/hadoop-aws hadoop-tools/hadoop-benchmark hadoop-tools . U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3904/5/console |
| versions | git=2.25.1 maven=3.6.3 shellcheck=0.7.0 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |

This message was automatically generated.

@mukund-thakur
Contributor Author

Yetus is failing because of one known test failure: https://issues.apache.org/jira/browse/YARN-10788

./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java:551: ChecksumFSOutputSummer(ChecksumFileSystem fs,:5: More than 7 parameters (found 8). [ParameterNumber]

Also, this one checkstyle error was not introduced by my patch; it was already there.

I think it is time to merge this now.

@steveloughran steveloughran changed the title from "HADOOP-11867. Add a high performance vectored read API to file system." to "HADOOP-11867. Add a high performance vectored read API." Jan 28, 2022
@steveloughran steveloughran changed the title from "HADOOP-11867. Add a high performance vectored read API." to "HADOOP-11867. Add a high-performance vectored read API." Jan 28, 2022
@mukund-thakur mukund-thakur self-assigned this Feb 1, 2022
@steveloughran
Contributor

steveloughran commented Feb 1, 2022

+1 to merge into this branch as feature complete; now we follow up with the final integration work.

  • Remove the extended timeout and let's see if things are good. If not, the change can go in as a single patch into trunk.
  • This PR can/should still be rebased as relevant changes go in; it just needs to be co-ordinated with anyone else who has the branch checked out.
  • But I do want the patches to go in as independent commits for better tracing, with "part of HADOOP-11867" in each patch so a git log --grep will find them all.

@mukund-thakur mukund-thakur merged commit ac08a25 into apache:feature-vectored-io Feb 1, 2022
mukund-thakur added a commit that referenced this pull request Feb 22, 2022
part of HADOOP-18103.
Add support for a multiple-range vectored read API in PositionedReadable.
The default iterates through the ranges to read each synchronously,
but the intent is that FSDataInputStream subclasses can make more
efficient readers, especially in object store implementations.

Also added an implementation in S3A where smaller ranges are merged and
sliced byte buffers are returned to the readers. All the merged ranges are
fetched from S3 asynchronously.

Contributed By: Owen O'Malley and Mukund Thakur
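
As a rough illustration of the merge-and-slice idea in the commit message above, here is a self-contained sketch. It is not the S3A code: the class and method names, and the single `maxGap` threshold used to decide when two ranges are close enough to combine, are assumptions made for this example. Nearby ranges are coalesced into one larger read, and each caller then receives a slice of the combined buffer rather than a copy.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of range merging and buffer slicing; not the S3A implementation. */
final class RangeMergeSketch {

  /** A byte range requested by the caller. */
  static final class Range {
    final long offset;
    final int length;

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }

    long end() {
      return offset + length;
    }
  }

  /** One combined read covering one or more of the caller's ranges. */
  static final class CombinedRange {
    final long offset;
    long end;
    final List<Range> children = new ArrayList<>();

    CombinedRange(Range first) {
      this.offset = first.offset;
      this.end = first.end();
      children.add(first);
    }
  }

  /** Sorts by offset, then merges ranges whose gap to the previous one is at most maxGap. */
  static List<CombinedRange> merge(List<Range> ranges, long maxGap) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong((Range r) -> r.offset));
    List<CombinedRange> result = new ArrayList<>();
    CombinedRange current = null;
    for (Range r : sorted) {
      if (current != null && r.offset - current.end <= maxGap) {
        current.end = Math.max(current.end, r.end());
        current.children.add(r);
      } else {
        current = new CombinedRange(r);
        result.add(current);
      }
    }
    return result;
  }

  /** Returns a view of the child's bytes within the data read for the combined range. */
  static ByteBuffer slice(ByteBuffer combinedData, CombinedRange combined, Range child) {
    ByteBuffer view = combinedData.duplicate();
    int start = (int) (child.offset - combined.offset);
    view.limit(start + child.length);
    view.position(start);
    return view.slice();
  }

  public static void main(String[] args) {
    List<Range> requested =
        Arrays.asList(new Range(0, 100), new Range(110, 50), new Range(5000, 10));
    for (CombinedRange c : merge(requested, 64)) {
      System.out.println("read [" + c.offset + ", " + c.end + ") for "
          + c.children.size() + " range(s)");
    }
  }
}
```

In this sketch the three requested ranges collapse into two reads, because the 10-byte gap between the first two is below the threshold while the third is far away. Fetching the combined ranges asynchronously is left out to keep the example short.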
mukund-thakur added a commit that referenced this pull request May 2, 2022
mukund-thakur added a commit that referenced this pull request Jun 15, 2022
mukund-thakur added a commit that referenced this pull request Jun 21, 2022
asfgit pushed a commit that referenced this pull request Jun 22, 2022
mukund-thakur added a commit that referenced this pull request Jun 27, 2022

 Conflicts:
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
	pom.xml
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022