Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HADOOP-11867. Add a high performance vectored read API to file system. #3499

Closed

Conversation

mukund-thakur
Copy link
Contributor

@mukund-thakur mukund-thakur commented Sep 29, 2021

Rebased work on top of #1830

Conflicts:
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
pom.xml

Description of PR

Adding support for multiple ranged read async api in PositionedReadable. The default iterates through the ranges to read each synchronously, but the intent is that FSDataInputStream subclasses can make more efficient readers especially object stores implementation.

How was this patch tested?

Added benchmarks.
Added UT's
Added new contract tests for new API spec.

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@mukund-thakur mukund-thakur marked this pull request as draft September 29, 2021 11:48
@mukund-thakur
Copy link
Contributor Author

@steveloughran
Copy link
Contributor

Quick review.

  • needs filesystem spec

  • and a contract test

  • test to include
    -overlapping ranges
    -beyond file length
    -negative values
    -on heap and off heap buffers

  • be good to have a PoC of the s3a one, so that we can happy the API xfers to the store

  • and some ORC/parquet branch which shows it can be used

  • is there any API we'd want, e.g an fs command which does a single async ranged get

hadoop fs ranged -start 0 -end 345  -out output.txt s3a://file

would be good for QE tests

  • and we should have a path capability. Anything which supports seek will also do this (so ftp won't)

Copy link
Contributor

@bogthe bogthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial review and looks promising.

I'm super curious to see what an async read will bring in terms of performance improvements for a job (even if it's a quick scrappy implementation in the FS).

@mukund-thakur mukund-thakur marked this pull request as ready for review December 2, 2021 07:14
omalley and others added 7 commits December 2, 2021 12:53
 Conflicts:
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java
	hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
	pom.xml
@mukund-thakur mukund-thakur force-pushed the HADOOP-11867-vectored-io branch from 1a03999 to fa3ebc1 Compare December 2, 2021 09:55
@mukund-thakur mukund-thakur requested a review from omalley December 3, 2021 07:50
@steveloughran steveloughran self-assigned this Dec 3, 2021
@mukund-thakur mukund-thakur added fs fs/s3 changes related to hadoop-aws; submitter must declare test endpoint labels Dec 7, 2021
@steveloughran
Copy link
Contributor

I just realized I had some pending comments on one of the updates; maybe out of date now and they were all minor. Let me re-review.

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's some comments I had outstanding since 2021 .. sorry. will do a bigger review now

@@ -66,7 +74,7 @@
public static double getApproxChkSumLength(long size) {
return ChecksumFSOutputSummer.CHKSUM_AS_FRACTION * size;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all needless and making the patch a bit bigger than it should be. was yetus complaining?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how it is showing up here. In intellJ diff I don't see this.

FileSystem fs = getFileSystem();
List<FileRange> fileRanges = new ArrayList<>();
fileRanges.add(new FileRangeImpl(1293, 25837));
try (FSDataInputStream in = fs.open(path(VECTORED_READ_FILE_1MB_NAME))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use openFile here? just to make sure that codepath is happy

@steveloughran
Copy link
Contributor

OK, I've had a look at the recent changes, and had some shared screen reviews over zoom

here's what I propose

  • create a feature branch in the asf repo where we can rebase/reorder commits
  • get what is done in as the base commit
  • then we can split work on the final changes up into new PRs, each with their own JIRA.

where those changes can be

  • buffer pooling (API to these calls and better weak reference handling in the elastic one or nearby)
  • s3a split handling
  • huge file support (new tests in the s3a huge file suite)

then, once happy we can do a rebase followed by merge commit into trunk/branch-3.

key point: this patch has been going and is ready to stabilize, which can be done with incremental changes in a feature branch.

Copy link
Contributor

@mehakmeet mehakmeet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial Review, looks really good.
I think we should add an integration test around the ordered disjoint list of ranges as well to showcase them not merging and reading or having a few ranges where some are merged and some aren't by validating them.
Also, are both minSeek and maxReadSize going to be configurable? I assume they are hardcoded just so you can test them out..

@mukund-thakur
Copy link
Contributor Author

mukund-thakur commented Jan 19, 2022

Did an initial Review, looks really good. I think we should add an integration test around the ordered disjoint list of ranges as well to showcase them not merging and reading or having a few ranges where some are merged and some aren't by validating them.

Yeah already in my TODO. Thanks.

Also, are both minSeek and maxReadSize going to be configurable? I assume they are hardcoded just so you can test them out..
Yes I will be making them configurable in consecutive PR's

Also Please refer to the new PR now. #3904

@steveloughran
Copy link
Contributor

now this is in the feature branch, this can be closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement fs/s3 changes related to hadoop-aws; submitter must declare test endpoint fs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants