-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-11867. Add a high performance vectored read API to file system. #3499
HADOOP-11867. Add a high performance vectored read API to file system. #3499
Conversation
Quick review.
would be good for QE tests
|
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/PositionedReadable.java
Outdated
Show resolved
Hide resolved
...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/AsyncReaderUtils.java
Outdated
Show resolved
Hide resolved
...-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/CombinedFileRange.java
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an initial review and looks promising.
I'm super curious to see what an async read will bring in terms of performance improvements for a job (even if it's a quick scrappy implementation in the FS).
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
Show resolved
Hide resolved
Conflicts: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java pom.xml
1a03999
to
fa3ebc1
Compare
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
Outdated
Show resolved
Hide resolved
I just realized I had some pending comments on one of the updates; maybe out of date now and they were all minor. Let me re-review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's some comments I had outstanding since 2021 .. sorry. will do a bigger review now
...mon-project/hadoop-common/src/test/java/org/apache/hadoop/fs/impl/TestVectoredReadUtils.java
Show resolved
Hide resolved
...mon-project/hadoop-common/src/test/java/org/apache/hadoop/fs/impl/TestVectoredReadUtils.java
Show resolved
Hide resolved
...-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/VectoredReadUtils.java
Show resolved
Hide resolved
@@ -66,7 +74,7 @@ | |||
public static double getApproxChkSumLength(long size) { | |||
return ChecksumFSOutputSummer.CHKSUM_AS_FRACTION * size; | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all needless and making the patch a bit bigger than it should be. was yetus complaining?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how it is showing up here. In intellJ diff I don't see this.
FileSystem fs = getFileSystem(); | ||
List<FileRange> fileRanges = new ArrayList<>(); | ||
fileRanges.add(new FileRangeImpl(1293, 25837)); | ||
try (FSDataInputStream in = fs.open(path(VECTORED_READ_FILE_1MB_NAME))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use openFile here? just to make sure that codepath is happy
...oop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractVectoredReadTest.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
OK, I've had a look at the recent changes, and had some shared screen reviews over zoom here's what I propose
where those changes can be
then, once happy we can do a rebase followed by merge commit into trunk/branch-3. key point: this patch has been going and is ready to stabilize, which can be done with incremental changes in a feature branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an initial Review, looks really good.
I think we should add an integration test around the ordered disjoint list of ranges as well to showcase them not merging and reading or having a few ranges where some are merged and some aren't by validating them.
Also, are both minSeek and maxReadSize going to be configurable? I assume they are hardcoded just so you can test them out..
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
...oop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractVectoredReadTest.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/MoreAsserts.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
...mmon/src/test/java/org/apache/hadoop/fs/contract/localfs/TestLocalFSConractVectoredRead.java
Show resolved
Hide resolved
Yeah already in my TODO. Thanks.
Also Please refer to the new PR now. #3904 |
now this is in the feature branch, this can be closed |
Rebased work on top of #1830
Conflicts:
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
pom.xml
Description of PR
Adding support for multiple ranged read async api in PositionedReadable. The default iterates through the ranges to read each synchronously, but the intent is that FSDataInputStream subclasses can make more efficient readers especially object stores implementation.
How was this patch tested?
Added benchmarks.
Added UT's
Added new contract tests for new API spec.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?