HADOOP-11867: Add gather API to file system. #1830
base: trunk
Conversation
Sweet.
- What are the benchmark numbers?
- It's going to need some docs in inputstream.md.
- It'd be good to PoC an object store: s3a, ozone, abfs...
- And we will need to think about some contract tests to test/break the implementations.
I presume this design will suit ORC?
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
```java
}

/**
 * Find the checksum ranges that correspond to the given data ranges.
```
It would be nice to explain why this is needed, for those of us who don't normally go near this file.
Why what is needed? You mean the code to compare the checksums? The current code assumes a lot of context that doesn't hold in the new API, and it is very inefficient because it did a bad job of working around those limitations. In particular, if you look at the current pread code, it reopens the crc file for each seek.
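To make the comment above concrete, here is a hedged sketch of the arithmetic involved: a checksum file stores one fixed-size CRC per chunk of data bytes, so a data byte range maps to a (usually much smaller) byte range in the crc file. The class and method names, and the constant, are illustrative only, not the actual Hadoop implementation.

```java
// Hypothetical sketch: map a data byte range to the byte range in the
// crc file that covers it. One CHECKSUM_SIZE-byte CRC covers each
// bytesPerSum-byte chunk of data.
final class ChecksumRanges {
  static final int CHECKSUM_SIZE = 4;   // bytes per CRC32 value

  /** Offset into the crc file of the checksum covering dataOffset. */
  static long checksumOffset(long dataOffset, int bytesPerSum) {
    return (dataOffset / bytesPerSum) * CHECKSUM_SIZE;
  }

  /** Number of checksum bytes covering [dataOffset, dataOffset + len). */
  static long checksumLength(long dataOffset, long len, int bytesPerSum) {
    long firstChunk = dataOffset / bytesPerSum;
    long lastChunk = (dataOffset + len - 1) / bytesPerSum;
    return (lastChunk - firstChunk + 1) * CHECKSUM_SIZE;
  }
}
```

Because adjacent data ranges often fall into the same checksum chunk, computing these ranges up front lets the implementation read the crc file in a few large requests instead of reopening and seeking per range.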
```java
 * @return the minimum number of bytes
 */
default int minimumReasonableSeek() {
  return 4 * 1024;
```
These should really be constants somewhere, even if within this interface.
OK, although the difference between having the constant inline in the method versus defined and used once is pretty minor.
Those constants should be determined by testing on each of the different file systems. I suspect the minimum seek on the local FS < HDFS < S3.
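For illustration, the refactoring being discussed could look like the sketch below. The interface here is a minimal stand-in, not the real Hadoop `PositionedReadable`, and the 4 KiB value is just the default quoted in the hunk above; as the comment says, the real numbers should come from per-filesystem testing.

```java
// Illustrative only: hoist the inline literal into a named constant on
// the interface, so subclasses and tests can reference the same value.
interface PositionedReadable {
  /** Default gap below which it is cheaper to read through than to seek. */
  int MINIMUM_REASONABLE_SEEK = 4 * 1024;

  default int minimumReasonableSeek() {
    return MINIMUM_REASONABLE_SEEK;
  }
}
```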
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/PositionedReadable.java
...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/AsyncReaderUtils.java
...mmon-project/hadoop-common/src/test/java/org/apache/hadoop/fs/impl/TestAsyncReaderUtils.java
The benchmark numbers are posted on the jira. You'll need to help with the spec that you've developed in fsdatainputstream.md. Fundamentally, the new call is logically the same as the input ranges being read using pread in an undefined order. And yes, I believe this structure will work well for ORC (and likely Parquet).
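That logical equivalence can be sketched as follows: each requested range behaves as if it were a separate positioned read, with no ordering guarantee between ranges. This is a simplified model using `RandomAccessFile` and `long[]{offset, length}` pairs rather than the patch's actual types.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.List;

// Hedged sketch of the semantics of the vectored call: equivalent to a
// pread of each range, in an unspecified order.
final class SyncEquivalent {
  /** Read each {offset, length} range fully into its own buffer. */
  static void readRanges(RandomAccessFile file, List<long[]> ranges,
                         List<ByteBuffer> results) throws IOException {
    for (long[] r : ranges) {
      byte[] data = new byte[(int) r[1]];
      file.seek(r[0]);              // one "pread" per range
      file.readFully(data);
      results.add(ByteBuffer.wrap(data));
    }
  }
}
```

The point of the new API is that an implementation is free to reorder, merge, or issue these reads in parallel, as long as each range's buffer ends up with the same bytes this loop would produce.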
@omalley, you still working on this?
Force-pushed from 7b5a300 to c988142.
```java
int requestLength = request.getLength();
// If we need to change the offset or length, make a copy and do it
if (offsetChange != 0 || readData.remaining() != requestLength) {
  readData = readData.slice();
```
The new name should be readDataCopy, or anything better, just to be sure that we are not changing the original buffer.
But it isn't copying the data. It is much closer to ByteBuffer's slice, which gives a second view onto the same data buffer. So you get a new ByteBuffer object that shares the same underlying memory.
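A small demonstration of that point: writes through a `ByteBuffer.slice()` are visible through the original buffer, because no bytes are copied. The helper class here is just for illustration.

```java
import java.nio.ByteBuffer;

// slice() does not copy: it returns a second view over the same
// underlying memory, starting at the buffer's current position.
public class SliceDemo {
  /** Position the buffer and return a slice sharing its memory. */
  static ByteBuffer sliceAt(ByteBuffer buf, int position) {
    buf.position(position);
    return buf.slice();   // slice index 0 aliases buf index `position`
  }
}
```

So renaming the variable to something ending in "Copy" would actually be misleading; it is a view, not a copy.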
Hey @omalley, thanks for the update. Could you do anything with the fields in AsyncBenchmark? They are flooding Yetus.
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Yeah, I just added a suppression file for findbugs that hopefully will make Yetus happy. Sigh: findbugs and generated code are not a good combination.
💔 -1 overall
This message was automatically generated.
I am trying to compile and run the benchmark that was added. I am using this command. Also, when I try to run the same using the IDE while selecting the bundled JRE, it works fine. Is there anything specific I have to do before running the benchmark? FYI, related: https://stackoverflow.com/questions/61267495/exception-in-thread-main-java-lang-nosuchmethoderror-java-nio-bytebuffer-flip Thanks.
@mukund-thakur: build and test with the same JDK; Java 9+ added some overloaded methods to ByteBuffer. If code has been built against a newer JVM than the one you test against, you will get link problems. Warning: some OpenJDK 8 builds (Amazon Corretto) have the overloaded methods, so they cannot be used to build things you intend to run elsewhere. Recommend you set JAVA_HOME to point to the Java version you want and run maven builds on the command line.
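The failure mode being described: Java 9 gave `ByteBuffer` covariant overrides of methods like `flip()`, so code compiled against JDK 9+ binds the call to the `ByteBuffer`-returning variant, which does not exist on a JDK 8 runtime and fails with `NoSuchMethodError`. A common workaround (a sketch, not anything from this patch) is to cast to `Buffer` at the call site so the compiled method descriptor is the same on both runtimes:

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

// Sketch of a JDK 8/9+ portable flip: casting to Buffer makes the
// compiler emit the Buffer.flip() descriptor, which exists everywhere.
public class FlipCompat {
  static ByteBuffer fill(byte[] data) {
    ByteBuffer buf = ByteBuffer.allocate(data.length);
    buf.put(data);
    ((Buffer) buf).flip();   // portable across JDK 8 and 9+
    return buf;
  }
}
```

Building and running with the same JDK, as suggested above, avoids the problem without any source change.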
Thanks @steveloughran. It works after setting JAVA_HOME explicitly to 1.8.
I have one question: why is merging of ranges done for ChecksumFileSystem but not for RawLocalFileSystem?
```java
}
stream.readAsync(ranges, bufferChoice.allocate);
for (FileRange range : ranges) {
  blackhole.consume(range.getData().get());
```
Could blackhole.consume(ranges); be used instead?
Add an asynchronous gather read API to PositionedReadable.
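To show the shape of the feature end to end, here is a self-contained toy model of the gather call as it appears in the benchmark snippet above: each range carries a future that completes when its bytes arrive. `FileRange` and `readAsync` here are simplified stand-ins written for this sketch, not the patch's actual classes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

// Toy model of an asynchronous gather read: one call submits many
// ranges; each range's future completes independently.
final class GatherDemo {
  static final class FileRange {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();
    FileRange(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  static void readAsync(RandomAccessFile file, List<FileRange> ranges,
                        IntFunction<ByteBuffer> allocate) {
    for (FileRange r : ranges) {
      CompletableFuture.runAsync(() -> {
        try {
          ByteBuffer buf = allocate.apply(r.length);
          byte[] bytes = new byte[r.length];
          synchronized (file) {      // RandomAccessFile is not thread-safe
            file.seek(r.offset);
            file.readFully(bytes);
          }
          buf.put(bytes);
          ((java.nio.Buffer) buf).flip();
          r.data.complete(buf);
        } catch (IOException e) {
          r.data.completeExceptionally(e);
        }
      });
    }
  }
}
```

A caller then iterates the ranges and blocks on each future only when it actually needs the data, which is exactly the pattern the benchmark loop above uses.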