-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-19389: Optimize shell -text command I/O with multi-byte read. #7291
base: trunk
Are you sure you want to change the base?
Conversation
🎊 +1 overall
This message was automatically generated. |
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java
Show resolved
Hide resolved
...p-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/shell/TestTextCommand.java
Show resolved
Hide resolved
Hey @cnauroth! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no comments on the main code, more on tests.
I'd like this new code to become a contract test, so we can make sure it goes ok through cloud stores and hdfs
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java
Outdated
Show resolved
Hide resolved
@@ -67,39 +75,194 @@ public void testDisplayForAvroFiles() throws Exception { | |||
assertEquals(expectedOutput, output); | |||
} | |||
|
|||
@Test | |||
public void testEmptyAvroFile() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to suggest something different: add an abstract contract test for this command and implement it in file:, hdfs, s3a and azure -then in gcs once that is in.
That way we can see how well/badly those stores cope with this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unclear about bringing this into a contract test. These tests are covering semantics very specific to hadoop fs -text
. The expectation is that the command can decode Avro and sequence files and print them in JSON/text representation. At the FileSystem
layer, it would just be returning the raw bytes. Existing contract tests cover "input bytes == output bytes". If hdfs/s3a/azure/gcs get that part right, then the hadoop fs -text
behavior falls into place regardless of the FS.
However, maybe something we can do is expand contract coverage of reads in a more generic way. For example, we could cover things like the bad argument checks and "same data returned from read() vs. read(byte[]...)during
AbstractContractOpenTest. Maybe some updates to the
FSDataInputStream` specification docs too.
I'd recommend driving all this in a separate patch though.
...p-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/shell/TestTextCommand.java
Outdated
Show resolved
Hide resolved
@Test | ||
public void testAvroFileInputStreamZeroLengthRead() throws Exception { | ||
createFile(AVRO_FILENAME, generateWeatherAvroBinaryData()); | ||
URI uri = new URI(AVRO_FILENAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a field for this as it is used everywhere...fix in setup along with avro filename, which will now have to be on the target fs of the contract test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like you would like to see something here lifted to a @Before
. The uri
gets initialized to different values though in different tests, so I'm not sure I can make that comm in a @Before
.
...p-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/shell/TestTextCommand.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Display.java
Outdated
Show resolved
Hide resolved
|
||
private static final String SEPARATOR = System.getProperty("line.separator"); | ||
|
||
@Rule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you make this a contract test, there's a rule in the superclass along with some thread naming
30b8c89
to
124509c
Compare
Thanks, @steveloughran ! I entered a question about the contract test feedback. For all other feedback, I've updated the patch. |
🎊 +1 overall
This message was automatically generated. |
124509c
to
46b95de
Compare
I pushed one more update for checkstyle warnings. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
The remaining checkstyle issues getting flagged are "variable must be private and have accessor methods." These are on inner classes, and I think only getting flagged due to me doing some refactoring improvements, so I'm not planning on addressing these. |
46b95de
to
d1f0ecf
Compare
Actually, those last few Checkstyle warnings were pretty easy to address, so I pushed up one more update. |
🎊 +1 overall
This message was automatically generated. |
Description of PR
hadoop fs -text
reads Avro files and sequence files by internally wrapping the stream inAvroFileInputStream
orTextRecordInputStream
. These classes implement the required single-byteread()
, but not the optional multi-byte bufferedread(byte[], int, int)
. The default implementation in the JDK is a loop over single-byte read, which causes sub-optimal I/O and method call overhead. We can optimize this by overriding the multi-byte read method.How was this patch tested?
Multiple new unit tests cover expectations for both single-byte and multi-byte read. I ran these tests both before and after the code change to make sure there is no unexpected behavioral change.
Here is some benchmarking I ran to test the effects of this patch. This shows a ~1.7x throughput improvement for sequence files.
The improvement is more modest for Avro, where there are unrelated overheads from deserializing and reserializing to JSON:
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?