-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-18543. AliyunOSSFileSystem#open(Path path, int bufferSize) use buffer size as its downloadPartSize #5172
base: trunk
Are you sure you want to change the base?
Conversation
🎊 +1 overall
This message was automatically generated. |
this.store = store; | ||
this.key = key; | ||
this.statistics = statistics; | ||
this.contentLength = contentLength; | ||
downloadPartSize = conf.getLong(MULTIPART_DOWNLOAD_SIZE_KEY, | ||
MULTIPART_DOWNLOAD_SIZE_DEFAULT); | ||
this.downloadPartSize = downloadPartSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want to have a minimum size for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'd like use IO_FILE_BUFFER_SIZE_DEFAULT(4KB) as its min size, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i owrry that the downloadpart size of kb is way too small for efficient http GET requests; kilobytes to megabytes are better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could see different performance between uploading/requesting 4KB and 4MB ?
In my some cases, some data are orgnazied with unit of ~16KB, and I will read them randomly.
In this case, I am sure what I need is just these KB, more data will cost more time and bandwidth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its about the efficiencies of a GET call, overhead of creating HTTPS connections etc. which comes down to
- reading a whole file is the wrong strategy for random IO formats (ORC, parquet)
- random IO/small ranged GETs wrong for whole files.
- even with random io, KB is way too small.
This is why there's a tendency for the stores to do adaptive "first backwards/big forward seek means random IO", and since HADOOP-16202 let caller declared read policy.
If the existing code was changed to say "we set GET range to be the buffer size you passed in on open()", then everyone's existing code is going to suffer really badly on performance.
… buffer size as its downloadPartSize
7580a7c
to
76ce3cb
Compare
🎊 +1 overall
This message was automatically generated. |
sorry, but I'm going to say -1 to using the normal IO buffer size as the GET range. The default value of 4k is way too small even for parquet/orc reads, it will break all existing apps in performance terms: distcp, parquet library, avro, ORC, everything, as they all use the default value.
|
(note also that includes letting you declare read policy (whole-file, sequential, random, vectored....that can be used to change default block size too) |
Thanks a lot for your detailed explaination and suggestion. |
This is exactly what the API was designed for -to let people provide extra hints/options. to the object stores. Do read the filesystem.md and related docs on the topic, and look at the abfs implementation as well as the s3a one. If your implementation takes a FileStatus or length option, it can then skip the HEAD request on opening and save time and money. All the hadoop internal uses of openFile() do this. My own copies of the parquet and avro readers also do it for better cloud reading performance |
Description of PR
In our application, different components have their own suitable buffer size to download.
But currently, AliyunOSSFileSystem#open(Path path, int bufferSize) just get downloadPartSize from configuration.
We cannnot use different value for different components in our programs.
I think we should the method should use the buffer size from the paramater.
AliyunOSSFileSystem#open(Path path) could have default value as current default downloadPartSize.
How was this patch tested?
Added method TestAliyunOSSInputStream#testConfiguration to test.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?