AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance #11112
Conversation
Left some comments, but I had a possibly naive question just to check my understanding:
In the past for the object storage provider, we've used a wider character set in the hash portion of the file path as a means to maximize entropy and ultimately improve heat distribution (#7128). With this new approach, are we saying we can get a good enough heat distribution while also enabling S3 to scale capacity more quickly?
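To make the trade-off in this question concrete, here is a minimal sketch (not the provider's actual code; the murmur3 hash and the radix-32 rendering are stand-ins for the existing wider-character encoding): the same 32 bits of entropy can be rendered with a wide character set as a short prefix, or in base2 as a longer prefix built only from '0' and '1'.

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.Hashing;

    class HashPrefixSketch {
      public static void main(String[] args) {
        String fileName = "00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet";
        int hash = Hashing.murmur3_32_fixed().hashString(fileName, StandardCharsets.UTF_8).asInt();
        int positive = hash & 0x7FFFFFFF; // drop the sign bit so both renderings are comparable
        // Wider character set: roughly 6-7 characters carry the entropy (radix 32 as a stand-in).
        System.out.println(Integer.toString(positive, 32));
        // Base2: the same entropy spelled out as a much longer run of '0'/'1' characters.
        System.out.println(Integer.toBinaryString(positive));
      }
    }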
aws/src/main/java/org/apache/iceberg/aws/s3/S3LocationProvider.java
    private String computeHash(String fileName) {
      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
      int hash = hashCode.asInt();
Nit: I think we could just inline hashCode.asInt() below when computing the binaryString.
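For illustration, the inlined form the nit is pointing at might look roughly like this (a sketch only; HASH_FUNC comes from the file under review, and any padding or truncation of the binary string is omitted here):

    String binaryString =
        Integer.toBinaryString(HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8).asInt());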
Works, updating @amogh-jahagirdar
aws/src/main/java/org/apache/iceberg/aws/s3/S3LocationProvider.java
     * s3://my-bucket/my-table/data/011101101010001111101000-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
     * </code>.
     */
    public class S3LocationProvider implements LocationProvider {
Why not just include this in the ObjectStoreLocationProvider as an option to choose between base2 and base32? It seems like we could include most of this in just the computeHash function. That might also allow for choosing whether to include partition context.
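A hypothetical sketch of that suggestion, keeping the existing provider and switching the encoding inside computeHash() (the useBase2Entropy flag and the radix-32 call are assumptions, not the actual implementation; HASH_FUNC is from the file under review):

    private String computeHash(String fileName) {
      int hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8).asInt();
      // useBase2Entropy would be derived from a table property chosen when the provider is created.
      return useBase2Entropy
          ? Integer.toBinaryString(hash)
          : Integer.toString(hash & 0x7FFFFFFF, 32); // stand-in for the existing wider-character encoding
    }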
That was our original plan, but with the partition values and other trailing directories removed, it turns into a very S3-specific location provider for the reasons mentioned in the above comment. I agree that if it were just the base2 optimization, it would probably fit into the ObjectStoreLocationProvider with one extra table property.
@ookumuso Overall, this looks like a great feature if this is better for S3 to repartition and distribute data, but it also seems like it would fit cleanly into the existing ObjectStoreLocationProvider as opposed to a separate provider.
     * </ol>
     *
     * The data file is placed immediately under the data location. Partition names are <b>not</b>
     * included. The data filename is prefixed with a 24-character binary hash, which ensures that files
The 24-character prefix and lack of a directory marker (/) are a little problematic for older clients (like HadoopFS) and procedures like orphan files, because anything that needs to list that directory will run into problems with the key space. We already have an issue with this with the current provider, but this would compound that problem since there's no hierarchy.
This would make it a little easier to break up prefix-separated ranges for listing (especially due to the low character limit), but that's something we would need to look into on the orphan files action.
Is the problem coming from the delimited list, or is it some type of parallelization issue due to the lack of directories? I was thinking that not having the directory essentially removes the requirement for discovering them, and it can just be a flat list to discover everything under the write.data.path.
The issue is having too much cardinality for FS implementations that think in terms of directory structure. By breaking this up we can address both scenarios (file-system-based and object store listing) at the same time.
Thanks for the review @danielcweeks! See my comment here regarding why we chose to go with a separate provider. It was essentially done to capture optimizations for all bucket types in one place without having an impact on the existing default ObjectStoreLocationProvider.
@danielcweeks let me know what you think about this. One alternative is to provide both: keep S3LocationProvider as is and add a base2 option for the ObjectStoreLocationProvider, to keep the option of partition values along with directories.
@ookumuso I'm still of the opinion that we should just incorporate this into the existing object store location provider. Ultimately, we can change the hashing behavior, as the existing behavior is based on earlier discussions with S3, so replacing it or providing alternative options is fine. I do think partition values in the path still need to be an option. I think we also need to consider adding a path separator to help with some of the maintenance routines. For example
@danielcweeks Understood, thanks for the feedback Daniel. Looks like it is not going to be feasible for us to remove it.
For partition values, I can probably send a separate follow-up as an option to remove them from the file name with a new table property so callers can decide, but I'm planning to exclude that for now.
Sorry for the late review, was busy with some internal work...
+1 for that. I think the main concern was that this seems to be too S3-specific, but if there is no major concern from others I am good with changing the ObjectStoreLocationProvider directly; that actually helps adoption.
I reviewed the internal experiments related to this, and I believe that having no path separators as well as having at least 24 bits are both important based on load testing. If the concern is compatibility with other systems, would it make sense to only use this approach for locations that begin with
@danielcweeks @amogh-jahagirdar @jackye1995 As discussed above, I made the change to update the existing ObjectStoreLocationProvider to use base2 entropy and also added a new config to omit the partition values. I set it to false by default to keep it backwards compatible, but let me know whether you would prefer it to be true, to prevent customers from relying on partition values in the file path. I kept the "/" delimiter in place due to the concerns raised by @danielcweeks, but I would like to understand what it would take to remove it as a separate thread, potentially following up in a separate PR. Orphan clean-up was one thing identified, but I am curious whether there are any other things we need to consider to make it a viable change.
Looks good to me. I think there are 2 pending topics, and I am approving since I am good with the current way it is written:
- should we do it for all paths, or just limit to s3:// and s3a:// paths: I think given that even the documentation of this is in AWS, it is probably fine to just apply this strategy for all users of the object storage location provider.
- should we add / in the middle of the binary hash values: that would mitigate the directory listing issue, but would impact the ability to scale up S3 due to the additional slashes in the middle. My understanding is that the directory listing concern could be resolved if we move all HDFS-based listing to use prefix-based listing, and the team can take that as a follow-up item.
Any further thoughts on the latest approach and the 2 points above? @danielcweeks @amogh-jahagirdar @nastra
Any additional comments? @amogh-jahagirdar @nastra @danielcweeks
I think we should treat all of the prefixes the same.
I think we should add a single /. That slash is important for directory-based file system implementations, as omitting it would result in effectively listing the full contents of the table and would likely break clients because there are too many results to hold in memory.
When you say "directory based file system implementations", do you mean an implementation of the catalog, or FileIO? And what is the use case of this? I thought the only place we are doing something like this is in orphan file removal, but for that we are just iterating through the directories to find matching files; we are holding all the matching files in memory, not holding everything in memory.
This is for FileIO, not catalog related.
This is related to orphan file removal. The hashing is all performed at the top level of the
The hard-coded default is 3. Usually there is a /data/ folder, which takes 1 level of depth, but that might not always be the case. So if we stick to that default, then we should ensure 3 levels of depth.
We should potentially introduce a separator concept.
What do you think about this approach @danielcweeks:
It might be a nice win-win, both solving the sparse directory problem and orphan clean-up. As @jackye1995 mentioned, it is a bit weird for partitioned paths, but we can do it for non-partitioned paths only, so it would look like:
I like removing the additional path when
I'm just wondering about where we want to put the slashes in the bit field. Breaking it up by three bits means that each recursive listing just operates on 8 subpaths, which feels too small. I feel like we might want to move to four bits (i.e. 16 subpaths per level).
Sounds good @danielcweeks, I will work on this change and update the PR!
@danielcweeks @jackye1995 Updated the change to divide the entropy into dirs, so we now follow this format:
Set the dir depth to 3 and dir length to 4 as discussed.
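As a rough illustration of the layout described above (a sketch under the stated parameters of 20 bits of entropy, dir depth 3, and dir length 4; the helper name, hash function, and exact delimiters are assumptions, not the merged code):

    // Assumes Guava's Hashing and Strings utilities are available, as in the rest of the provider.
    private static String entropyPath(String fileName) {
      int hash = Hashing.murmur3_32_fixed().hashString(fileName, StandardCharsets.UTF_8).asInt();
      // 20 bits of entropy rendered as base2, e.g. "01110110101000111110"
      String bits = Strings.padStart(Integer.toBinaryString(hash & 0xFFFFF), 20, '0');
      // Three 4-character dirs, with the remaining 8 bits kept in front of the file name,
      // e.g. "0111/0110/1010/00111110-<fileName>"
      return bits.substring(0, 4) + '/' + bits.substring(4, 8) + '/' + bits.substring(8, 12)
          + '/' + bits.substring(12) + '-' + fileName;
    }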
@danielcweeks any further comments?
@ookumuso a couple of small remaining comments:
@danielcweeks thanks:
Any additional concerns @danielcweeks?
Just waiting on checks. Thanks @ookumuso!
Also looks good to me! Thanks for the work!
Looks like CI has passed, merging. Thanks for the work and patience @ookumuso, and thanks for the review @danielcweeks!
The S3 team is introducing S3LocationProvider, a replacement for ObjectStorageLocationProvider that is better suited to the performance of Iceberg workloads running on Amazon S3. This configuration applies to Iceberg workloads using General Purpose and Directory Buckets in Amazon S3, and we expect it to further improve the throughput of Iceberg workloads. Although this change can benefit all Iceberg workloads, the degree of improvement may vary based on specific workload characteristics.
This implementation changes the hash scheme from 32-bit base64 to 20-bit base2. The reduced character range allows S3 to automatically scale request capacity more quickly to match the demands of the target workload and reduce the amount of time that workloads observe throttle responses. A 20-bit hash allows for 2^20 possible prefixes.
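A minimal sketch of what deriving a 20-bit base2 prefix could look like (the hash function and padding shown here are assumptions, not necessarily the exact merged code):

    int hash = Hashing.murmur3_32_fixed().hashString(fileName, StandardCharsets.UTF_8).asInt();
    // Keep only the low 20 bits and render them in base2: 2^20 = 1,048,576 possible prefixes,
    // each built from just the characters '0' and '1'.
    String prefix = Strings.padStart(Integer.toBinaryString(hash & 0xFFFFF), 20, '0');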
This implementation also changes the path structure to reduce the number of directories created. The partition directories are eliminated, and the position of the hash is moved to before the data filename.