Support special chars in S3URI #10283
Conversation
* Remove code that interpreted URI query and fragment parts according to RFC 3986. In practice, S3 locations do not encode special chars and, therefore, do not really have query and fragment parts.
* Add TestS3FileIOMinio for a small subset of tests using Minio as a realistic S3 protocol implementation for validating the handling of special chars.

Fixes apache#10279
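To illustrate the behavior difference at stake, here is a minimal, self-contained sketch (not the actual S3URI code; the bucket and key are made up) contrasting RFC 3986 parsing with treating `?` and `#` as literal key characters:

```java
import java.net.URI;

public class S3KeyParsingSketch {
    public static void main(String[] args) {
        // Hypothetical location whose key contains '#' and '?' verbatim.
        String location = "s3://bucket/table/data#part=a?x/file.parquet";

        // RFC 3986 interpretation: '#' starts the fragment, so most of the key
        // is split away from the path.
        URI uri = URI.create(location);
        System.out.println(uri.getPath());      // /table/data
        System.out.println(uri.getFragment());  // part=a?x/file.parquet

        // Verbatim interpretation: everything after the bucket is the S3 key,
        // matching how the object is actually addressed by the S3 API.
        String key = location.substring("s3://bucket/".length());
        System.out.println(key);                // table/data#part=a?x/file.parquet
    }
}
```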
We shouldn't remove the param/fragment handling because, even though they aren't typically used, they are accepted by S3 and there are cases where they are used. This would change the existing behavior.
As is (not so widely) known, special characters in S3 URIs, i.e. URIs containing, for example, (back)quotes and hash/question marks, are not properly escaped by Iceberg. This makes those locations/URIs violate RFC 2396, causing trouble even with AWS SDK functionality (see, for example, …).
@jackye1995: What do you think about special chars in S3 URIs? I'd appreciate your insight. Thanks!
@dimas-b Taking a look at the original issue, it appears the problem is more likely that we're not URL-encoding the column name in the path like we do with the partition value. For example, if you used a string column for the partition, you would see the following path:
We URL-encode the partition value to avoid this type of issue and should probably do so for the column name as well. S3 supports a lot of pathing that is problematic (e.g. the double slash issue), but I think we want to avoid supporting those behaviors and be more restrictive about what we allow. This is a larger ongoing discussion and we may even want to put something into the spec v3 to clarify valid pathing.
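For illustration, a rough sketch (hypothetical column name and value; not Iceberg's actual escaping code) of URL-encoding both the partition value and the column name:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PartitionPathSketch {
    public static void main(String[] args) {
        String columnName = "col#1";      // hypothetical column name with a special char
        String partitionValue = "a/b?c";  // hypothetical value with reserved chars

        // Encoding keeps '#', '/', and '?' out of the resulting key segment.
        String encodedValue = URLEncoder.encode(partitionValue, StandardCharsets.UTF_8);
        String encodedColumn = URLEncoder.encode(columnName, StandardCharsets.UTF_8);

        System.out.println(columnName + "=" + encodedValue);    // col#1=a%2Fb%3Fc
        System.out.println(encodedColumn + "=" + encodedValue); // col%231=a%2Fb%3Fc
    }
}
```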
Encoding special chars in partition path elements sounds like a good idea, but I'm not sure it is that simple. The problem is not specific to what Iceberg code puts into directory names. The base location of a table may also include special chars. How do you propose to deal with special chars in base paths?

Also, with Iceberg URL-encoding special chars in the file locations, the interpretation of those locations requires special logic. They cannot be interpreted according to RFC 3986, because the URI parsing rules require decoding those chars, which subsequently leads to mismatches at the S3 API level (the latter interpreting key parameters verbatim).

As I noted in the comments, this PR attempts to interpret S3 URIs in a manner consistent with what the AWS S3 UI produces. For example, if one creates a directory named …
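A small sketch of the mismatch described above, with a made-up bucket and key (this shows standard java.net.URI behavior, not code from this PR):

```java
import java.net.URI;

public class DecodingMismatchSketch {
    public static void main(String[] args) {
        // Suppose a data file was written under the literal S3 key
        // "table/col%231/file.parquet" (i.e. the escape is part of the key).
        String location = "s3://bucket/table/col%231/file.parquet";

        // RFC 3986 parsing decodes the percent-escape when extracting the path...
        String decodedKey = URI.create(location).getPath().substring(1);
        System.out.println(decodedKey); // table/col#1/file.parquet

        // ...but the S3 API treats keys verbatim, so a request for the decoded key
        // does not match the object stored under "table/col%231/file.parquet".
    }
}
```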
@dimas-b Unfortunately, I agree there will likely be problems if people use special characters in the pathing prefix that's outside of Iceberg's control, but overall, I think we want to discourage (if not prohibit through the spec) usage of unencoded characters. While S3 supports this behavior, it causes a number of compatibility problems. For example, this change would likely be incompatible with S3AFileSystem or any other system that uses the Java URI implementation. We have this same type of issue with GCS, but for different reasons. I'm not sure about other implementations like pyarrow, OpenDAL in Rust, or Trino. Overall, I think it's better to fail fast where interoperability is a concern, as that's more important than supporting the full S3 key space. I would suggest:
I believe we are already in this situation even without this PR, because Iceberg already permits quotes (for example) to be used in column names, but the corresponding S3 locations will not be parseable by …
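As a concrete (hypothetical) example of that point, assuming the consumer parses locations with java.net.URI:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QuoteInLocationSketch {
    public static void main(String[] args) {
        // Hypothetical file location for a column whose name contains quotes.
        String location = "s3://bucket/table/\"quoted\"_col=x/file.parquet";
        try {
            new URI(location);
        } catch (URISyntaxException e) {
            // '"' is not a legal URI character, so parsing fails here.
            System.out.println("Not a valid URI: " + e.getMessage());
        }
    }
}
```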
@dimas-b I just put up #10329 to address the field name encoding. This should address the quotes issue as well, since it will be encoded. It doesn't do anything to address other ways that paths can get introduced (including the path leading up to the table location), but that's a larger topic, and I still feel we want to discourage (if not disallow) special characters in paths due to cross-compatibility issues.
> Overall, I think it's better to fail fast where interoperability is a concern, as that's more important than supporting the full S3 key space.
After seeing a variety of these kinds of issues, I think this is the conclusion I'm at as well. The alternative is that every Iceberg implementation needs to handle all these cases, and that's probably too much complexity to be worth it.

I'm in the camp of trying to disallow special characters, but we still have not defined a notion of what counts as an acceptable path, and I think that's the part we would want to spec out in v3?
I made a comment under #10329 proposing an escaping method that would not conflict with the URI RFC. I think that should effectively resolve the special chars problem as far as Iceberg code is concerned. On top of that, discouraging/disallowing special chars in base locations also sounds good to me.
See #10329 |
CC: @danielcweeks