-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Url encode field names for partition paths #10329
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ | |
import java.util.Map; | ||
import org.apache.iceberg.io.LocationProvider; | ||
import org.apache.iceberg.relocated.com.google.common.base.Splitter; | ||
import org.apache.iceberg.types.Types; | ||
import org.junit.jupiter.api.TestTemplate; | ||
import org.junit.jupiter.api.extension.ExtendWith; | ||
|
||
|
@@ -285,4 +286,22 @@ public void testObjectStorageWithinTableLocation() { | |
assertThat(parts).element(2).asString().isNotEmpty(); | ||
assertThat(parts).element(3).asString().isEqualTo("test.parquet"); | ||
} | ||
|
||
@TestTemplate | ||
public void testEncodedFieldNameInPartitionPath() { | ||
danielcweeks marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// Update the table to use a string field for partitioning with special characters in the name | ||
table.updateProperties().set(TableProperties.OBJECT_STORE_ENABLED, "true").commit(); | ||
table.updateSchema().addColumn("data#1", Types.StringType.get()).commit(); | ||
table.updateSpec().addField("data#1").commit(); | ||
Comment on lines
+294
to
+295
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could we add a test for some other atypical characters like "$" or "&"? Some character from the "Characters that might require special handling" section in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They recommend URL encoding the values, which is what we're doing here. I'm not sure how much value there is in using a range of characters since we are just validating that url encoding is happening. The only one that immediately stands out to me is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe nitpicking, but in my reading of the AWS doc linked above I do not think they are recommending URL encoding. The doc's language is pretty vague. In fact, I suspect URL encoding even partition key values here will cause incorrect processing in tools that use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Example:
Outcome:
Note that Iceberg, by virtue of having its own There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For now, we'll just make this consistent with what we're doing for the values. I agree we should probably up-level this discussion and figure out a path forward that is more usable with other URI parsing utilities. |
||
|
||
// Use a partition value that has a special character | ||
StructLike partitionData = TestHelpers.CustomRow.of(0, "val#1"); | ||
|
||
String fileLocation = | ||
table.locationProvider().newDataLocation(table.spec(), partitionData, "test.parquet"); | ||
List<String> parts = Splitter.on("/").splitToList(fileLocation); | ||
String partitionString = parts.get(parts.size() - 2); | ||
|
||
assertThat(partitionString).isEqualTo("data%231=val%231"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix looks reasonable to me as far as #10279 is concerned.
However, this will cause partition paths to change in existing tables that have special characters that previously did not cause correctness issues, for example quotes (
"
). So new files under the same partition key will not share a common prefix with old files under that same partition key... Just want to make sure this is intentional.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that's unavoidable. However, the physical path is just for practical/visual validation and isn't necessary for the correct operation of the table. I would also suspect that the number of tables where there's a field name and partitioning that would be affected is rather small.
I looked for cases that would be problematic for existing paths, but usage of this is rather isolated to the location providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the value of file prefixes is not critical, why not use non-URI encoding and avoid the URI interoperability problems altogether (note my other comment). For example, the
escape()
method could produce values using only non-special chars according to the the URI RFC (that is avoid even using%
for escaping). As far as I understand the escaping method does not even have to be reversible.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimas-b In general, I agree with you. It would be good to not include any characters that are problematic.
To give a little context around how we ended up here, this layout is largely to provide similarity with Hive-style pathing so people transitioning over felt somewhat comfortable with the layout. Hive encodes some characters, but in a way that would also cause problems mentioned previously. I feel like most systems at this point are wary of any URI parsing because of this legacy behavior and cloud storage has made it more complicated. Tables that are migrated will likely have these issues, so we're largely stuck with supporting them for a while.
I'm not convinced that we want to move away from a reversible encoding because there's an existing expectation that you could recover by inferring the partitioning from the path structure.
The direction I would like to head overall is to provide the option to omit the path structure entirely. In Iceberg, the physical layout is entirely decoupled from the logical partitioning, so there's really no need for it. The additional pathing has a number of downsides in addition to the character encoding (long keys, type erasure, etc.).
For now, I think the simple path forward is to encode the field names like always has been done with values and then either introduce better encodings or remove them entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I comment above, this PR does fix #10279, which is welcome.
I'm looking forward to further improvements in Iceberg to improve interoperability with location strings stored in its metadata files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this PR make the recovery situation worse? For example, in an old path, one could have
s3://bucket/path/id%20=1/...
(no encoding), in the new path it will bes3://bucket/path/id%2520=1/...
... How does one figure out what to decode and when?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess one approach to deal with such a case would be to decode the path string first before applying the encoding in order to not cause double-encoding:
But this also isn't an ideal solution and so I agree that long term we might want to consider better encoding or remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
%20
could be part of the column name (or value) as in my example in #10279, so decoding the original string may not be applicable to all cases.... but I did not test that particular name in practice :)