-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) #46511
Conversation
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked only at test cases. They looked good to me. The suggestions are made in order to make the test cases more readable.
Will go over the code separately.
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments on the code.
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Show resolved
Hide resolved
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
Show resolved
Hide resolved
@@ -709,12 +774,24 @@ public void testLocate() throws SparkException { | |||
assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9); | |||
assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1); | |||
// Case-variable character length | |||
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2); | |
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY_LCASE", 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what the intent of this test case is here. If it is for UTF8_BINARY_LCASE then we have it again below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the intent is: show that "\u0307" is found in "i̇" on the binary level (with respect to UTF8_BINARY)
but "\u0307" is not found in "İ" on the character level (with respect to UTF8_BINARY_LCASE)
@@ -709,12 +774,24 @@ public void testLocate() throws SparkException { | |||
assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9); | |||
assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1); | |||
// Case-variable character length | |||
assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2); | |||
assertLocate("\u0307", "İ", 1, "UTF8_BINARY_LCASE", 0); // != UTF8_BINARY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not wrong, but the intent was a bit different than other cases: instead of comparing differences between UNICODE_CI and UTF8_BINARY_LCASE, I wanted to compare UTF8_BINARY and UTF8_BINARY_LCASE - essentially ensuring that the new UTF8_BINARY_LCASE (character-wise) searching logic does not
equal to applying UTF8_BINARY (byte-wise) searching logic on lowercased strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for patiently addressing my comments.
The PR looks good to me.
Please fill in the PR description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated PR description & title
@cloud-fan ready for review
I am reading this in the PR description:
This is not really the reason we we are making this change (conditional case mapping is the reason why we want to change the comparator for UTF8_BINARY_LCASE, but not related to string search directly). The real reason is that we need to do string search at the UT8 char-by-char level because otherwise indices returned or considered during string search can return strange and unusable results due to the fact that case mappings are one-to-many, that is one UTF8 character can map to two or more UTF8 characters under the case mapping (both for upper and lower casing). |
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final UTF8String substring | |||
} | |||
public static int execLowercase(final UTF8String string, final UTF8String substring, | |||
final int start) { | |||
return string.toLowerCase().indexOf(substring.toLowerCase(), start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to confirm, the previous implementation here is correct, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, unfortunately it's not - while it works fine for ASCII, it actually gives wrong results in some special cases featuring conditional case mapping, when a character has a lowercase equivalent that consists of multiple characters, or is found at a particular place in the string (context-awareness)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so as part of this PR, we actually changed the core definition of string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring (pattern) to be found in another string (target) under UTF8_BINARY_LCASE
in the old implementation, contains("İ", "i")
would return true
- however, this behaviour is incorrect because it relies on the fact that substr(lower("İ"), 1, 1)
== "i"
(incorrect, old implementation), instead of lower(substr("İ", 1, 1))
!= "i"
(correct, new implementation)
and this is all due to the fact that lower("İ") = "i\u0307"
(1 uppercase character -> 2 lowercase characters)
int endPos) { | ||
assert endPos <= target.numChars(); | ||
for (int len = 0; len <= endPos; ++len) { | ||
if (target.substring(endPos - len, endPos).toLowerCase().equals(lowercasePattern)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to do this comparison with only one memory copy? Right now it's two: substring
and toLowerCase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the details very well. Is it the same if we lower case the entire target string first, and then call substring
repeatedly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the same if we lower case the entire target string first
no, it's not - consider the example in the PR description: contains("İ", "i")
where haystack = "İ"
& needle = "i"
; then we have: lower(haystack) = lower("İ") = "i\u0307"
would produce substr(lower("İ"), 1, 1) = "i" = needle
however, we can see that these is no substr(haystack, start, len)
such that lower(substr(haystack, start, len)) == needle
- so these 2 behaviours are not actually equivalent (as we first assumed), and the reason for this discrepancy lies in conditional one-to-many case mapping (certain characters have lowercase equivalents consisted of multiple characters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that said, we are focusing on correctness for now, and are aware of possible performance regression for string searching in UTF8_BINARY_LCASE - we intend to work on perf optimization in a subsequent PR
thanks, merging to master! |
…llation (Contains, StartsWith, EndsWith, StringLocate) ### What changes were proposed in this pull request? String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example: `contains("İ", "i");` now returns **false**, because there exists no `start, len` such that `lowercase(substring("İ", start, len)) == "i"`. ### Why are the changes needed? Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above). ### Does this PR introduce _any_ user-facing change? Yes, behaviour of `contains`, `startswith`, `endswith`, and `locate`/`position` expressions is changed for edge cases with one-to-many case mapping. ### How was this patch tested? New unit tests in `CollationSupportSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#46511 from uros-db/alter-lcase-impl. Authored-by: Uros Bojanic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`. ### Why are the changes needed? As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in this PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: #46700. ### Does this PR introduce _any_ user-facing change? Yes, what was previously named `UTF8_BINARY_LCASE` collation, will from now on be named `UTF8_LCASE`. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46924 from uros-db/rename-lcase. Authored-by: Uros Bojanic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example:
contains("İ", "i");
now returns false, because there exists nostart, len
such thatlowercase(substring("İ", start, len)) == "i"
.Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).
Does this PR introduce any user-facing change?
Yes, behaviour of
contains
,startswith
,endswith
, andlocate
/position
expressions is changed for edge cases with one-to-many case mapping.How was this patch tested?
New unit tests in
CollationSupportSuite
.Was this patch authored or co-authored using generative AI tooling?
No.