New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) #46511

Closed

uros-db wants to merge 21 commits into apache:master from uros-db:alter-lcase-impl

Contributor

uros-db commented May 9, 2024 •

edited

Loading

What changes were proposed in this pull request?

String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example: contains("İ", "i"); now returns false, because there exists no start, len such that lowercase(substring("İ", start, len)) == "i".

Why are the changes needed?

Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

Does this PR introduce any user-facing change?

Yes, behaviour of contains, startswith, endswith, and locate/position expressions is changed for edge cases with one-to-many case mapping.

How was this patch tested?

New unit tests in CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.


          Fix string search in contains/startsWith/endsWith

4981a14

github-actions bot added the SQL label

uros-db added 4 commits

May 9, 2024 17:55


          Add tests

b5794fc


          Fix locate/position

dd261bf


          Merge branch 'master' into alter-lcase-impl

31b757b


          Fix CollationAwareUTF8String.java

35dd21a

mkaravel reviewed

View reviewed changes

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java Show resolved Hide resolved


          Fix UTF8_BINARY_LCASE string search

0baa713

uros-db requested a review from mkaravel

May 10, 2024 13:15

mkaravel reviewed

View reviewed changes

Contributor

mkaravel left a comment

Looked only at test cases. They looked good to me. The suggestions are made in order to make the test cases more readable.
Will go over the code separately.

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Outdated Show resolved Hide resolved

mkaravel reviewed

View reviewed changes

Contributor

mkaravel left a comment

A few more comments on the code.

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java Outdated Show resolved Hide resolved

uros-db added 10 commits

May 14, 2024 11:17


          Fix tests

37e44ad


          Minor code fixes

6f0ceb5


          Add doc comments

aa91b15


          Add back some tests

f72998c


          Add support for StringInStr & SubstringIndex

86bf00b


          Unicode java linter checks

77904fa


          Java style fix

5b47499


          Undo some changes

aceeba8


          Style fixes


          Update comments

30774a3

uros-db changed the title ~~[WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation~~ [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)


          Merge branch 'apache:master' into alter-lcase-impl

uros-db requested a review from mkaravel

May 15, 2024 19:02

mkaravel reviewed

View reviewed changes

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Show resolved Hide resolved

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java Outdated Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java Show resolved Hide resolved

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

@@ @@ -709,12 +774,24 @@ public void testLocate() throws SparkException { @@
                   assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9);
                   assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1);
                   // Case-variable character length
+                  assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);

Contributor

mkaravel May 16, 2024

Suggested change

      
                assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);
          
                assertLocate("\u0307", "i̇", 1, "UTF8_BINARY_LCASE", 2);

Contributor

mkaravel May 16, 2024

I am not sure what the intent of this test case is here. If it is for UTF8_BINARY_LCASE then we have it again below.

Contributor Author

uros-db May 17, 2024

the intent is: show that "\u0307" is found in "i̇" on the binary level (with respect to UTF8_BINARY)

but "\u0307" is not found in "İ" on the character level (with respect to UTF8_BINARY_LCASE)

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

@@ @@ -709,12 +774,24 @@ public void testLocate() throws SparkException { @@
                   assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9);
                   assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1);
                   // Case-variable character length
+                  assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);
+                  assertLocate("\u0307", "İ", 1, "UTF8_BINARY_LCASE", 0); // != UTF8_BINARY

Contributor

mkaravel May 16, 2024

I think the comment is wrong here.

Contributor Author

uros-db May 17, 2024

not wrong, but the intent was a bit different than other cases: instead of comparing differences between UNICODE_CI and UTF8_BINARY_LCASE, I wanted to compare UTF8_BINARY and UTF8_BINARY_LCASE - essentially ensuring that the new UTF8_BINARY_LCASE (character-wise) searching logic does not equal to applying UTF8_BINARY (byte-wise) searching logic on lowercased strings

mkaravel mentioned this pull request

[SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) #46589

Closed


          Small fixes

68bada3

uros-db requested a review from mkaravel

May 17, 2024 07:46


          Small fixes

8f59422

mkaravel approved these changes

View reviewed changes

Contributor

mkaravel left a comment

Thank you for patiently addressing my comments.
The PR looks good to me.

Contributor

mkaravel commented May 20, 2024

Please fill in the PR description.

uros-db changed the title ~~[WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)~~ [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)

uros-db commented

View reviewed changes

Contributor Author

uros-db left a comment

updated PR description & title
@cloud-fan ready for review

Contributor

mkaravel commented May 22, 2024

I am reading this in the PR description:

Why are the changes needed?
Fix functions that give unexpected results due to the conditional case mapping when performing string comparisons under UTF8_BINARY_LCASE.

This is not really the reason we we are making this change (conditional case mapping is the reason why we want to change the comparator for UTF8_BINARY_LCASE, but not related to string search directly).

The real reason is that we need to do string search at the UT8 char-by-char level because otherwise indices returned or considered during string search can return strange and unusable results due to the fact that case mappings are one-to-many, that is one UTF8 character can map to two or more UTF8 characters under the case mapping (both for upper and lower casing).

cloud-fan reviewed

View reviewed changes

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java

                   }
                   public static int execLowercase(final UTF8String string, final UTF8String substring,
                       final int start) {
-                    return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Contributor

cloud-fan May 22, 2024

to confirm, the previous implementation here is correct, right?

Contributor Author

uros-db May 23, 2024

no, unfortunately it's not - while it works fine for ASCII, it actually gives wrong results in some special cases featuring conditional case mapping, when a character has a lowercase equivalent that consists of multiple characters, or is found at a particular place in the string (context-awareness)

Contributor Author

uros-db May 23, 2024 •

edited

Loading

so as part of this PR, we actually changed the core definition of string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring (pattern) to be found in another string (target) under UTF8_BINARY_LCASE

in the old implementation, contains("İ", "i") would return true - however, this behaviour is incorrect because it relies on the fact that substr(lower("İ"), 1, 1) == "i" (incorrect, old implementation), instead of lower(substr("İ", 1, 1)) != "i" (correct, new implementation)

and this is all due to the fact that lower("İ") = "i\u0307" (1 uppercase character -> 2 lowercase characters)

cloud-fan reviewed

View reviewed changes

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java

+                    int endPos) {
+                  assert endPos <= target.numChars();
+                  for (int len = 0; len <= endPos; ++len) {
+                    if (target.substring(endPos - len, endPos).toLowerCase().equals(lowercasePattern)) {

Contributor

cloud-fan May 22, 2024

is there a way to do this comparison with only one memory copy? Right now it's two: substring and toLowerCase

Contributor

cloud-fan May 22, 2024

I don't know the details very well. Is it the same if we lower case the entire target string first, and then call substring repeatedly?

Contributor Author

uros-db May 22, 2024 •

edited

Loading

Is it the same if we lower case the entire target string first

no, it's not - consider the example in the PR description: contains("İ", "i") where haystack = "İ" & needle = "i"; then we have: lower(haystack) = lower("İ") = "i\u0307" would produce substr(lower("İ"), 1, 1) = "i" = needle

however, we can see that these is no substr(haystack, start, len) such that lower(substr(haystack, start, len)) == needle - so these 2 behaviours are not actually equivalent (as we first assumed), and the reason for this discrepancy lies in conditional one-to-many case mapping (certain characters have lowercase equivalents consisted of multiple characters)

Contributor Author

uros-db May 22, 2024

that said, we are focusing on correctness for now, and are aware of possible performance regression for string searching in UTF8_BINARY_LCASE - we intend to work on perf optimization in a subsequent PR

uros-db mentioned this pull request

[SPARK-48282][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringReplace, FindInSet) #46682

Closed

uros-db added 2 commits

May 24, 2024 15:11


          Update new method access

4b9b83e


          Update doc comment

eb602e8

dbatomic approved these changes

View reviewed changes

cloud-fan approved these changes

View reviewed changes

Contributor

cloud-fan commented May 28, 2024

thanks, merging to master!

cloud-fan closed this in

riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request


          [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE co…

796975d

…llation (Contains, StartsWith, EndsWith, StringLocate)

### What changes were proposed in this pull request?
String searching in UTF8_BINARY_LCASE now works on character-level, rather than on byte-level. For example: `contains("İ", "i");` now returns **false**, because there exists no `start, len` such that `lowercase(substring("İ", start, len)) == "i"`.

### Why are the changes needed?
Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).

### Does this PR introduce _any_ user-facing change?
Yes, behaviour of `contains`, `startswith`, `endswith`, and `locate`/`position` expressions is changed for edge cases with one-to-many case mapping.

### How was this patch tested?
New unit tests in `CollationSupportSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46511 from uros-db/alter-lcase-impl.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

uros-db mentioned this pull request

[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924

Closed

cloud-fan pushed a commit that referenced this pull request


          [SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE

aad6771

### What changes were proposed in this pull request?
Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`.

### Why are the changes needed?
As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in this PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: #46700.

### Does this PR introduce _any_ user-facing change?
Yes, what was previously named `UTF8_BINARY_LCASE` collation, will from now on be named `UTF8_LCASE`.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46924 from uros-db/rename-lcase.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

SQL