-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util/collate: implement utf8mb4_0900_ai_ci collation | tidb-test=pr/2192 #45650
Conversation
Skipping CI for Draft Pull Request. |
Skipping CI for Draft Pull Request. |
0ae3f6e
to
96c88ef
Compare
e1a0998
to
8a9521f
Compare
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #45650 +/- ##
================================================
- Coverage 74.7681% 72.9298% -1.8383%
================================================
Files 1293 1305 +12
Lines 398159 402188 +4029
================================================
- Hits 297696 293315 -4381
- Misses 81103 90417 +9314
+ Partials 19360 18456 -904
Flags with carried forward coverage won't be shown. Click here to find out more.
|
f3bd06a
to
27c1ce3
Compare
1863bc2
to
2cec021
Compare
/retest |
d51c37d
to
941415c
Compare
/retest |
@dveeden The license of unicode data file is: https://www.unicode.org/license.txt . It's seems fine. |
Yes that indeed seems to be fine. I think we should include a copy of the license in the
Might be good to use the SPDX ID and name this See also: https://spdx.org/licenses/Unicode-DFS-2016.html It might be useful to follow what GitHub expects: https://github.blog/changelog/2022-05-26-easily-discover-and-navigate-to-multiple-licenses-in-repositories/ |
7fc7f52
to
35e908a
Compare
@dveeden I think this error is compatible with MySQL. MySQL also gives the same error:
The document also approves this behavior:
|
c861b9d
to
a0294a6
Compare
I've checked that the The test program:
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :-)
Signed-off-by: Yang Keao <[email protected]>
a0294a6
to
98c7cb3
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bb7133, CbcWestwolf The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
/retest |
What problem does this PR solve?
Issue Number: close #37566
Problem Summary:
Add
utf8mb4_0900_ai_ci
collation to the TiDB.This PR is 400-600 lines long, quite big. (A lot of lines in the 100k+ are generated or downloaded from unicode.org, so don't be too scared).
What is changed and how it works?
unicode_ci
implementation to include the code generator in the source files. (Also add tests to make sure they are still the same).utf8mb4_0900_ai_ci
collation.unicode_ci
, through code generation.This PR uses code generation in two places: one for the data (from allkeys.txt), one for the implementation (to reuse the existing
unicode_ci
collation).The second choice is hard, a more normal way is to use
interface
or something to provide different tiny implementation based on a more general UCA framework. However, as I've tested, if the related functions are not inlined (especially when it's an interface method call, or a generic method call), the performance can degrade 20%-50%.TODO
uca_data.h
.Benchmark
It becomes faster, and the
0900AICI
is faster thanUnicodeCI
in the benchmark, but I don't know why 🤦 .Before this PR
After this PR
Check List
Tests
Release note