[ENHANCEMENT] - Better unicode matching #35

Merged: 2 commits, Oct 28, 2022

Conversation

joshmcrae

This PR updates the regular expression used to match multibyte Unicode codepoints when finding emoji, basing it on the list of emoji that the library is aware of. The previous regex was outdated and missed emoji that were added more recently.

Updating Resources

The library already generates an array of shortcode-to-codepoint mappings through the composer update-resources script. An additional step has been added to this script: it takes every codepoint in that list, converts it to UTF-8, and generates regex patterns to match the resulting byte sequences.
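The conversion step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the library's PHP code; the SHORTCODES sample, utf8_hex and naive_pattern are hypothetical names, and the sketch assumes each shortcode maps to a single codepoint. It emits one literal alternative per codepoint, with no range compression.

```python
import re

# Hypothetical sample of shortcode-to-codepoint mappings (hex strings).
SHORTCODES = {
    "smile": "1f604",
    "rocket": "1f680",
    "large_blue_circle": "1f535",
}

def utf8_hex(codepoint_hex):
    """Return the UTF-8 bytes of a codepoint as two-digit hex strings."""
    encoded = chr(int(codepoint_hex, 16)).encode("utf-8")
    return ["%02x" % byte for byte in encoded]

def naive_pattern(mappings):
    """Build an alternation with one literal byte sequence per codepoint."""
    alternatives = sorted(
        "".join("\\x" + byte for byte in utf8_hex(cp))
        for cp in mappings.values()
    )
    return "(?:" + "|".join(alternatives) + ")"

# Compile as a bytes pattern so that the \xNN escapes match raw UTF-8 bytes.
regex = re.compile(naive_pattern(SHORTCODES).encode("ascii"))
```

A pattern built this way grows by one alternative per emoji, which is the growth that grouping bytes into ranges is meant to limit.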

To reduce the number of patterns within the regex, the first two bytes of each pattern are matched literally, while the third and fourth bytes are matched as ranges. Contiguous values of the third byte are merged into a single range, and the range for the fourth byte covers every value encountered under a given third byte. As a result, the expression matches more codepoints than our list of shortcodes supports, but matched codepoints that have no shortcode are left unchanged anyway.

Test Suite

The new test case testUnicodeMatching() verifies that every codepoint for which we have a shortcode is matched by the regex.
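The idea behind such a check can be sketched as follows. This is an illustrative Python analogue, not the actual PHP test; SHORTCODES, the sample MB_REGEX pattern and unmatched_shortcodes are hypothetical stand-ins for the generated resources, and each shortcode is assumed to map to a single codepoint.

```python
import re

# Hypothetical stand-ins for the generated mapping and regex.
SHORTCODES = {"smile": 0x1F604, "rocket": 0x1F680}
MB_REGEX = re.compile(rb"\xf0\x9f[\x98-\x9a][\x80-\x84]")

def unmatched_shortcodes(shortcodes, regex):
    """Return the shortcodes whose UTF-8 encoding the regex fails to match."""
    return sorted(
        name
        for name, codepoint in shortcodes.items()
        if regex.fullmatch(chr(codepoint).encode("utf-8")) is None
    )
```

A test in this style passes when the returned list is empty, and otherwise reports exactly which emoji the regex misses.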

@joshmcrae joshmcrae requested a review from bensinclair October 28, 2022 02:11
@joshmcrae joshmcrae linked an issue Oct 28, 2022 that may be closed by this pull request

@bensinclair bensinclair left a comment

🚀

Looks great!

@joshmcrae joshmcrae linked an issue Oct 28, 2022 that may be closed by this pull request
@joshmcrae joshmcrae merged commit f13cf10 into master Oct 28, 2022
@joshmcrae joshmcrae deleted the enhance-unicode-matching branch October 28, 2022 02:32
@joshmcrae

Pre-released 4.3.0-alpha.

@joshmcrae

Released as 4.3.0.

Successfully merging this pull request may close these issues.

Missing geometric emojis in regex
2 emojies not matched by MB_REGEX