[ENHANCEMENT] - Better unicode matching #35
Merged
This PR updates the regular expression used to find emoji by matching multibyte Unicode codepoints. The expression is now generated from the list of emoji that the library is aware of. The previous regex was outdated and missed emoji added in recent Unicode releases.
Updating Resources
The library already supports generation of an array of shortcode-to-codepoint mappings through the `composer update-resources` script. An additional step has been added to this script which takes all codepoints from this list, converts them to UTF-8, and generates regex patterns to match them.

To reduce the number of patterns within the regex, the first two bytes of each pattern are specific, while the third and fourth operate on ranges of code units. Ranges are generated from contiguous values in the third byte, while all encountered values of the fourth byte within a given third byte are included in a range. As a result, the expression matches more codepoints than our list of shortcodes supports, but codepoints without a shortcode are left unchanged anyway, so the extra matches have no effect on the output.
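The range-building step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the library's PHP code: the function name `build_pattern` is hypothetical, only four-byte UTF-8 sequences are handled, and the fourth-byte range is assumed to span the lowest to highest value seen within each run of contiguous third bytes.

```python
import re
from collections import defaultdict
from itertools import groupby


def build_pattern(codepoints):
    """Build a byte-level regex source for four-byte UTF-8 codepoints (sketch)."""
    # Group fourth-byte values by two-byte prefix, then by third byte.
    by_prefix = defaultdict(lambda: defaultdict(set))
    for cp in codepoints:
        encoded = chr(cp).encode("utf-8")
        if len(encoded) != 4:
            continue  # assumption: this sketch only covers four-byte sequences
        by_prefix[encoded[:2]][encoded[2]].add(encoded[3])

    alternatives = []
    for prefix in sorted(by_prefix):
        thirds = by_prefix[prefix]
        ordered = sorted(thirds)
        # value - index is constant within a run of consecutive integers.
        for _, group in groupby(enumerate(ordered), key=lambda p: p[1] - p[0]):
            run = [value for _, value in group]
            # Collect every fourth byte seen under any third byte in this run.
            fourths = set().union(*(thirds[t] for t in run))
            alternatives.append(
                "".join(f"\\x{b:02X}" for b in prefix)
                + f"[\\x{run[0]:02X}-\\x{run[-1]:02X}]"
                + f"[\\x{min(fourths):02X}-\\x{max(fourths):02X}]"
            )
    return "(?:" + "|".join(alternatives) + ")"


# U+1F600 and U+1F603 share their first three bytes, so the fourth-byte
# range also covers U+1F601 and U+1F602, which are not in the input list.
regex = re.compile(build_pattern([0x1F600, 0x1F603, 0x1F680]).encode("ascii"))
```

Collapsing the trailing bytes into ranges keeps the number of alternatives small at the cost of over-matching, which is the trade-off the description accepts.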
Test Suite
The new test case `testUnicodeMatching()` was introduced to verify that all codepoints for which we have a shortcode are matched by the regex.
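The idea behind such a check can be sketched as below. This is a Python illustration under stated assumptions, not the contents of `testUnicodeMatching()`: the helper `unmatched_shortcodes`, the two-entry `SHORTCODES` map, and the hand-written `PATTERN` are all hypothetical stand-ins, and each shortcode is assumed to map to a single codepoint.

```python
import re


def unmatched_shortcodes(shortcode_map, pattern):
    """Return the shortcodes whose UTF-8 bytes the pattern does not fully match."""
    return [
        shortcode
        for shortcode, codepoint in shortcode_map.items()
        if pattern.fullmatch(chr(codepoint).encode("utf-8")) is None
    ]


# Hypothetical stand-ins for the generated shortcode list and regex.
SHORTCODES = {":grinning:": 0x1F600, ":rocket:": 0x1F680}
PATTERN = re.compile(rb"\xF0\x9F[\x98-\x9A][\x80-\x80]")

# The check passes when no shortcode is left unmatched.
assert unmatched_shortcodes(SHORTCODES, PATTERN) == []
```

Returning the list of failures rather than a boolean makes it easier to see which emoji a regenerated pattern has missed.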