Handle unicode grapheme clusters #668

ludwig-jb · 2023-07-24T15:43:30Z

Some characters that render as a single symbol can span a sequence of several Unicode code points (e.g., flag emojis, a letter combined with a diacritic, Hangul syllables).

Such composites are called grapheme clusters in the Unicode standard. This patch introduces recognition of extended grapheme cluster boundaries, which makes it possible to iterate over rendered characters. Without this, the user may see the cursor "stuck" inside a character for several keystrokes while it makes its way through each code point in the grapheme cluster.
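To make the mismatch concrete, here is a small Java sketch (not part of the patch; the class and method names are made up for illustration) that counts UTF-16 chars and code points for three strings that each render as a single symbol:

```java
final class CodePointCounts {
    /** Number of Unicode code points in the string (not UTF-16 chars). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String flag = "\uD83C\uDDE9\uD83C\uDDEA"; // two regional indicators, rendered as one flag
        String accented = "e\u0301";              // 'e' followed by COMBINING ACUTE ACCENT
        String hangul = "\u1100\u1161";           // leading consonant jamo + vowel jamo
        for (String s : new String[] {flag, accented, hangul}) {
            System.out.println(s.length() + " chars, " + codePoints(s) + " code points");
        }
    }
}
```

A cursor that advances one code point at a time needs two keystrokes to cross each of these strings, which is the "stuck" behaviour described above.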

The implementation follows the boundary search algorithm outlined in Technical Report 29 of the Unicode Standard [1] and was tested against the set of test cases provided by the Unicode Character Database [2].
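For readers unfamiliar with the report, the following Java sketch shows the general shape of such a boundary search. It is a deliberately reduced illustration, not the code from this patch: it covers only a handful of the rules (CR LF, a partial break after CR or LF, joining with combining marks and ZWJ, and pairing of regional indicators), approximates the Extend class with `Character.getType`, and ignores Hangul, Prepend, SpacingMark and emoji ZWJ sequences. All names are hypothetical.

```java
final class SimpleGraphemes {
    private static boolean isRegionalIndicator(int cp) {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    // Rough stand-in for the Extend and ZWJ classes; the real property data is broader.
    private static boolean isExtendLike(int cp) {
        int type = Character.getType(cp);
        return cp == 0x200D
            || type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Returns the UTF-16 offset of the first cluster boundary after start. */
    static int nextBoundary(String text, int start) {
        if (start >= text.length()) return text.length();
        int prev = text.codePointAt(start);
        int pos = start + Character.charCount(prev);
        int regionalRun = isRegionalIndicator(prev) ? 1 : 0;
        while (pos < text.length()) {
            int cur = text.codePointAt(pos);
            boolean join;
            if (prev == '\r' && cur == '\n') {
                join = true;                    // GB3: never break between CR and LF
            } else if (prev == '\r' || prev == '\n') {
                join = false;                   // GB4 (partial): break after CR or LF
            } else if (isExtendLike(cur)) {
                join = true;                    // GB9: never break before Extend or ZWJ
            } else if (isRegionalIndicator(prev) && isRegionalIndicator(cur)) {
                join = regionalRun % 2 == 1;    // GB12/GB13: regional indicators join in pairs
            } else {
                join = false;                   // GB999: otherwise break
            }
            if (!join) break;
            regionalRun = isRegionalIndicator(cur) ? regionalRun + 1 : 0;
            prev = cur;
            pos += Character.charCount(cur);
        }
        return pos;
    }
}
```

Calling `nextBoundary` repeatedly, starting from offset 0 and feeding each result back in, walks a line one rendered character at a time for the cases this subset covers.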

In addition to the grapheme cluster boundary search itself, this patch adds an `isExtendedPictographic` function that reports whether a given code point has the Unicode "Extended_Pictographic" property, which is required to determine grapheme cluster boundaries correctly. This method is implemented natively in JDK 21, so ours can be removed once we start targeting that version.
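One possible way to bridge the gap until JDK 21 is the baseline, sketched in Java below, is to look up `Character.isExtendedPictographic(int)` reflectively and fall back to a bundled implementation when it is absent. This is only an illustration of the idea, not what the patch does; `PictographicLookup` and the fallback predicate are hypothetical names.

```java
import java.lang.reflect.Method;
import java.util.function.IntPredicate;

final class PictographicLookup {
    /**
     * Prefers Character.isExtendedPictographic(int) when the runtime provides it (JDK 21+),
     * otherwise returns the supplied fallback predicate.
     */
    static IntPredicate resolve(IntPredicate fallback) {
        try {
            Method m = Character.class.getMethod("isExtendedPictographic", int.class);
            return cp -> {
                try {
                    return (Boolean) m.invoke(null, cp);
                } catch (ReflectiveOperationException e) {
                    return fallback.test(cp); // should not happen for a public static method
                }
            };
        } catch (NoSuchMethodException e) {
            return fallback; // running on a JDK older than 21
        }
    }
}
```

Resolving the predicate once and reusing it keeps the reflective lookup out of the hot path.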

The Extended_Pictographic property is stored as a bitmap. I considered building a similar map for the code point classification in the grapheme cluster boundary search, which could yield better performance, but that would add at least another half a megabyte of data to the JAR, so I settled for a chain of `if` checks.

That can be reconsidered and shouldn't be difficult to change if the performance impact turns out to be noticeable (it didn't show in my simple tests).
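As a rough illustration of the trade-off, a one-bit-per-code-point bitmap can be sketched in Java as follows. The layout, class name and example range are assumptions made for this sketch; the patch's actual storage format may differ.

```java
final class CodePointBitmap {
    private static final int CODE_POINTS = 0x110000; // U+0000..U+10FFFF
    private final long[] words = new long[CODE_POINTS / 64];

    /** Marks every code point in [first, last] as having the property. */
    void setRange(int first, int last) {
        for (int cp = first; cp <= last; cp++) {
            words[cp >>> 6] |= 1L << (cp & 63);
        }
    }

    /** Constant-time membership test: one array read and one bit mask. */
    boolean contains(int cp) {
        return cp >= 0 && cp < CODE_POINTS && (words[cp >>> 6] & (1L << (cp & 63))) != 0;
    }

    int sizeInBytes() {
        return words.length * Long.BYTES;
    }
}
```

At one bit per code point the uncompressed table takes 0x110000 / 8 = 139,264 bytes. A classification with more than two classes needs several bits per code point (four bits would come to 557,056 bytes), which is in line with the half-megabyte estimate above, although the exact layout that was considered is not stated.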

A few functions in vim-engine were adjusted to handle grapheme clusters (such as computing the horizontal offset and keeping the cursor from moving past the end of the line).
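The kind of adjustment involved can be sketched in Java as below. This is not the vim-engine code: `ClusterCursor` and its methods are hypothetical, and the sketch assumes the cluster boundaries of a line are already available as a sorted array of UTF-16 offsets that starts at 0 and ends at the line length.

```java
import java.util.Arrays;

final class ClusterCursor {
    /** Column of a char offset: the index of the cluster that contains it. */
    static int column(int[] boundaries, int offset) {
        int idx = Arrays.binarySearch(boundaries, offset);
        // When not found, binarySearch returns -(insertionPoint) - 1;
        // the containing cluster starts at insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Snaps an offset to the start of its cluster, and never past the last cluster. */
    static int clamp(int[] boundaries, int offset) {
        int lineEnd = boundaries[boundaries.length - 1];
        int lastStart = boundaries[Math.max(0, boundaries.length - 2)];
        int bounded = Math.max(0, Math.min(offset, lineEnd));
        int snapped = boundaries[column(boundaries, bounded)];
        return Math.min(snapped, lastStart);
    }
}
```

For a line made of "e" plus a combining accent, a flag emoji and "x", the boundaries are {0, 2, 6, 7}: an offset of 4 (inside the flag) maps to column 1 and is snapped back to offset 2, and an offset at or beyond the end of the line is pulled back to 6, the start of the last cluster.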

[1]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2]: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt

AlexPl292

Hi, I like this change! I've tried to do something similar, but I didn't know this was so well defined by Unicode.

Also, you mentioned that the code is tested against test cases provided by Unicode. Was this manual testing? Is there a chance to include these test cases in the IdeaVim tests?

vim-engine/src/main/kotlin/com/maddyhome/idea/vim/common/Graphemes.kt

vim-engine/src/main/kotlin/com/maddyhome/idea/vim/common/ExtendedPictographics.kt

vim-engine/src/main/kotlin/com/maddyhome/idea/vim/common/Graphemes.kt

GraphemeBreakTest.txt was downloaded from the Unicode Character Database [0]. Changes to build.gradle.kts were required to stop `gradlew test` from regenerating the resources with empty JSON objects, and to add a dependency.

[0]: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt
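For context, each data line in GraphemeBreakTest.txt lists hexadecimal code points separated by ÷ (a boundary is expected here) or × (no boundary), followed by a `#` comment. A minimal Java sketch of turning one such line into a test string plus its expected boundary offsets could look like this; the class name and structure are illustrative only and are not taken from the IdeaVim tests.

```java
import java.util.ArrayList;
import java.util.List;

final class BreakTestLine {
    final String text;
    final List<Integer> boundaries; // expected boundaries as UTF-16 offsets into text

    private BreakTestLine(String text, List<Integer> boundaries) {
        this.text = text;
        this.boundaries = boundaries;
    }

    /** Parses one line of the file; returns null for blank or comment-only lines. */
    static BreakTestLine parse(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) return null;
        StringBuilder sb = new StringBuilder();
        List<Integer> breaks = new ArrayList<>();
        for (String token : data.split("\\s+")) {
            if (token.equals("\u00F7")) {
                breaks.add(sb.length());            // division sign: boundary expected here
            } else if (!token.equals("\u00D7")) {
                // Anything other than the multiplication sign is a hex code point.
                sb.appendCodePoint(Integer.parseInt(token, 16));
            }
        }
        return new BreakTestLine(sb.toString(), breaks);
    }
}
```

A test can then run the boundary search over `text` and compare the offsets it finds with `boundaries`, one assertion per line of the file.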

AlexPl292 reviewed Aug 1, 2023


ludwig-jb requested a review from AlexPl292 August 11, 2023 20:17

AlexPl292 merged commit 068d610 into JetBrains:master Aug 14, 2023

ludwig-jb deleted the feat/grapheme-clusters branch August 22, 2023 11:51
