Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle unicode grapheme clusters #668

Merged
merged 2 commits into from
Aug 14, 2023

Conversation

ludwig-jb
Copy link
Contributor

Some characters that render as a single symbol can span over a sequence of several unicode code points (e.g., flag emojis, combination of a letter and a diacritic, Hangul syllables, etc.).

Such composites are called grapheme clusters in the unicode standard, and this patch introduces recognition of extended grapheme cluster boundaries, allowing to iterate over rendered characters. Without this, user may observe the cursor being "stuck" inside a character for several keystrokes, while it's making its way through each code point in the grapheme cluster.

The implementation follows the boundaries search algorithm outlined in the technical report 29 of the Unicode standard1. The implementation was tested against the set of test cases provided by the unicode character database2.

Additionally to the grapheme cluster boundaries search itself, this patch adds isExtendedPictographic function, that answers whether the given code point has a unicode "Extended_Pictographic" property, which is required to correctly determine grapheme cluster boundaries. This method is implemented natively in the JDK 21 and can be removed once we start targeting that version.

Extended_Pictographic property is stored as a bitmap. I was considering making a similar map for the code point classification in the grapheme cluster boundary search implementation, which could yield better performance, but that would require adding another half a megabyte (at least) of data into the JAR and I've settled for the bunch of ifs way.

That is something that can be reconsidered and shouldn't be difficult to change if the impact on performance would be noticeable (in my simple tests it didn't show).

A few functions in the vim-engine were adjusted to handle grapheme clusters (such as getting the horizontal offset and adjusting the cursor to not reach over the end of the line).

Some characters that render as a single symbol can span over a sequence
of several unicode code points (e.g., flag emojis, combination of a
letter and a diacritic, Hangul syllables, etc.).

Such composites are called grapheme clusters in the unicode standard,
and this patch introduces recognition of extended grapheme cluster
boundaries, allowing to iterate over rendered characters. Without this,
user may observe the cursor being "stuck" inside a character for several
keystrokes, while it's making its way through each code point in the
grapheme cluster.

The implementation follows the boundaries search algorithm outlined in
the technical report 29 of the Unicode standard[1]. The implementation was
tested against the set of test cases provided by the unicode character
database[2].

Additionally to the grapheme cluster boundaries search itself, this
patch adds `isExtendedPictographic` function, that answers whether the
given code point has a unicode "Extended_Pictographic" property, which
is required to correctly determine grapheme cluster boundaries. This
method is implemented natively in the JDK 21 and can be removed once we
start targeting that version.

Extended_Pictographic property is stored as a bitmap. I was considering
making a similar map for the code point classification in the grapheme
cluster boundary search implementation, which could yield better
performance, but that would require adding another half a megabyte (at
least) of data into the JAR and I've settled for the bunch of `if`s way.

That is something that can be reconsidered and shouldn't be difficult to
change if the impact on performance would be noticeable (in my simple
tests it didn't show).

A few functions in the vim-engine were adjusted to handle grapheme
clusters (such as getting the horizontal offset and adjusting the cursor
to not reach over the end of the line).

[1]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2]: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt
Copy link
Member

@AlexPl292 AlexPl292 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I like this change! I've tried to do something similar, but I didn't know this thing is so good defined by unicode.

Also, you mentioned that the code is tested against test cases provided by unicode. Was this a manual testing? Is there is chance to inlude these test cases to the IdeaVim tests?

GraphemeBreakTest.txt was downloaded from the Unicode Character Database [0].

Changes to build.gradle.kts were required to stop `gradlew test` from
regenerating the resources with empty JSON objects. And adding a
dependency.

[0]: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt
@ludwig-jb ludwig-jb requested a review from AlexPl292 August 11, 2023 20:17
@AlexPl292 AlexPl292 merged commit 068d610 into JetBrains:master Aug 14, 2023
@ludwig-jb ludwig-jb deleted the feat/grapheme-clusters branch August 22, 2023 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants