Some characters that render as a single symbol span a sequence of several Unicode code points (e.g., flag emojis, a letter combined with a diacritic, or Hangul syllables).
Such composites are called grapheme clusters in the Unicode standard. This patch introduces recognition of extended grapheme cluster boundaries, which makes it possible to iterate over characters as they are rendered. Without this, the user may observe the cursor being "stuck" inside a character for several keystrokes while it makes its way through each code point in the grapheme cluster.
The implementation follows the boundary search algorithm outlined in Unicode Standard Annex #29 (Unicode Text Segmentation) [1]. It was tested against the set of test cases provided by the Unicode Character Database [2].
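To make the code point versus grapheme cluster distinction concrete, here is a minimal sketch that uses the JDK's stock `java.text.BreakIterator`, not the implementation from this patch. The class and method names (`GraphemeDemo`, `clusters`) are hypothetical, and the JDK's handling of more complex clusters (such as emoji sequences) may vary by version, so the example sticks to a letter plus a combining diacritic.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it is not part of the patch.
class GraphemeDemo {
    /** Splits text into user-perceived characters using the JDK's character BreakIterator. */
    static List<String> clusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary; each [start, end) span is one cluster.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "e\u0301x"; // 'e' + COMBINING ACUTE ACCENT, then 'x'
        System.out.println("code points: " + text.codePointCount(0, text.length()));
        System.out.println("clusters:    " + clusters(text).size());
    }
}
```

A cursor that advances one code point at a time needs two keystrokes to pass the accented letter here, which is the "stuck" behaviour described above; advancing per cluster needs one.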
In addition to the grapheme cluster boundary search itself, this patch adds an `isExtendedPictographic` function that answers whether a given code point has the Unicode "Extended_Pictographic" property, which is required to determine grapheme cluster boundaries correctly. This method is implemented natively in JDK 21 and can be removed once we start targeting that version.
The Extended_Pictographic property is stored as a bitmap. I considered building a similar map for code point classification in the grapheme cluster boundary search, which could yield better performance, but that would require adding at least another half a megabyte of data to the JAR, so I settled for a chain of `if`s. That can be reconsidered and shouldn't be difficult to change if the performance impact turns out to be noticeable (in my simple tests it wasn't).
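As a rough illustration of the bitmap idea, the sketch below stores one bit per code point in a `long[]`. It is an assumption about the general technique, not the layout used in this patch: the class name `CodePointBitmap` is hypothetical, and `sample()` marks only a hand-picked subset of code points (U+00A9 and the Emoticons block), not the full Extended_Pictographic property.

```java
// Hypothetical sketch of a one-bit-per-code-point lookup table; not the patch's data layout.
final class CodePointBitmap {
    private static final int MAX_CODE_POINT = 0x10FFFF;
    // 17408 longs, i.e. about 136 KiB for a flat bitmap over the whole code space.
    private final long[] words = new long[(MAX_CODE_POINT >> 6) + 1];

    /** Marks every code point in [first, last] as having the property. */
    void setRange(int first, int last) {
        for (int cp = first; cp <= last; cp++) {
            words[cp >> 6] |= 1L << (cp & 63);
        }
    }

    /** Constant-time membership test: one array read, one shift, one mask. */
    boolean contains(int codePoint) {
        if (codePoint < 0 || codePoint > MAX_CODE_POINT) {
            return false;
        }
        return (words[codePoint >> 6] & (1L << (codePoint & 63))) != 0;
    }

    /** Illustrative subset only: COPYRIGHT SIGN and the Emoticons block. */
    static CodePointBitmap sample() {
        CodePointBitmap map = new CodePointBitmap();
        map.setRange(0x00A9, 0x00A9);
        map.setRange(0x1F600, 0x1F64F);
        return map;
    }
}
```

A binary property fits in one bit per code point. A multi-valued classification needs several bits per code point, which is where the size trade-off against a chain of range checks comes from.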
A few functions in the vim-engine were adjusted to handle grapheme clusters (such as getting the horizontal offset and adjusting the cursor to not reach over the end of the line).
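The kind of adjustment involved can be sketched as follows. This is not the vim-engine code: `CursorSnap` and `snapToClusterStart` are hypothetical names, and the sketch relies on the JDK's `BreakIterator` instead of the boundary search added by this patch. It moves an offset back to the start of the cluster that contains it and keeps the cursor on the last cluster of the line when the offset would reach past the end.

```java
import java.text.BreakIterator;

// Hypothetical helper illustrating cluster-aware cursor clamping; not the vim-engine API.
class CursorSnap {
    /**
     * Returns the start offset of the grapheme cluster containing {@code offset},
     * clamped so that the result never lies past the start of the line's last cluster.
     */
    static int snapToClusterStart(String line, int offset) {
        if (line.isEmpty()) {
            return 0;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(line);
        it.last();                     // position at the end of the line
        int lastStart = it.previous(); // boundary where the last cluster begins
        if (offset >= lastStart) {
            return lastStart;
        }
        if (offset <= 0) {
            return 0;
        }
        // Inside a cluster: step back to the boundary that opens it.
        return it.isBoundary(offset) ? offset : it.preceding(offset);
    }
}
```

For the line `"e\u0301xy"`, an offset of 1 (pointing at the combining accent) snaps back to 0, and any offset of 3 or more lands on 3, the start of the final `y`.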