Handle unicode grapheme clusters
Some characters that render as a single symbol span a sequence of
several Unicode code points (e.g., flag emojis, a letter combined with a
diacritic, Hangul syllables).

Such composites are called grapheme clusters in the Unicode standard.
This patch introduces recognition of extended grapheme cluster
boundaries, which makes it possible to iterate over rendered characters.
Without this, the user may observe the cursor being "stuck" inside a
character for several keystrokes while it makes its way through each
code point in the grapheme cluster.
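
To make the "stuck cursor" effect concrete, here is a minimal sketch (not part of the patch) that lists the offsets a cursor would visit when stepping one code point at a time through the family emoji used in the tests. The helper name `codePointOffsets` is made up for this illustration.

```kotlin
// The family emoji from the tests: WOMAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, GIRL.
// It renders as one symbol but is 7 code points and 11 UTF-16 code units.
val family = "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC67"

// Hypothetical helper: start offsets of every code point in the string.
fun codePointOffsets(s: String): List<Int> {
  val offsets = mutableListOf<Int>()
  var i = 0
  while (i < s.length) {
    offsets += i
    i += Character.charCount(s.codePointAt(i))
  }
  return offsets
}

fun main() {
  println(family.length)                            // 11
  println(family.codePointCount(0, family.length))  // 7
  // A cursor that moves by code point stops at each of these offsets,
  // even though only offsets 0 and 11 are grapheme cluster boundaries.
  println(codePointOffsets(family))                 // [0, 2, 3, 5, 6, 8, 9]
}
```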

The implementation follows the boundary search algorithm outlined in
Unicode Standard Annex #29 (UAX #29)[1]. It was tested against the set
of test cases provided by the Unicode Character Database[2].
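
Each data line in GraphemeBreakTest.txt lists hexadecimal code points separated by ÷ (a boundary) or × (no boundary), followed by a `#` comment. The sketch below shows one way such a line could be turned into a string plus the expected boundary offsets. It is an illustration only: `BreakCase` and `parseBreakTestLine` are hypothetical names, not the test harness used by this patch.

```kotlin
// Hypothetical container: the decoded text and the expected boundary offsets
// (in UTF-16 code units, which is what String offsets use on the JVM).
data class BreakCase(val text: String, val boundaries: List<Int>)

// Parses one line such as "÷ 0020 × 0308 ÷ 0020 ÷  # comment".
// Returns null for blank or comment-only lines.
fun parseBreakTestLine(line: String): BreakCase? {
  val data = line.substringBefore('#').trim()
  if (data.isEmpty()) return null

  val text = StringBuilder()
  val boundaries = mutableListOf<Int>()
  for (token in data.split(Regex("\\s+"))) {
    when (token) {
      "\u00F7" -> boundaries += text.length   // ÷ : a break is expected here
      "\u00D7" -> Unit                        // × : no break expected here
      else -> text.appendCodePoint(token.toInt(16))
    }
  }
  return BreakCase(text.toString(), boundaries)
}

fun main() {
  val case = parseBreakTestLine("\u00F7 0020 \u00D7 0308 \u00F7 0020 \u00F7\t# example")
  println(case?.boundaries) // [0, 2, 3]
}
```

A test would then walk the decoded text with the boundary finder under test and compare the offsets it visits against `boundaries`.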

In addition to the grapheme cluster boundary search itself, this
patch adds an `isExtendedPictographic` function that answers whether a
given code point has the Unicode "Extended_Pictographic" property, which
is required to determine grapheme cluster boundaries correctly. This
check is available natively in JDK 21 and can be removed once we start
targeting that version.

The Extended_Pictographic property is stored as a bitmap. I considered
building a similar map for code point classification in the grapheme
cluster boundary search, which could yield better performance, but that
would add at least half a megabyte of data to the JAR, so I settled for
a chain of `if`s instead.

This can be reconsidered and shouldn't be difficult to change if the
performance impact turns out to be noticeable (it didn't show in my
simple tests).
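
As a rough illustration of the bitmap layout, the sketch below packs a set of code points into a `LongArray` and looks them up with the same divide/modulo scheme. The names `buildBitmap` and `contains` are hypothetical, and the extra guard against negative input is an addition of this sketch, not something the patch does.

```kotlin
// Hypothetical helper: packs code points into a bitmap where code point `cp`
// lives in the long at index cp / 64, at bit cp % 64 (least significant bit first).
fun buildBitmap(codePoints: Iterable<Int>): LongArray {
  val max = codePoints.maxOrNull() ?: return LongArray(0)
  val bitmap = LongArray(max / 64 + 1)
  for (cp in codePoints) {
    bitmap[cp / 64] = bitmap[cp / 64] or (1L shl (cp % 64))
  }
  return bitmap
}

// Hypothetical lookup mirroring the divmod approach; it additionally rejects
// negative values so that the array index can never be out of range.
fun contains(bitmap: LongArray, codePoint: Int): Boolean {
  if (codePoint < 0 || codePoint >= bitmap.size * 64) return false
  return (bitmap[codePoint / 64] and (1L shl (codePoint % 64))) != 0L
}

fun main() {
  // U+00A9 (copyright sign) and U+00AE (registered sign) as sample input.
  val bitmap = buildBitmap(listOf(0xA9, 0xAE))
  println(bitmap[2])              // 72567767433216
  println(contains(bitmap, 0xA9)) // true
  println(contains(bitmap, 0xAA)) // false
}
```

With these two sample code points the third long comes out as 72567767433216, the same value as the third entry of the array added in this patch.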

A few functions in the vim-engine were adjusted to handle grapheme
clusters (such as getting the horizontal offset and adjusting the cursor
so that it does not go past the end of the line).

[1]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2]: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt
ludwig-jb authored and AlexPl292 committed Aug 14, 2023
1 parent c2ebacd commit 41177b9
Showing 7 changed files with 418 additions and 14 deletions.
@@ -159,4 +159,27 @@ class MotionEndActionTest : VimTestCase() {
""".trimIndent()
doTest(keys, before, after, VimStateMachine.Mode.COMMAND, VimStateMachine.SubMode.NONE)
}

@TestWithoutNeovim(SkipNeovimReason.NON_ASCII)
@OptionTest(VimOption(TestOptionConstants.keymodel, doesntAffectTest = true))
fun `test motion end with multiple code point grapheme cluster at the end`() {
val keys = listOf("<End>")
val before = """
Lorem Ipsum
I found it in ${c}a legendary land👩‍👩‍👧‍👧
consectetur adipiscing elit
Sed in orci mauris.
Cras id tellus in ex imperdiet egestas.
""".trimIndent()
val after = """
Lorem Ipsum
I found it in a legendary land${c}👩‍👩‍👧‍👧
consectetur adipiscing elit
Sed in orci mauris.
Cras id tellus in ex imperdiet egestas.
""".trimIndent()
doTest(keys, before, after, VimStateMachine.Mode.COMMAND, VimStateMachine.SubMode.NONE)
}
}
@@ -102,4 +102,20 @@ class MotionLeftActionTest : VimTestCase() {
enterCommand("set whichwrap=h")
}
}

@TestWithoutNeovim(SkipNeovimReason.NON_ASCII)
@Test
fun `test simple motion multiple code point grapheme cluster`() {
doTest(
"h",
"""
Oh, hi Mark
You are my👩‍👩‍👧‍👧${c} favourite customer
""".trimIndent(),
"""
Oh, hi Mark
You are my${c}👩‍👩‍👧‍👧 favourite customer
""".trimIndent(),
)
}
}
@@ -189,6 +189,32 @@ class MotionRightActionTest : VimTestCase() {
)
}

@TestWithoutNeovim(SkipNeovimReason.NON_ASCII)
@OptionTest(VimOption(TestOptionConstants.virtualedit, doesntAffectTest = true))
fun `test simple motion multiple code point grapheme cluster`() {
doTest(
"l",
"""
Lorem Ipsum
I found it in a legendar${c}👩‍👩‍👧‍👧 land
consectetur adipiscing elit
Sed in orci mauris.
Cras id tellus in ex imperdiet egestas.
""".trimIndent(),
"""
Lorem Ipsum
I found it in a legendar👩‍👩‍👧‍👧${c} land
consectetur adipiscing elit
Sed in orci mauris.
Cras id tellus in ex imperdiet egestas.
""".trimIndent(),
VimStateMachine.Mode.COMMAND,
VimStateMachine.SubMode.NONE,
)
}

@TestWithoutNeovim(SkipNeovimReason.NON_ASCII)
@OptionTest(VimOption(TestOptionConstants.virtualedit, doesntAffectTest = true))
fun `test simple motion czech`() {
@@ -8,6 +8,7 @@

package com.maddyhome.idea.vim.api

import com.maddyhome.idea.vim.common.Graphemes
import com.maddyhome.idea.vim.common.TextRange
import java.nio.CharBuffer

@@ -146,7 +147,12 @@ public fun VimEditor.getLineEndOffset(line: Int, allowEnd: Boolean): Int {
} else {
val startOffset: Int = getLineStartOffset(line)
val endOffset: Int = getLineEndOffset(line)
endOffset - if (startOffset == endOffset || allowEnd) 0 else 1

if (startOffset == endOffset || allowEnd) {
endOffset
} else {
Graphemes.prev(text(), endOffset) ?: endOffset
}
}
}

@@ -12,6 +12,7 @@ import com.maddyhome.idea.vim.action.motion.leftright.TillCharacterMotionType
import com.maddyhome.idea.vim.command.Argument
import com.maddyhome.idea.vim.command.MotionType
import com.maddyhome.idea.vim.command.OperatorArguments
import com.maddyhome.idea.vim.common.Graphemes
import com.maddyhome.idea.vim.common.TextRange
import com.maddyhome.idea.vim.handler.Motion
import com.maddyhome.idea.vim.handler.Motion.AbsoluteOffset
@@ -23,6 +24,7 @@ import com.maddyhome.idea.vim.helper.isEndAllowed
import com.maddyhome.idea.vim.helper.isEndAllowedIgnoringOnemore
import com.maddyhome.idea.vim.helper.mode
import kotlin.math.abs
import kotlin.math.absoluteValue
import kotlin.math.min
import kotlin.math.sign

@@ -108,29 +110,24 @@ public abstract class VimMotionGroupBase : VimMotionGroup {
allowPastEnd: Boolean,
allowWrap: Boolean,
): Motion {
val oldOffset = caret.offset.point
var diff = 0
val text = editor.text()
val sign = sign(count.toFloat()).toInt()
for (pointer in IntProgression.fromClosedRange(0, count - sign, sign)) {
val textPointer = oldOffset + pointer
diff += if (textPointer < text.length && textPointer >= 0) {
// Actual char size can differ from 1 if unicode characters are used (like 🐔)
Character.charCount(Character.codePointAt(text, textPointer))
} else {
1
}
val oldOffset = caret.offset.point
var current = oldOffset
for (i in 0 until count.absoluteValue) {
val newOffset = if (count > 0) Graphemes.next(text, current) else Graphemes.prev(text, current)
current = newOffset ?: break
}

val offset = if (allowWrap) {
var newOffset = oldOffset + sign * diff
var newOffset = current
val oldLine = editor.offsetToBufferPosition(oldOffset).line
val newLine = editor.offsetToBufferPosition(newOffset).line
if (!allowPastEnd && count > 0 && oldLine == newLine && newOffset == editor.getLineEndForOffset(newOffset)) {
++newOffset // here we skip the \n char and move caret one char forward
}
editor.normalizeOffset(newOffset, allowPastEnd)
} else {
editor.normalizeOffset(caret.getLine().line, oldOffset + (sign * diff), allowPastEnd)
editor.normalizeOffset(caret.getLine().line, current, allowPastEnd)
}

return offset.toMotionOrError()
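
The stepping loop in the hunk above can be tried in isolation with the sketch below. The `Graphemes` implementation is not shown in this part of the diff, so `nextBoundary` and `prevBoundary` are hypothetical stand-ins that step by code point rather than by grapheme cluster; only the loop shape (advance until the count is used up or no further boundary exists) mirrors the patch.

```kotlin
import kotlin.math.abs

// Stand-in boundary finders (assumption): they step by code point, NOT by
// grapheme cluster, and return null when there is nowhere left to go.
fun nextBoundary(text: CharSequence, offset: Int): Int? =
  if (offset >= text.length) null
  else offset + Character.charCount(Character.codePointAt(text, offset))

fun prevBoundary(text: CharSequence, offset: Int): Int? =
  if (offset <= 0) null
  else offset - Character.charCount(Character.codePointBefore(text, offset))

// Moves `count` boundaries right (count > 0) or left (count < 0) from `start`,
// stopping early at either end of the text.
fun moveBy(text: CharSequence, start: Int, count: Int): Int {
  var current = start
  repeat(abs(count)) {
    val candidate = if (count > 0) nextBoundary(text, current) else prevBoundary(text, current)
    current = candidate ?: return current
  }
  return current
}

fun main() {
  val text = "a\uD83D\uDC14b"      // 'a', a chicken emoji (2 code units), 'b'
  println(moveBy(text, 0, 2))      // 3: past 'a' and the emoji
  println(moveBy(text, 3, -1))     // 1: back over the whole emoji
  println(moveBy(text, 0, 10))     // 4: stops at the end of the text
}
```

Swapping the two stand-ins for real grapheme cluster boundary functions changes what counts as one step without touching the loop.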
@@ -0,0 +1,81 @@
package com.maddyhome.idea.vim.common

/**
* Answers whether a given code point has the Unicode Extended_Pictographic property.
*
* NOTE: this is a part of the Java 21 API. Can be removed once we start targeting that version.
*/
internal fun isExtendedPictographic(codePoint: Int): Boolean {
// Outside of the bitmap.
if (codePoint >= bitmap.size * 64) return false

val idx = codePoint / 64
val bit = codePoint % 64
val bucket = bitmap[idx]

return (bucket and (1L shl bit)) != 0L
}

// A bitmap that maps a code point to whether it has the Extended_Pictographic property.
// Code point `cp` is stored in the long at index `cp / 64`, at bit `cp % 64` counting from the
// least significant bit. This way a simple divmod is enough to compute both indices.
private val bitmap = longArrayOf(
0, 0, 72567767433216, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1152921504606846976, 512, 0, 0, 144115205255725056, 0,
6597135826944, 0, 0, 0, 0, 0, 1099712954368, 0, 256, 508904558869643264, 0, 0, 0, 4, 0, 0, 18027592649015296,
8646911284551352321, -524353, -1, -65473, -1, 6756508085255999, 1065163968656, -9223090553273450496, 0, 0, 0, 0, 0,
13510798882111488, 0, 0, 0, 0, 0, 0, 0, 402653408, 2162688, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2306124484190404608, 0, 0, 0, 0, 0, 0, 0, 0, 0, 41943040, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 140737488412672, -4610577710706589696, -35184237985792,
274877906943, -577445914654736386, -512, -1, -1, -1, -1, -1, 576460752303423487, -1, -1, -1, -1,
4611686018427387903, -64, -1, -1, -1, 65535, -1, -1, 0, -4503599627370496, 0, -2097152, 61440, 4227923712,
-70368744112384, -1, -576460752303427584, -65, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, 4611686018427387903)