Unicode with two code points is counted as two characters instead of one #1230

benbro · 2017-01-02T00:51:26Z

quill.getLength() counts U+1F600 as two characters instead of one.
When doing OT with a language other than JavaScript it is counted as one glyph which breaks sync.

Steps for Reproduction

Visit http://codepen.io/anon/pen/RKbyVz
See quill length and emoji length in the console

Expected behavior:
quill length should be 2
emoji length should be 1

Actual behavior:
quill length is 3
emoji length is 2

Parchment uses this.text.length which counts the code points and not the glyphs.

This blog post explains the issue and suggests a way to count symbols.
In most cases this is enough:

function countSymbols(string) {
	return Array.from(string).length;
}

To support all available strings we need:

function countSymbols(string) {
	var normalized = string.normalize('NFC');
	return Array.from(normalized).length;
}

Array.from is not supported in IE11 but I think it's OK to use string.length in that case.

Changing the length function in Parchment will break other things like Delta operations.
Is there something we can do?

Platforms:
Firefox 50 on Windows 7

Version:
1.1.8

jhchen · 2017-01-02T02:02:54Z

I had run into this previously and that blog post and ended up deciding to count an emoji as two characters. I'm open to reconsideration but the main issue is "😀".length == 2. If Quill treats this differently, it would be incompatible with the outside Javascript world. Consider this example:

var initialText = getFromServer();
quill.setText(initialText);
quill.insertText(initialText.length, "!");

Now Quill happens to recover from out of bounds indexes but the point is users understandably will and should be able to use Javascript string methods to calculate positions. And they will be wrong depending on if initialText contains unicode characters.

Ultimately it seems to be a choice between compatibility between languages that treat 😀 as having length one or having length two. Javascript is in the latter group and it seems best for a Javascript library like Quill to also be a part of this group. If there is a way to work seamlessly in both that would be ideal but I don't believe this is possible at the moment.

Given this, it seems the responsibility lies in the libraries or applications that crosses the boundary between these two groups to account for Unicode length differences.

benbro · 2017-01-02T11:27:09Z

If you try to delete the emoji you need to press Backspace twice. That's probably related to the fact that length=2.

I agree that changing the string length will break user expectations.

Do you think it's reasonable to filter all glyphs that have two code points in the Clipboard module? Possibly with a custom matcher.
Is it safe to assume that it will only remove symbols like emoji but won't affect 'real' languages?
I tried to find a regex that can do that but there are still two code points symbols that I can't remove like: 🤓🤔🥓🥓🤔🤔🤔🤔🤔

Do you have a suggestion how to make python count the emoji as two symbols instead of one?
I'm currently using utf-8 and it worked well until I got the emoji.

benbro · 2017-01-02T11:42:47Z

Probably out of scope, closing.
Thanks

jhchen · 2017-01-02T22:47:23Z

The two backspace requirement seems like a bug Quill could probably fix, especially considering the weird intermediate character in Chrome.

I'm not sure I understand the motivation behind filtering in the clipboard module?

I'm not familiar with how Python does handles string encoding unfortunately.

benbro · 2017-01-03T00:11:50Z

I'm not sure I understand the motivation behind filtering in the clipboard module?

If all the symbols that use two or more code points in JavaScript aren't essential for writing like emoji, I don't mind filtering them on paste. I'm not sure if it's enough because the user might be able to enter them using the keyboard.

I don't have better suggestion than the current implementation. I'll try to normalize the text on the server to match the encoding and length in the browser.
I'll also try to override the length method of the text blot locally but it'll probably break other things in Quill.

related #1230

sachinrekhi · 2017-04-18T00:13:36Z

@benbro I'm running into this incompatibility issue as well where Quill/Javascript treat an emoji with length 2, but my Python back-end treat an emoji as length 1. This is completely breaking Delta reconciliation due to the lack of matching lengths.

Wanted to check in to see if you ever figured out a workaround given it sounds like you ran into the same issue? At this point I'm thinking of just taking your suggestion earlier of preventing pasting emoji into the editor through a custom matcher. I'd obviously love to find a way to support emoji, but not seeing any easy solutions to match the emoji length issue between Javascript/Python.

Only possibility I've thought of is downgrading my server Python version to a "narrow unicode build", which I think might then treat emoji as length 2. But seems like a total hack and not sure of other implications of using "narrow build" of Python instead of "wide build"

benbro · 2017-04-18T00:35:06Z

@sachinrekhi I didn't fix the issue.
One option is to use UCS-2 encoding on the server. This will enable full compatibility with Quill but I feel uncomfortable with using a weird encoding just to support emoji which are non important for my use case.

Another option is to override the length calculation in Quill/Parchment to 'normalize' the length of astral symbols but I'm not sure if it's possible without deep changes.

Do you know how to prevent inserting emoji from the keyboard and the clipboard?
It might be the simplest solution.

sachinrekhi · 2017-04-18T04:18:16Z

Regarding preventing emoji, my approach was going to be the following (haven't implemented yet, so let me know if you know of any gotchas):

create a custom clipboard matcher that identifies emoji and then removes it from the applied delta
to prevent insertion from the keyboard, I was thinking you can listen for the text-change event, inspect the delta, identify if it contains an emoji and if it does, then apply a new delta that simply deletes the emoji. this might have a perf hit since you are inspecting every text-change, but you would rarely actually apply a delta since most of the time people won't be typing in emoji, so maybe the perf hit won't be bad.

In both cases you need to be able to reliably detect an emoji character. I was planning on using the emoji-regex JS library in npm, which seems to be able to identify the vast majority of emoji I've thrown on it (and it apparently uses the unicode emoji standard so it should support all emoji). Library here: https://github.com/mathiasbynens/emoji-regex

When thinking through this approach, I thought maybe I could leverage the same overall approach to in-fact support emoji (instead of simply deleting it). The idea would be that when I match an emoji via the clipboard or text-change event, instead of deleting the emoji, I replace it with a custom in-line blot that itself renders the emoji character. The benefit being that custom embeds always have a length of 1 to Quill, so would get over the issue of client & server treating the lengths differently. Might be overkill just for emoji support, but I think it should work right?

sachinrekhi · 2017-04-18T05:32:25Z

FYI, I've got a clipboard matcher now removing emoji on paste:

import emojiRegex from 'emoji-regex'

// remove pasted emojis
quill.clipboard.addMatcher(Node.TEXT_NODE, function(node, delta) {
    let index = 0
    let regex = emojiRegex()
    let match = null
    let composer = new Delta()
    if (delta.ops.length > 0) {
        while ((match = regex.exec(delta.ops[0].insert)) !== null) {
            composer.retain(match.index - index)
            composer.delete(match[0].length)
            index = regex.lastIndex
        }
        return delta.compose(composer)
    } else {
        return delta
    }
})

benbro · 2017-04-18T07:51:54Z

@sachinrekhi the actual problem is that "😀".length == 2.
Maybe we can filter it by checking the length without complex regex:

[..."aaa😀bbb"].filter(function(str) {return str.length == 1}).join('');

sachinrekhi · 2017-04-18T19:58:48Z

That's true, that should work and will also deal with any potential non-emojis that are also of length 2. It doesn't solve for the case if we actually want to support emoji, since in that case you actually want to know whether it's an emoji vs non-emoji.

sachinrekhi · 2017-04-18T20:02:26Z

FYI, using my original approach, I also have a 'text-change' event handler that removes keyboard inserted emoji:

// add at the bottom of your existing Quill 'text-change' event handler

// remove keyboard inserted emoji
let hasEmoji = false
let composer = new Delta()
for (let i=0; i < delta.ops.length; i++) {
    if (delta.ops[i].insert !== undefined) {
        if (typeof delta.ops[i].insert === 'string') {
            let regex = emojiRegex()
            let index = 0
            let match = null
            while ((match = regex.exec(delta.ops[i].insert)) !== null) {
                hasEmoji = true
                composer.retain(match.index - index)
                composer.delete(match[0].length)
                index = regex.lastIndex
            }
            composer.retain(delta.ops[i].insert.length - index)
        } else {
            composer.insert(delta.ops[i].insert)
        }
    } else if (delta.ops[i].retain !== undefined) {
        composer.retain(delta.ops[i].retain)
    } else if (delta.ops[i].delete !== undefined) {
        composer.delete(delta.ops[i].delete)
    } 
}
if (hasEmoji) {
    quill.updateContents(composer, Quill.sources.USER)
    setTimeout(()=> {
        quill.focus()
    }, 1)
}

sachinrekhi · 2017-04-18T21:00:48Z

And finally, here is the Emoji Embed which ensures a constant length of 1 for emoji. Now instead of just deleting the emoji in the above clipboard matcher and text-change event handler, I just replace the emoji character with this emoji embed. And I now have consistent support for emoji with length=1.

Setting contentEditable to false hurts the user experience in terms of consistently showing the cursor before and after the emoji. But without it, you end up accidentally inserting characters in the emoji embed itself when you attempt to type before/after the embed.

I think the best solution is not to use contentEditable and instead to add an update(mutations) method on the Emoji Embed class that takes any text typed at the beginning/end of the embed and converts them to inserts outside of the embed. But this was non-trivial. I thought I remember seeing a codepen for a mentions implementation that handled the update(mutations) as I'm suggesting, but I can no longer find the Quill issue it was mentioned in...

import Quill from 'quill'

let Embed = Quill.import('blots/embed')

export default class Emoji extends Embed {
    static create(value) {
        let node = super.create(value)

        node.dataset.emoji = value
        node.innerHTML = value
        node.setAttribute('contentEditable', 'false')

        return node
    }

    static value(node) {
        return node.dataset.emoji
    }
}
Emoji.blotName = 'emoji'
Emoji.tagName = 'SPAN'
Emoji.className = 'ql-emoji'

Quill.register('formats/emoji', Emoji)

benbro · 2017-04-18T21:09:38Z

@sachinrekhi Thank you for sharing the code. Is there a chance you can create a CodePen with all the parts?

What non-emojis symbols have length 2?

sachinrekhi · 2017-04-18T22:44:11Z

Sure, I'll try to put together a codepen.

I actually don't definitely know if there are any non-emoji symbols with a length of 2. I just guessed there might be.

sachinrekhi · 2017-04-18T23:15:39Z

Here you go: http://codepen.io/sachinrekhi/pen/eWpajZ

benbro · 2017-04-18T23:26:55Z

Thanks for the code.

The regex is large and emoji are rare. Testing the string length will probably be much faster.
Do you think it's better to test the regex only if the string has wrong length?

if([...str].length != str.length) {
  // test regex here
  // if not emoji filter it to prevent OT issue on the server?
}

sachinrekhi · 2017-04-19T00:27:33Z

That's a good perf improvement, thanks. Just updated the codepen for both the clipboard matcher and text-change handler to start with this test.

benbro · 2017-04-19T00:35:59Z

If there are non-emoji symbols with a length of 2 it will still break OT.
Maybe it'll be safer to add this after replacing emoji?
Again, assuming that emoji are rare and the performance impact will be negligible.

[...str].filter(function(str) {return str.length == 1}).join('');

khaled4vokalz · 2020-08-18T09:40:46Z

@jhchen ,
even though "ő".length === 2 quill considers it as length 1 when we do quill.getLength() but for an emoji e.g. "😀".length === 2 quill considers it as length of 2 when we do quill.getLength(). Why is this discrepancy in place?

kodeschooler · 2021-04-13T19:08:51Z

Guys, I've found this thread to be very helpful. I used the approach that @sachinrekhi used for my case. It works great. My only problem is that the regex is blocking symbols too, and there are just 3 of them that I need to allow customers to use. They are: ©, ®, ™

I don't actually know how these symbols map to the regex I'm using. I don't know what part of the regex is responsible for blocking them and therefore what to change it to. Can anyone help me identify how to change my regex below that will allow these 3? I simply don't know how these things work.

The regex I'm using is this: /(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ude3a]|[\ud83c\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])/g

Thanks for any input.

josephg · 2024-11-22T16:09:32Z

Hi all - just so everyone understands, this isn't a problem specifically with emoji. Its a problem caused by the fact that there's lots of ways to count "the number of characters" in a string:

What javascript, C#, Java, .. does: Count the number of surrogate pairs when using UTF-16
What python3, Go, Rust, .. does: Count the number of unicode codepoints
What rust and others also do: Count the number of bytes when the string is encoded to UTF-8
What UI code wants: Count the number of grapheme clusters

Eg, the polar bear emoji ('🐻‍❄') is one grapheme cluster. If you hit backspace, it deletes the whole thing. But its made up of 3 unicode characters (🐻 + ZWJ + ❄️) so len('🐻‍❄') in python or [...'🐻‍❄'].length in javascript return 3. But '🐻‍❄'.length in javascript returns 4 (because of how JS stores strings). And it takes 10 bytes in UTF8 to store.

All of these "ways to count a length" just so happen to be the same for English text, since ascii was invented by americans and unicode inherits ascii. But they're not the same for anything beyond U+ff00 or something iirc. For example, fun symbols like 𝄞, ©, ®, ™, and several actual non-english languages will all break if you assume all those "string length" values are the same.

If the quill javascript library counts characters using javascript's string.length, thats incompatible with how python3 and other languages deal with strings.

benbro closed this as completed Jan 2, 2017

jhchen added a commit that referenced this issue Jan 10, 2017

check for astral symbols

c55149a

related #1230

benbro mentioned this issue Sep 18, 2017

Support Symbols in the Astral Plane jhchen/fast-diff#1

Closed

natevw mentioned this issue Feb 22, 2020

Unicode combining characters treated individually when editing #2942

Closed

This was referenced Feb 3, 2024

discrepancy in unicode character lengths in quill #3138

Closed

Invalid delta Length and also Delta.slice() breaks latin unicode char in two parts e.g. ő #3074

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode with two code points is counted as two characters instead of one #1230

Unicode with two code points is counted as two characters instead of one #1230

benbro commented Jan 2, 2017

jhchen commented Jan 2, 2017

benbro commented Jan 2, 2017

benbro commented Jan 2, 2017

jhchen commented Jan 2, 2017

benbro commented Jan 3, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017 •

edited

Loading

sachinrekhi commented Apr 19, 2017

benbro commented Apr 19, 2017

khaled4vokalz commented Aug 18, 2020 •

edited

Loading

kodeschooler commented Apr 13, 2021

josephg commented Nov 22, 2024

Unicode with two code points is counted as two characters instead of one #1230

Unicode with two code points is counted as two characters instead of one #1230

Comments

benbro commented Jan 2, 2017

jhchen commented Jan 2, 2017

benbro commented Jan 2, 2017

benbro commented Jan 2, 2017

jhchen commented Jan 2, 2017

benbro commented Jan 3, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

sachinrekhi commented Apr 18, 2017

benbro commented Apr 18, 2017 • edited Loading

sachinrekhi commented Apr 19, 2017

benbro commented Apr 19, 2017

khaled4vokalz commented Aug 18, 2020 • edited Loading

kodeschooler commented Apr 13, 2021

josephg commented Nov 22, 2024

benbro commented Apr 18, 2017 •

edited

Loading

khaled4vokalz commented Aug 18, 2020 •

edited

Loading