Unicode with two code points is counted as two characters instead of one #1230
I had run into this, and that blog post, previously and ended up deciding to count an emoji as two characters. I'm open to reconsideration, but the main issue is:
var initialText = getFromServer();
quill.setText(initialText);
quill.insertText(initialText.length, "!");
Now Quill happens to recover from out-of-bounds indexes, but the point is that users understandably will, and should be able to, use JavaScript string methods to calculate positions. And they will be wrong depending on how Quill counts such characters. Ultimately it seems to be a choice between compatibility with languages that treat 😀 as having length one and those that treat it as having length two. JavaScript is in the latter group, and it seems best for a JavaScript library like Quill to also be a part of this group. If there is a way to work seamlessly in both that would be ideal, but I don't believe this is possible at the moment. Given this, it seems the responsibility lies with the libraries or applications that cross the boundary between these two groups to account for Unicode length differences. |
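Accounting for the difference at the boundary can be sketched as a pair of index conversions. This is only an illustration, assuming the other side counts Unicode code points; the function names are hypothetical and not part of Quill.

```javascript
// Convert an index measured in UTF-16 code units (JavaScript's string.length
// convention) to an index measured in Unicode code points.
function utf16IndexToCodePointIndex(text, utf16Index) {
  // Spreading a string iterates by code point, so the spread length is the
  // number of code points in the prefix. An index that falls inside a
  // surrogate pair counts the lone high surrogate as one code point.
  return [...text.slice(0, utf16Index)].length;
}

// Convert a code point index back to a UTF-16 code unit index.
function codePointIndexToUtf16Index(text, codePointIndex) {
  return [...text].slice(0, codePointIndex).join("").length;
}

const text = "a\u{1F600}b"; // "a😀b"
console.log(text.length);                         // 4
console.log(utf16IndexToCodePointIndex(text, 3)); // 2
console.log(codePointIndexToUtf16Index(text, 2)); // 3
```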
If you try to delete the emoji you need to press Backspace twice. That's probably related to the fact that length=2. I agree that changing the string length will break user expectations. Do you think it's reasonable to filter all glyphs that have two code points in the Clipboard module? Possibly with a custom matcher. Do you have a suggestion how to make python count the emoji as two symbols instead of one? |
Probably out of scope, closing. |
The two-backspace requirement seems like a bug Quill could probably fix, especially considering the weird intermediate character in Chrome. I'm not sure I understand the motivation behind filtering in the clipboard module? I'm not familiar with how Python handles string encoding, unfortunately. |
If all the symbols that use two or more code units in JavaScript are, like emoji, not essential for writing, I don't mind filtering them on paste. I'm not sure if it's enough, because the user might be able to enter them using the keyboard. I don't have a better suggestion than the current implementation. I'll try to normalize the text on the server to match the encoding and length in the browser. |
@benbro I'm running into this incompatibility issue as well, where Quill/JavaScript treat an emoji as length 2 but my Python back-end treats an emoji as length 1. This is completely breaking Delta reconciliation due to the lack of matching lengths. Wanted to check in to see if you ever figured out a workaround, given it sounds like you ran into the same issue? At this point I'm thinking of just taking your earlier suggestion of preventing pasting emoji into the editor through a custom matcher. I'd obviously love to find a way to support emoji, but I'm not seeing any easy solutions to the emoji length mismatch between JavaScript and Python. The only possibility I've thought of is downgrading my server Python version to a "narrow unicode build", which I think might then treat emoji as length 2. But that seems like a total hack, and I'm not sure of the other implications of using a "narrow build" of Python instead of a "wide build".
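To make the mismatch concrete, here is a minimal sketch that measures the same list of insert ops under both conventions. It assumes text inserts are counted by `string.length` and embeds count as 1 on the JavaScript side, and that the server counts code points; the function names are hypothetical, and only `insert` ops are handled.

```javascript
// Length of a list of insert ops, counting text in UTF-16 code units
// (what string.length returns) and each embed as 1.
function lengthInCodeUnits(ops) {
  return ops.reduce(
    (total, op) => total + (typeof op.insert === "string" ? op.insert.length : 1),
    0
  );
}

// Same ops, but counting text in Unicode code points instead.
function lengthInCodePoints(ops) {
  return ops.reduce(
    (total, op) => total + (typeof op.insert === "string" ? [...op.insert].length : 1),
    0
  );
}

const ops = [{ insert: "Hi \u{1F600}" }, { insert: "\n" }]; // "Hi 😀" plus newline
console.log(lengthInCodeUnits(ops));  // 6
console.log(lengthInCodePoints(ops)); // 5
```

Any retain or delete offset computed past the emoji on one side will be off by one on the other, which is the reconciliation failure described above.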
@sachinrekhi I didn't fix the issue. Another option is to override the length calculation in Quill/Parchment to 'normalize' the length of astral symbols but I'm not sure if it's possible without deep changes. Do you know how to prevent inserting emoji from the keyboard and the clipboard? |
Regarding preventing emoji, my approach was going to be the following (haven't implemented yet, so let me know if you know of any gotchas):
In both cases you need to be able to reliably detect an emoji character. I was planning on using the emoji-regex JS library from npm, which seems to be able to identify the vast majority of emoji I've thrown at it (and it apparently uses the Unicode emoji standard, so it should support all emoji). Library here: https://github.com/mathiasbynens/emoji-regex When thinking through this approach, I thought maybe I could leverage the same overall approach to in fact support emoji (instead of simply deleting them). The idea would be that when I match an emoji via the clipboard or text-change event, instead of deleting the emoji, I replace it with a custom inline blot that itself renders the emoji character. The benefit is that custom embeds always have a length of 1 to Quill, so this would get around the issue of client and server treating the lengths differently. Might be overkill just for emoji support, but I think it should work, right? |
FYI, I've got a clipboard matcher now removing emoji on paste:
|
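A minimal sketch of such a paste filter, not the original matcher: instead of emoji-regex it uses a plain surrogate-pair test, so it removes every astral symbol rather than emoji specifically. `stripAstralFromOps` is a hypothetical helper, and the Quill wiring in the comment assumes a Quill 1.x instance named `quill` and its Delta class.

```javascript
// Matches one astral symbol, i.e. one UTF-16 surrogate pair.
const ASTRAL = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Return a copy of the ops with astral symbols removed from text inserts.
// Inserts that end up empty are dropped; embeds are left untouched.
function stripAstralFromOps(ops) {
  return ops
    .map((op) =>
      typeof op.insert === "string"
        ? { ...op, insert: op.insert.replace(ASTRAL, "") }
        : op
    )
    .filter((op) => op.insert !== "");
}

// Hypothetical wiring (assumption, not run here):
// quill.clipboard.addMatcher(Node.TEXT_NODE, (node, delta) =>
//   new Delta(stripAstralFromOps(delta.ops)));

console.log(stripAstralFromOps([{ insert: "aaa\u{1F600}bbb" }]));
// [ { insert: 'aaabbb' } ]
```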
@sachinrekhi the actual problem is that "😀".length == 2.
[..."aaa😀bbb"].filter(function(str) {return str.length == 1}).join(''); |
That's true, that should work and will also deal with any potential non-emojis that are also of length 2. It doesn't solve for the case if we actually want to support emoji, since in that case you actually want to know whether it's an emoji vs non-emoji. |
FYI, using my original approach, I also have a 'text-change' event handler that removes keyboard inserted emoji:
|
And finally, here is the Emoji Embed, which ensures a constant length of 1 for emoji. Now instead of just deleting the emoji in the above clipboard matcher and text-change event handler, I replace the emoji character with this emoji embed. And I now have consistent support for emoji with length=1. Setting contentEditable to false hurts the user experience in terms of consistently showing the cursor before and after the emoji. But without it, you end up accidentally inserting characters in the emoji embed itself when you attempt to type before/after the embed. I think the best solution is not to use contentEditable and instead to add an update(mutations) method on the Emoji Embed class that takes any text typed at the beginning/end of the embed and converts it to inserts outside of the embed. But this was non-trivial. I thought I remembered seeing a codepen for a mentions implementation that handled update(mutations) as I'm suggesting, but I can no longer find the Quill issue it was mentioned in...
|
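The length effect of the embed approach can be sketched without a blot at all. The following is not the Emoji Embed class itself, only an illustration of the op shapes: each astral symbol becomes an embed insert, which counts as length 1. The `emoji` key is a hypothetical blot name, and a multi-code-point emoji sequence would produce several embeds in this simplified version.

```javascript
// Split text into Delta-style ops, replacing each astral symbol with an
// embed op. "emoji" is a hypothetical blot name, not a built-in format.
function textToOpsWithEmbeds(text) {
  const ops = [];
  let buffer = "";
  for (const symbol of text) {       // string iteration goes by code point
    if (symbol.length === 2) {       // surrogate pair, so an astral symbol
      if (buffer) ops.push({ insert: buffer });
      buffer = "";
      ops.push({ insert: { emoji: symbol } });
    } else {
      buffer += symbol;
    }
  }
  if (buffer) ops.push({ insert: buffer });
  return ops;
}

console.log(JSON.stringify(textToOpsWithEmbeds("ok \u{1F600}!")));
// [{"insert":"ok "},{"insert":{"emoji":"😀"}},{"insert":"!"}]
```

Counted with text as `string.length` and embeds as 1, this document has length 5 on both client and server.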
@sachinrekhi Thank you for sharing the code. Is there a chance you can create a CodePen with all the parts? What non-emoji symbols have length 2? |
Sure, I'll try to put together a codepen. I actually don't definitely know if there are any non-emoji symbols with a length of 2. I just guessed there might be. |
Here you go: http://codepen.io/sachinrekhi/pen/eWpajZ |
Thanks for the code. The regex is large and emoji are rare. Testing the string length will probably be much faster.
if ([...str].length != str.length) {
  // test regex here
  // if not emoji filter it to prevent OT issue on the server?
} |
That's a good perf improvement, thanks. Just updated the codepen for both the clipboard matcher and text-change handler to start with this test. |
If there are non-emoji symbols with a length of 2 it will still break OT.
[...str].filter(function(str) {return str.length == 1}).join(''); |
@jhchen , |
Guys, I've found this thread to be very helpful. I used the approach that @sachinrekhi used for my case. It works great. My only problem is that the regex is blocking symbols too, and there are just 3 of them that I need to allow customers to use. They are: ©, ®, ™ I don't actually know how these symbols map to the regex I'm using. I don't know what part of the regex is responsible for blocking them and therefore what to change it to. Can anyone help me identify how to change my regex below that will allow these 3? I simply don't know how these things work. The regex I'm using is this: /(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ude3a]|[\ud83c\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])/g Thanks for any input. |
Hi all - just so everyone understands, this isn't a problem specifically with emoji. It's a problem caused by the fact that there are lots of ways to count "the number of characters" in a string:
E.g., the polar bear emoji ('🐻‍❄️') is one grapheme cluster. If you hit backspace, it deletes the whole thing. But it's made up of 3 unicode characters (🐻 + ZWJ + ❄️). All of these "ways to count a length" just so happen to be the same for English text, since ASCII was invented by Americans and Unicode inherits ASCII. But they're not the same beyond that: UTF-8 byte counts diverge above U+007F, and UTF-16 code units diverge from code points above U+FFFF. For example, fun symbols like 𝄞, ©, ®, ™, and several actual non-English languages will all break if you assume all those "string length" values are the same. If the Quill JavaScript library counts characters using JavaScript's string.length, that's incompatible with how Python 3 and other languages deal with strings. |
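As a rough illustration of how far these counts can diverge, the sketch below measures one string four ways. The polar bear is written with explicit escapes and includes the variation selector U+FE0F after the snowflake, so it comes out as four code points here. `countAll` is a hypothetical helper, and the grapheme count is only computed where the runtime provides `Intl.Segmenter`.

```javascript
// Polar bear: bear + zero-width joiner + snowflake + variation selector.
const polarBear = "\u{1F43B}\u200D\u2744\uFE0F";

function countAll(text) {
  const counts = {
    utf8Bytes: new TextEncoder().encode(text).length,
    utf16CodeUnits: text.length, // what JavaScript's string.length reports
    codePoints: [...text].length,
  };
  // Intl.Segmenter is not available in every runtime, so treat it as optional.
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    counts.graphemeClusters = [...segmenter.segment(text)].length;
  }
  return counts;
}

console.log(countAll("abc"));     // every count is 3
console.log(countAll(polarBear)); // utf8Bytes 13, utf16CodeUnits 5, codePoints 4
```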
quill.getLength() counts U+1F600 as two characters instead of one.
When doing OT with a language other than JavaScript it is counted as one glyph which breaks sync.
Steps for Reproduction
Expected behavior:
quill length should be 2
emoji length should be 1
Actual behavior:
quill length is 3
emoji length is 2
Parchment uses this.text.length, which counts UTF-16 code units and not the glyphs.
This blog post explains the issue and suggests a way to count symbols.
In most cases this is enough:
To support all available strings we need:
Array.from is not supported in IE11 but I think it's OK to use string.length in that case.
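A minimal sketch of that fallback, assuming code point counting is wanted where available. `countSymbols` is a hypothetical name, and this version does not merge combining marks or joined emoji sequences into a single symbol.

```javascript
// Count code points where Array.from exists; otherwise fall back to
// string.length, which counts UTF-16 code units.
function countSymbols(text) {
  if (typeof Array.from === "function") {
    return Array.from(text).length;
  }
  return text.length;
}

console.log(countSymbols("a\uD83D\uDE00")); // 2 ("a" plus U+1F600)
console.log("a\uD83D\uDE00".length);        // 3
```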
Changing the length function in Parchment will break other things like Delta operations.
Is there something we can do?
Platforms:
Firefox 50 on Windows 7
Version:
1.1.8