Support Symbols in the Astral Plane #1
Thank you for the detailed report.
Have you found this to be an issue in practice? The diff already breaks up characters within words, for example, but that has not been an issue. Performance would be the other concern. The diff has to work on very large strings (pages and pages of text). Right now it scales with the number of changes, so it works very well even on large strings. I would be concerned that this is lost when converting a large string to an array and back. Intuitively this overhead is not trivial, but numbers would trump everything.
Great questions and valid concerns. In its current form the generated diff, even with surrogate pairs being broken up, remains accurate and can be applied to the original string to produce the second string.
I did encounter an issue when passing the diff object to a server over a websocket (npm ws). The diff goes through JSON.stringify (client side) and then JSON.parse (server side). On parsing, when only half of a surrogate pair is present, it is parsed into the replacement character � (\uFFFD or 65533) rather than the original char code.

```
"🐶" // is '\uD83D\uDC36' or 55357 56374
"🐯" // is '\uD83D\uDC2F' or 55357 56367

// produces diff of:
[ [0,"\uD83D"], [-1,"\uDC36"], [1,"\uDC2F"] ]

// after JSON.stringify on client & JSON.parse on server:
[ [0,"\uFFFD"], [-1,"\uFFFD"], [1,"\uFFFD"] ]
```

This does seem unusual, because if you run JSON.stringify and then JSON.parse locally, with the surrogate halves broken up, the char codes still remain accurate. However, once passed over the websocket, lone surrogate halves are parsed as \uFFFD. When the surrogate pairs are sent together and go through the stringify / parse process, they remain accurate and valid. I don't yet know if this is a protocol-level decision or possibly an issue with the ws module.
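The local-vs-over-the-wire difference can be reproduced without a socket. The lossy step is the UTF-8 encode that must happen before bytes hit the network; exactly where ws performs it is an assumption on my part, but the encoding behavior itself is easy to demonstrate:

```javascript
// A lone surrogate: the first half of "🐶" ('\uD83D\uDC36').
const half = '\uD83D';

// Local round-trip: JSON escapes (or passes through) the lone
// surrogate, so JSON.parse recovers the original char code.
const local = JSON.parse(JSON.stringify(half));
console.log(local.charCodeAt(0) === 0xD83D); // true

// Encoding to UTF-8: an unpaired surrogate cannot be represented,
// so the encoder substitutes U+FFFD, the replacement character.
const wire = Buffer.from(half, 'utf8').toString('utf8');
console.log(wire.charCodeAt(0) === 0xFFFD); // true
```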
In regards to performance...
Your intuition is generally correct. The extra overhead of parsing each String into an Array of code-point symbols is non-trivial, and it really shows itself on strings of 1000 characters or fewer.

Interestingly, and much to my surprise, there seem to be some cases where fast-diff-astral actually performs slightly faster, even with the extra overhead of the upfront String-to-Array conversion (presumably due to better performance on Arrays than Strings?). Even more interesting, those are cases where the String lengths are greater than 6000 characters. So even with the large upfront cost of conversion, the better performance on arrays apparently made up for the conversion loss. However, it isn't all great: there are only minimal gains on very long strings but large losses on shorter ones. So depending on what one is working with and the number of iterations being performed, the differences can really add up.

I put together a little script to performance-test fast-diff vs fast-diff-astral. You can try it yourself and play with the number of iterations / lengths if you'd like. Performance results (similar across multiple runs):

All that said, when running with larger iteration numbers I occasionally saw the fast-diff-astral gains on long strings diminish, and I don't know what to make of that. Possibly with the new ES6 String spread operator, the upfront overhead of converting a string into an array of symbols would be minimized and a speed increase could be gained. (Just out of curiosity, I will try Babel's transpiled version of the String spread operator to see if it performs better than the very manual version built in right now, though I would imagine the actual ES6 language-level implementation would eventually be even faster.)
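For reference, the ES6 route mentioned above is a one-liner, since string iteration (used by spread and Array.from) walks code points rather than code units:

```javascript
// Spread iterates by code point, so a surrogate pair stays one symbol.
const symbols = [...'a🐶b'];
console.log(symbols);        // [ 'a', '🐶', 'b' ]
console.log(symbols.length); // 3 — while 'a🐶b'.length is 4
```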
That's interesting that there are cases where the array runs faster. One testing variant to focus on, though, is diffs where there are a small number of changes. This allows applications to be very responsive and emit change events very often, even on large documents or strings. In fact, the more responsive the application is, the smaller the number of batched changes, and the faster the overall system. If we make this modification to your test (the else branch):

```
function generateRandomStrings(seed, length, count, changes = 3) {
  var strings = [];
  var random = seedrandom(seed);
  for (var i = 0; i <= count; ++i) {
    if (strings.length === 0) {
      // First string: random characters of the requested length.
      var chars = [];
      for (var l = 0; l < length; ++l) {
        var letter = ALPHABET.substr(Math.floor(random() * ALPHABET.length), 1);
        chars.push(letter);
      }
      strings.push(chars.join(''));
    } else {
      // Subsequent strings: the previous string with a few single-character changes.
      var next = strings[strings.length - 1];
      for (var j = 0; j < changes; ++j) {
        var index = Math.floor(random() * next.length);
        var letter = ALPHABET.substr(Math.floor(random() * ALPHABET.length), 1);
        next = next.slice(0, index) + letter + next.slice(index + 1);
      }
      strings.push(next);
    }
  }
  return strings;
}
```

You can see where the performance diverges, regardless of document size.
The diff library has an optimization for 1-2 changes, so it's even faster in those cases.
Interesting... It makes a lot of sense that your optimization targets a small number of changes. I did keep your optimizations for a binary search of common suffixes and prefixes, and they work quickly with the arrays as well. That said, in those cases the majority of the diffing time for fast-diff-astral is spent on the upfront String-to-Array conversion, and I doubt there could be any solution that makes that overhead disappear. I think there are still some use cases for performing diffs that don't split the surrogate pairs, but ultimately it will likely never be as fast. Therefore, it probably makes most sense to keep these libraries separate, since for many use cases the speed of small diffs on large strings is most important and splitting surrogate pairs has little consequence. I'll also add a note to the README.
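For readers following along, the common-prefix optimization referred to here works roughly like the sketch below (illustrative only, not fast-diff's actual internals; the same idea applies to the common suffix):

```javascript
// Binary search for the length of the longest common prefix of two
// strings, so a diff of a large document with few changes only has to
// examine the changed middle.
function commonPrefixLength(a, b) {
  // Quick rejection: no common prefix at all.
  if (!a.length || !b.length || a[0] !== b[0]) return 0;
  var lo = 0;
  var hi = Math.min(a.length, b.length);
  var mid = hi;
  var start = 0;
  while (lo < mid) {
    if (a.slice(start, mid) === b.slice(start, mid)) {
      // The prefix extends at least to mid; search the upper half.
      lo = mid;
      start = lo;
    } else {
      // Mismatch before mid; search the lower half.
      hi = mid;
    }
    mid = Math.floor((hi - lo) / 2) + lo;
  }
  return mid;
}

console.log(commonPrefixLength('Hello world', 'Hello there')); // 6
```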
Sounds good, thanks for looking into this.
I did run into this issue in practice when the back end is not JavaScript. In those cases, converting between UTF-16 and UTF-8 will fail if strings contain partial surrogate pairs. So this is now fixed in fast-diff itself.
Does this fix affect slab/quill#1230 ? |
It should help the approach where the back end tries to count characters in UTF-16, but I'm not sure it completely solves it yet.
I'm currently using UTF-8 on the server and I don't see a difference after this change.
To be clear, when it reaches the back end it has to be UTF-8 off the network, but the counting part in the OT code would treat strings like UTF-16. Consider this:

Initial Doc: "Go 🚀!"

The server will get the Delta, where the insert string length is only 1 now if using the native string length functions. But note the other integers are still 2. Because the back end can't know we are deleting two characters that are both part of the same surrogate pair, I would claim the only way for the back end to work correctly with this delta is if it counted strings using UTF-16. It presumably has the initial doc in memory; however it counts, it needs to count the string as having a length of 6, not 5, or else the delete(2) instruction will fail.
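A sketch of the point above, using the example doc; the exact delta isn't shown, so the offset (start = 3) alongside delete(2) is my reading of it:

```javascript
const doc = 'Go 🚀!';      // 6 UTF-16 code units, 5 code points
const start = 3, del = 2;  // assumed offsets, in UTF-16 units

// Counting in UTF-16 units (as JS does) removes the whole pair:
const utf16Result = doc.slice(0, start) + doc.slice(start + del);
console.log(utf16Result); // 'Go !'

// A back end counting code points sees length 5, and the same
// numbers now delete the wrong characters:
const points = [...doc];
const cpResult = points.slice(0, start)
  .concat(points.slice(start + del))
  .join('');
console.log(cpResult); // 'Go '
```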
I use Python on the back end of my project. After reading everyone's discussion, I still haven't found a solution. How can I ensure that JavaScript and Python compute the same emoji length? Thank you
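One way to get the two sides to agree (a sketch, assuming you control the counting on both ends): either count code points in JS, or count UTF-16 code units in Python.

```javascript
const s = 'Go 🚀!';
console.log(s.length);      // 6 — UTF-16 code units, JS native
console.log([...s].length); // 5 — code points, matches Python's len(s)

// Python equivalents, for comparison:
//   len(s)                           -> 5
//   len(s.encode('utf-16-le')) // 2  -> 6  (matches JS s.length)
```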
Hi @jhchen,
I've been happily enjoying your fast-diff library in a custom editor that I'm building.
However, I came across an issue when trying to support code points outside of the BMP. In its current implementation fast-diff checks each JS "character", or code unit. Because ES5 JS iterates over code units rather than code points, this creates issues with symbols outside the BMP.
A diff of two different symbols such as "🐶" and "🐯" results in the first code unit matching and the second code unit differing, thereby splitting up the surrogate pairs and potentially causing unwanted side effects.
Fast-Diff produces:

```
[ [ 0, '\uD83D' ], [ -1, '\uDC36' ], [ 1, '\uDC2F' ] ]
```

Rather than:

```
[ [ -1, '🐶' ], [ 1, '🐯' ] ]
```
I've modified your library to handle surrogate pairs as single "symbols". You can find my changes at fast-diff-astral. If you would like I can submit a pull request.
The method it uses is to break up the string into an array of symbols. Thereby allowing iteration over each symbol and full comparison of each surrogate pair as a single code point.
This may cause some extra processing overhead, which, along with the number of code changes involved, is why I broke it out into a separate library. That said, it still seems quick, but I haven't tested the speed differences.
I've chosen to use a custom method to find and group surrogate pairs into an array rather than the new ES6 String iterator, solely to avoid introducing a transpiler to the build process.
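An ES5-compatible way to do that grouping (a sketch of the general technique; the actual fast-diff-astral implementation may differ) is a regex that matches a full surrogate pair or, failing that, any single code unit:

```javascript
// High surrogate followed by low surrogate, or any single code unit.
var SYMBOL_RE = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;

function toSymbols(str) {
  // match() with a global regex returns all symbols in order,
  // or null for the empty string.
  return str.match(SYMBOL_RE) || [];
}

console.log(toSymbols('a🐶b')); // [ 'a', '🐶', 'b' ]
```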
Hope you get a chance to look over it. Let me know if you'd like a pull request.
Cheers!
Refs: Mathias Bynens's great article on JS Unicode support, if you haven't already read it.