-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
some remarks regarding the parser #42
Comments
Hello, Thanks for sharing with me these important updates to xterm.js, I will try to bring some of those, and will keep this ticket open until I sort those out. I do not think that a terminal should be implemented with UTF-16, specially not to recover from bad data (either cat /dev/random, or noise in the line, noise by background processes). So the terminal emulator should work without assuming that a correct mapping from the byte stream to unicode codepoints exists. My Swift port (incomplete, that is why it is not public, but I can make the partial port public if there is interest) further shows the difference in the low-level parsing implemented with bytes and the surfacing to grapheme clusters (which are better than codepoints, which is what the XtermSharp uses in the form of Thanks for linking to the changes in upstream, I should try to bring those in. Miguel. |
The emulator cannot work reliable without that notion. This is also the reason why all emulators nowadays treat unicode in terminal streams as "transport encoding", where the whole stream gets treated as of encoding XY, not only string parts. Grapheme support is an annoying issue, that gets more and more pressing since unicode 9+ releases. Currently none of the common emulators/terminal libs handle those, we are basically stuck with (slightly wrong) wcwidth tables. Also unicode has introduced some terminal unfriendly rules (all rules that depend on renderer caps), we def. need some formal specification in this field for terminal usage. |
xterm.js' parser was written with JS string (UTF-16 codepoints) in mind. Plz note that it will not work correctly with raw UTF8 bytes (for any byte with the 8th bit set). For this to work you would have to rewrite
Parse
in a way to correctly decode multibyte chars on the fly. Another and prolly easier approach is to simply decode the whole byte chunk beforehand to UTF16 or UTF32, this way the parser can stay mostly untouched. In xterm.js we went with UTF32 after some longish testing, mainly for performance reasons (with a small memory sacrifice compared to UTF16).Furthermore the older version of the parser has some loose ends, that are already fixed upstream:
Last but not least the parser has seen some code/perf improvements, but those are mostly JS engine related and less likely show a big impact in C#. On a sidenote - the somewhat complicated print handling with aggregating chunk slices shows a major perf improvement in JS (5 to 10 times compared to single char handling), which is not true for the original C parser. In C jumping for every single char into print is much faster, prolly due to heavy inlining/unrolling (JS JITs are still lacking in this regard). Not sure how C# would do here. In general the perf of the parser should not be an issue, in JS it handles typical sequences with 80-150 MB/s (the C variant only being twice as fast).
The text was updated successfully, but these errors were encountered: