-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bad characters occasionally displayed when writing lots of identical UTF-8 lines #386
Comments
I've filed this internally as MSFT:20812701 to make sure this gets looked at. Thanks for the report! |
Seems there is a buffer in place that causes this bug. If the buffer is filled but the bytes read end somewhere inside of a multi-byte UTF-8 character then the conversion of this buffer end fails and the conversion routine puts as many Unicode replacement characters in the output buffer as invalid bytes were at the end of the buffer. The same will happen with the next chunk that begins with the bytes left from the last multi-byte character of the former chunk. The example above shows three replacement characters which is the number of bytes the € sign takes in a UTF-8 character. |
In the hope that this will help to fix the issue I offer a minimalist C code that works around it. The code takes care of character boundaries. (In addition it discards the Byte Order Mark which is a different issue.)
In a CMD prompt you may use it like that:
|
I updated the test code above. Please see my comments related to the OUT_PROC macro. |
@miniksa This is a lot like what we saw with CMD and WSL bringing up emoji for the build demo. Marking as a bug. |
@DHowett-MSFT May I ask you - In case of emojis split UTF-16 surrogates, I guess? |
That's possible. There's also a chance that we are getting the 3-byte UTF-8 encoded emoji in two parts from the application. |
I'm convinced you guys from the terminal/console team are nice and will work around that kind of faulty input from an application, right? 😃 |
As best we can! |
I'm taking this because it feels similar to some of the other bugs I have assigned to me talking about U+FFFD. |
Even if there's an issue in conhost, I'm convinced there's an issue in Terminal as well. terminal/src/cascadia/TerminalConnection/ConhostConnection.cpp Lines 188 to 190 in 4c47631
We are absolutely cutting UTF-8 sequences in half right there. We can do fun games and cache partial sequences (partial code below), but it does add complexity. if ((buffer[iLastGoodByte] & 0b1'0000000) == 0b1'0000000)
{
// does the encoding fit?
const auto leadByte = buffer[iLastGoodByte];
auto sequenceLen = dwRead - iLastGoodByte;
if ((((leadByte & 0b111'00000) == 0b110'00000) && sequenceLen < 2) || // lead reqs 2 bytes
(((leadByte & 0b1111'0000) == 0b1110'0000) && sequenceLen < 3) || // lead reqs 3 bytes
(((leadByte & 0b11111'000) == 0b11110'000) && sequenceLen < 4)) // lead reqs 4 bytes
{
::memmove_s(utf8Partials, std::extent<decltype(utf8Partials)>::value, buffer + iLastGoodByte, dwRead - iLastGoodByte);
// ... do the rest ... I hacked it up to write (you can see where one It wouldn't be strictly easier to just |
Sorry for my ignorance but I don't understand this statement. Correct me if I'm wrong but passing a string that contains an incomplete codepoint to any other function causes an additional risk (just like |
Sorry, that’s what this proposal is! When we encounter an incomplete codepoint we need to wait for the next trip around the I/O loop to complete it. We can still send everything up to but not including the incomplete codepoint. My statement about |
This is almost certainly the right place to use @adiviness's terminal/src/host/utf8ToWideCharParser.hpp Lines 5 to 14 in 4c47631
To be done:
|
I was about to write a piece of code when I saw you already have a parser. But now that you added some more notes I finished it in the hope it is a little helpful.
It also still addresses the BOM issue (blank that appears in the first screen shot). Please tell me if you want me to report it separately. Attached you'll find a testing environment with the compiled app. EDIT: Code and attachment updated. |
I updated the latest test code again (with VT Processing enabled) in order to see how it might be related to the occurrence of � (Unicode Replacement Character U+FFFD) as they appear in #455 and #666. The replacement characters don't appear at the same positions as in the linked issues. But I guess this is due to different sizes of character buffers in the Console and Terminal source codes. |
It was not too hard to update the function that @DHowett-MSFT quoted in #386 (comment) |
Uhhhh, what's still wrong after your proposed test change? That would help me better narrow down which files might still be suspect. If you're looking inside the console host for problems, I'd personally breakpoint on the (I think you're trying to fix the console host side given the screenshots above of just CMD.exe in a plain conhost window.) |
Thanks you @miniksa The screenshots above have been made before I forked the source code. So, yes that was of course for conhost only. The function that @DHowett-MSFT quoted and that I tried to update is the interface between terminal and conhost (at least that's my poor understanding of how it is linked together, correct me if I'm wrong). You can find it in this commit:
Well, nothing changed yet. I didn't expect that this would already close this issue because the buffer size in this function is 4096 bytes but the U+FFFD character appears more often. I suspect there is a much smaller buffer involved somewhere else. However, the update also includes the BOM removal. And at least that is what I expected to see.
Some more screenshots might be helpful. Neither as stand-alone nor in the terminal.
... and with the updated code.
I'll try my best. Steffen |
Oh. The difference between Powershell/PWSH, CMD, and WSL is that they're likely all using different Write* method entrypoints. I forget which is which, but they're probably using a selection of The post-API-decoding-calls should be findable by searching the code for It's probably that CMD is using one and the others are using others and the one that CMD is using is broken somehow. The other thing to watch out for is whether they're calling But CMD.exe doesn't usually do any of that. It's probably using whatever codepage the system is in and writing everything into the A API without thinking about it. |
Yes I'm aware of that fact. If you look at the prompts in my screenshots you'll see that I always have to use
|
Um ... wait ... we already have a |
There cannot be concurrent calls in that we cannot service two calls at the exact same time because they're queued and the console is locked. However, there can be concurrent calls in that Process A can write a partial fragment, then Process B's request is the next in queue and can write some other unrelated fragment, then Process A is next to write its remainder. So a static cache will leave some sort of hole in the quest for a perfect solution. However, I'm pretty sure There is also another static cache for other codepages called That's been that way for decades and is probably why we ignored the concurrence problem when writing The caller of But |
We posted at the same time 😀 I didn't see the static declaration in the first place. However, thanks for the explanation and insight.
Yes, that's exactly what I believe is the reason for this problem. See what I did in my test code. If conhost gets the the data directly from cmd.exe (using its internal TYPE command) then we are facing this issue. If the same data is piped to my test app which takes care of incoming partials, and conhost gets chunks of data from the test app that don't contain partials anymore, then you don't have this problems anymore. So, the actual question is whether or not conhost is supposed to work around the deficiencies of applications like cmd.exe 😉 It would be nice, no doubt. |
Update: Next step for me is to review |
I'm of two minds on the BOM thing. The first is that it's impossible to tell really what is the beginning of stream without some heuristics or further state on when a process connected and if it's the first thing that it has said. Probably more complicated than we want to deal with. The second is that I can't think of a valid use for echoing a BOM to a Console/Terminal in the slightest and we should probably just wholesale discard them even though we should technically only do that as the "start of stream" rule. But instead of discarding them here, I think I'd rather pass it along further in the In short, leave the BOM alone for now and we'll revisit. For the other part, GOOD CALL. You're right, the |
Thanks for your advice regarding the BOM. Unfortunately updating the |
CMD.exe is an evil beast 👹 another test code:
It just reads from standard input and uses
@miniksa My conclusion is that CMD.exe detects the file type (likely using However, apparently there are some functions that should be updated even if these updates are not able to resolve the CMD.exe issues. Any advice on how to proceed? |
Yes. Very yes.
I wouldn't be surprised if it was converting. I can try to spend like 5 minutes on checking if this is true by reading the internal source for CMD.exe on Monday when I get back into work, but it's not worth spending much time on because we can't really fix anything in CMD without breaking the world.
Start a PR anyway with the proposed fixes to the other functions referencing this issue. It's OK if we only fix part of it. |
@miniksa |
That misaligned text ".majerus.net/" is just a terminal hyperlink on the ANSI logo using an ESC]8 control sequence. Conhost and Terminal both properly ignore it as they don't support hyperlinks yet. |
Kind of. The terminal does not entirely ignore it as it seems. I believe it's a matter of window updating. As soon as I resize the window (e.g. from maximized to normal) this remnant disappears miraculously. |
CMD only uses the console's wide-character API for console I/O, at least back to NT 4 in 1996, and probably all the way to NT 3.1. It does use the current console output codepage, but only indirectly. It uses it by default for encoding output written to disk files and pipes, and for decoding input read from files such as a line from a batch script or a line from the pipe in a |
People who are still coming across this topic: The initial post describes a behavior which is almost certainly a bug in cmd.exe. It seems to convert the read UTF-8 chunks with lacerated code points and in the consequence writes ruined UTF-16. No terminal application can heal it anymore. Thus, a fix for #1851 won't ever fix this. |
Your Windows build number: (Type
ver
at a Windows Command Prompt)Microsoft Windows [Version 10.0.18342.8]
(But this happens on all versions of Windows since at least Windows 7.)
What you're doing and what's happening: (Copy & paste specific commands and their output, or include screen shots)
In the cmd.exe console, set to code page 65001, type a large UTF-8 file containing hundreds of short lines with just the characters "ü€ü€ü€ü€ü€"
What's wrong / what should be happening instead:
95% of the lines are displayed correctly, but about 5% contain spurious characters. Ex: "ü���ü€ü€ü€ü€"
This happens with any application, not just cmd.exe's built-in TYPE command. So I suspect this is a bug in the UTF-8 to Unicode conversion routine in the console output handler.
For more details, and files and scripts for reproducing it, see the discussion thread on this forum: https://www.dostips.com/forum/viewtopic.php?f=3&t=9017
The text was updated successfully, but these errors were encountered: