Invalid characters in buffer #34
Comments
Your code:
var length = 256;
var buf = new Buffer(length);
for (var i = 0; i < length; ++i) {
buf[i] = i;
}
var string = buf.toString();
for(var i = 0; i < string.length; i++) {
console.log("i",i, string[i], string.charCodeAt(i));
}

The salient part:
var string = buf.toString(); // Defaults to UTF-8.

UTF-8 is a variable-length encoding where non-ASCII characters (i.e. code points > 127) are encoded as multi-byte sequences. Every byte in such a sequence has its high bit (bit 7, value 128) set. A byte with the high bit set is only valid as part of a well-formed lead-plus-continuation sequence, and the bytes 128 to 255 in ascending order never form one, so your code snippet creates invalid UTF-8 sequences. Those get replaced with U+FFFD (65533 in decimal, the Unicode replacement character) when converting the Buffer to a string. |
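To make the replacement behaviour concrete, here is a minimal sketch that decodes every byte value on its own as UTF-8 and records the resulting code unit. It assumes a current Node.js, so it uses `Buffer.from` rather than the `new Buffer` constructor from the snippet above; `decodeSingleBytes` is a hypothetical helper name, not something from Node or V8.

```javascript
// Hypothetical helper: decode each byte value 0..255 as a one-byte
// UTF-8 buffer and collect the first code unit of the resulting string.
function decodeSingleBytes() {
  const results = [];
  for (let i = 0; i < 256; i++) {
    const s = Buffer.from([i]).toString('utf8');
    results.push(s.charCodeAt(0));
  }
  return results;
}

const codes = decodeSingleBytes();
const REPLACEMENT = 0xFFFD; // U+FFFD, 65533 in decimal

// ASCII bytes (0..127) are valid one-byte sequences and survive unchanged.
console.log(codes[65] === 65);
// A lone byte with the high bit set is never a complete UTF-8 sequence.
console.log(codes[128] === REPLACEMENT);
console.log(codes[255] === REPLACEMENT);
```

If the goal is to keep a one-to-one mapping between byte values and character codes, decoding with `buf.toString('latin1')` instead of the UTF-8 default is one option.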
Thanks for the response! One more question: Where in the code is the part for checking whether a sequence is invalid UTF-8 and thus replaced with U+FFFD? |
Close, it's 49792:
Buffer([128+64+2, 128]).toString().charCodeAt(0) // 128
(128+64+2) * 256 + 128 // 49792
That's done by V8's built-in UTF-8 decoder. |
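The arithmetic in that comment can be checked by hand. In a two-byte UTF-8 sequence the lead byte has the bit pattern 110xxxxx and the continuation byte 10xxxxxx, and the payload bits are concatenated to form the code point. The sketch below is only an illustration of that bit layout; `decodeTwoByteUtf8` is a hypothetical helper and not how V8's decoder is actually written.

```javascript
// Hypothetical helper: decode a well-formed two-byte UTF-8 sequence by hand.
function decodeTwoByteUtf8(lead, cont) {
  // The lead byte must look like 110xxxxx, the continuation like 10xxxxxx.
  if ((lead & 0xE0) !== 0xC0 || (cont & 0xC0) !== 0x80) {
    throw new RangeError('not a two-byte UTF-8 sequence');
  }
  // Five payload bits from the lead byte, six from the continuation byte.
  const codePoint = ((lead & 0x1F) << 6) | (cont & 0x3F);
  // Code points below 0x80 must use the one-byte form (no overlong encodings).
  if (codePoint < 0x80) {
    throw new RangeError('overlong encoding');
  }
  return codePoint;
}

const lead = 128 + 64 + 2; // 0xC2
const cont = 128;          // 0x80

const byHand = decodeTwoByteUtf8(lead, cont);
const byNode = Buffer.from([lead, cont]).toString('utf8').charCodeAt(0);
const rawBytes = lead * 256 + cont; // the two bytes read as one 16-bit number

console.log(byHand, byNode, rawBytes);
```

So 49792 is the raw byte pair 0xC2 0x80 read as a single number, while 128 is the code point those two bytes encode.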
Yes, of course. I see my error now. |
I asked this in the following issue: nodejs/node#3982
However, I still have a few follow up questions:
Why are the characters from U+007F up to U+00A0 considered invalid chars? Because they are not displayable? Also, where in the code is the distinction made whether a character is invalid?