Invalid characters in buffer #34

Closed
Junkern opened this issue Nov 23, 2015 · 4 comments

Junkern commented Nov 23, 2015

I asked this in the following issue: nodejs/node#3982
However, I still have a few follow-up questions:
Why are the characters from U+007F up to U+00A0 considered invalid? Is it because they are not displayable? Also, where in the code is the decision made whether a character is invalid?

bnoordhuis (Member) commented

Your code:

var length = 256;
var buf = new Buffer(length);
for (var i = 0; i < length; ++i) {
    buf[i] = i;
}
var string = buf.toString();
for (var i = 0; i < string.length; i++) {
    console.log("i", i, string[i], string.charCodeAt(i));
}

The salient part:

var string = buf.toString();  // Defaults to UTF-8.

UTF-8 is a variable-length encoding where non-ASCII characters (i.e. code points above 127) are encoded as multi-byte sequences. Every byte in such a sequence has its high bit (bit 7, value 128) set.
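
To make the byte counts concrete, here is a small sketch that encodes a few sample characters and checks the high bit of every byte. It assumes a Node.js version that has `Buffer.from` (newer than the one this thread was written against), and `utf8Bytes` is just a hypothetical helper name for illustration.

```javascript
// Sample characters written as escapes so the source file's encoding doesn't matter:
// 'A' (U+0041), U+00E9, U+20AC and U+1F600.
const samples = ['A', '\u00e9', '\u20ac', '\u{1F600}'];

// Hypothetical helper: returns the UTF-8 bytes of a string as a plain array.
function utf8Bytes(str) {
  return Array.from(Buffer.from(str, 'utf8'));
}

for (const ch of samples) {
  const bytes = utf8Bytes(ch);
  // True only when every byte of the encoding has bit 7 (value 128) set.
  const allHighBitsSet = bytes.every((b) => (b & 0x80) !== 0);
  console.log(
    'U+' + ch.codePointAt(0).toString(16).toUpperCase(),
    'bytes:', bytes.length,
    'all high bits set:', allHighBitsSet
  );
}
```

The ASCII character takes one byte with the high bit clear; the others take two, three and four bytes, all with the high bit set.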

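What you're doing in your code snippet is creating invalid UTF-8 sequences: the bytes from 128 to 255 are written one after another and never form a complete multi-byte sequence. Those get replaced with U+FFFD (65533 in decimal, the Unicode replacement character) when converting the Buffer to string.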
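
A quick way to see this is to decode every byte value on its own and collect the ones that come back as U+FFFD. This is a variation on the snippet above rather than the original code: it decodes one byte at a time, assumes a Node.js version with `Buffer.from`, and `decodeSingleByte` is a made-up helper name.

```javascript
const REPLACEMENT_CHAR = 0xFFFD; // 65533, the Unicode replacement character

// Hypothetical helper: decode a single byte as UTF-8 and return the first UTF-16 code unit.
function decodeSingleByte(byte) {
  return Buffer.from([byte]).toString('utf8').charCodeAt(0);
}

// Collect every byte value that cannot stand alone as valid UTF-8.
const invalidBytes = [];
for (let b = 0; b < 256; b++) {
  if (decodeSingleByte(b) === REPLACEMENT_CHAR) {
    invalidBytes.push(b);
  }
}

console.log('invalid single bytes:', invalidBytes.length);
console.log('first:', invalidBytes[0], 'last:', invalidBytes[invalidBytes.length - 1]);
```

Bytes 0 through 127 decode to themselves; every byte from 128 through 255 is either a continuation byte without a lead byte, a lead byte without its continuation bytes, or a value that never appears in UTF-8.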

Junkern commented Nov 24, 2015

Thanks for the response!
Ok, so basically as soon as I am not in the ASCII character set, a multi-byte sequence is required. To come back to my code: After inserting 127 (0111 1111), the next possible sequence would be 110xxxxx 10xxxxxx. Filling up the x with zeros and converting to int would result in 49820. So after 127 the next possible number to insert is 49820?

One more question: Where in the code is the part for checking whether a sequence is invalid UTF-8 and thus replaced with U+FFFD?

bnoordhuis (Member) commented

> So after 127 the next possible number to insert is 49820?

Close, it's 49792:

Buffer([128+64+2, 128]).toString().charCodeAt(0)  // 128
(128+64+2) * 256 + 128  // 49792
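
The same result can be reached from the other direction by filling the `110xxxxx 10xxxxxx` pattern by hand. The sketch below assumes a Node.js version with `Buffer.from`; `encodeTwoByte` is a hypothetical helper that only handles code points in the two-byte range. It also shows why leaving every `x` at zero does not work: the bytes 0xC0 0x80 would be a two-byte spelling of U+0000, an overlong form that UTF-8 decoders reject.

```javascript
// Hypothetical helper: encode a code point in the range 0x80..0x7FF as two UTF-8 bytes.
function encodeTwoByte(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7FF) {
    throw new RangeError('code point is outside the two-byte range');
  }
  const lead = 0xC0 | (codePoint >> 6);           // 110xxxxx: upper 5 bits
  const continuation = 0x80 | (codePoint & 0x3F); // 10xxxxxx: lower 6 bits
  return [lead, continuation];
}

const [lead, continuation] = encodeTwoByte(0x80);
console.log(lead, continuation);        // 194 128
console.log(lead * 256 + continuation); // 49792

// Node's own encoder produces the same two bytes for U+0080.
console.log(Buffer.from('\u0080', 'utf8'));

// All x bits zero gives 0xC0 0x80, which decodes to replacement characters.
const overlong = Buffer.from([0xC0, 0x80]).toString('utf8');
console.log(overlong.charCodeAt(0) === 0xFFFD); // true
```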

> Where in the code is the part for checking whether a sequence is invalid UTF-8 and thus replaced with U+FFFD?

That's done by V8's built-in UTF-8 decoder.
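
If the goal is to detect invalid input from JavaScript rather than read the decoder source, one option is to decode strictly and catch the failure. This is only a sketch under assumptions: it uses the WHATWG `TextDecoder` with `fatal: true`, which is not mentioned in this thread and needs a reasonably recent Node.js build with Intl support, and `isValidUtf8` is a made-up helper name. It is a separate API from `buf.toString()` and says nothing about how V8 implements the check internally.

```javascript
// Hypothetical helper: returns true when the bytes form well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With fatal: true the decoder throws a TypeError instead of emitting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUtf8(Buffer.from([0x41])));       // plain ASCII
console.log(isValidUtf8(Buffer.from([0xC2, 0x80]))); // U+0080 as a two-byte sequence
console.log(isValidUtf8(Buffer.from([0x80])));       // continuation byte on its own
```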

Junkern commented Nov 24, 2015

Yes, of course. I see my error now.
Thanks for helping, this gave me a good understanding of the workings of UTF-8 (and to some degree UCS-2) in the buffer.

Junkern closed this as completed Nov 24, 2015