Invalid characters in buffer #34

Junkern · 2015-11-23T13:29:18Z

I asked this in the following issue: nodejs/node#3982
However, I still have a few follow-up questions:
Why are the characters from U+007F up to U+00A0 considered invalid chars? Because they are not displayable? Also, where in the code is the distinction made whether a character is invalid?

bnoordhuis · 2015-11-23T17:33:31Z

Your code:

var length = 256;
var buf = new Buffer(length);
for (var i = 0; i < length; ++i) {
    buf[i] = i;
}
var string = buf.toString();
for (var i = 0; i < string.length; i++) {
    console.log("i", i, string[i], string.charCodeAt(i));
}

The salient part:

var string = buf.toString();  // Defaults to UTF-8.

UTF-8 is a variable-length encoding in which non-ASCII characters (i.e. code points > 127) are encoded as multi-byte sequences. Every byte in such a sequence has its high bit (bit 7, value 128) set.
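
To make the multi-byte point concrete, here is a small sketch (not part of the original reply) that prints how many bytes UTF-8 needs for a few sample characters. The sample characters are arbitrary picks for illustration; `Buffer.from` assumes a Node.js version newer than the one this thread was written against.

```javascript
// Sketch: UTF-8 byte counts for an ASCII letter, U+00E9, U+20AC and U+1F600.
const samples = ['A', '\u00e9', '\u20ac', '\u{1F600}'];
const byteLengths = samples.map((ch) => Buffer.byteLength(ch, 'utf8'));
console.log(byteLengths); // [ 1, 2, 3, 4 ]

// Every byte of a multi-byte sequence has its high bit (value 128) set.
const euroBytes = [...Buffer.from('\u20ac', 'utf8')]; // 0xE2 0x82 0xAC
const allHighBitsSet = euroBytes.every((b) => (b & 0x80) !== 0);
console.log(allHighBitsSet); // true
```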

What you're doing in your code snippet is creating invalid UTF-8 sequences. Those get replaced with U+FFFD (65533 in decimal, the Unicode replacement character) when converting the Buffer to string.
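
A minimal sketch of that replacement behaviour, assuming a current Node.js where `Buffer.from` is available: it decodes each byte value on its own, which is a simplification of the original loop, and collects the values that come back as U+FFFD.

```javascript
// Sketch: decode every byte value 0..255 by itself as UTF-8 and record
// which ones turn into the replacement character U+FFFD.
const replacedBytes = [];
for (let i = 0; i < 256; ++i) {
  const decoded = Buffer.from([i]).toString('utf8');
  if (decoded.charCodeAt(0) === 0xfffd) {
    replacedBytes.push(i);
  }
}
// A lone byte >= 128 is never a complete UTF-8 sequence, so all of them are replaced.
console.log(replacedBytes.length);                         // 128
console.log(replacedBytes[0], replacedBytes[127]);         // 128 255
```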

Junkern · 2015-11-24T12:33:00Z

Thanks for the response!
Ok, so basically as soon as I am not in the ASCII character set, a multi-byte sequence is required. To come back to my code: After inserting 127 (0111 1111), the next possible sequence would be 110xxxxx 10xxxxxx. Filling up the x with zeros and converting to int would result in 49820. So after 127 the next possible number to insert is 49820?

One more question: Where in the code is the part for checking whether a sequence is invalid UTF-8 and thus replaced with U+FFFD?

bnoordhuis · 2015-11-24T13:09:21Z

> So after 127 the next possible number to insert is 49820?

Close, it's 49792:

Buffer([128+64+2, 128]).toString().charCodeAt(0)  // 128
(128+64+2) * 256 + 128  // 49792
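
The arithmetic above can be sketched as a hand-rolled two-byte encoder. `encodeTwoByte` is a hypothetical helper written for illustration, not a Node.js API, and `Buffer.from` assumes a newer Node.js than this thread used. The sketch also shows why filling every x with zero does not work: 11000000 10000000 (0xC0 0x80) would be an overlong encoding of U+0000, which UTF-8 decoders reject.

```javascript
// Hypothetical helper: encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx.
function encodeTwoByte(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError('code point does not fit a two-byte sequence');
  }
  const leadByte = 0xc0 | (codePoint >> 6);           // 110xxxxx: upper 5 bits
  const continuationByte = 0x80 | (codePoint & 0x3f); // 10xxxxxx: lower 6 bits
  return [leadByte, continuationByte];
}

const [leadByte, continuationByte] = encodeTwoByte(0x80);
console.log(leadByte, continuationByte);       // 194 128
console.log(leadByte * 256 + continuationByte); // 49792
const roundTripped = Buffer.from([leadByte, continuationByte]).toString('utf8');
console.log(roundTripped.charCodeAt(0));       // 128

// The all-zero payload is overlong and decodes to U+FFFD instead of U+0000.
const overlong = Buffer.from([0xc0, 0x80]).toString('utf8');
console.log(overlong.charCodeAt(0) === 0xfffd); // true
```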

> Where in the code is the part for checking whether a sequence is invalid UTF-8 and thus replaced with U+FFFD?

That's done by V8's built-in UTF-8 decoder.
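
Since the decoder replaces invalid input silently, one way to notice it from user code is a round trip: decode, re-encode, and compare with the original bytes. This is only a sketch; `isValidUtf8` is a hypothetical helper name, and it assumes a Node.js version that provides `Buffer.from`.

```javascript
// Hypothetical helper: a buffer holds valid UTF-8 only if decoding and
// re-encoding it reproduces the same bytes. Any invalid sequence becomes
// U+FFFD, which re-encodes as 0xEF 0xBF 0xBD and breaks the comparison.
function isValidUtf8(buf) {
  return Buffer.from(buf.toString('utf8'), 'utf8').equals(buf);
}

console.log(isValidUtf8(Buffer.from([0xc2, 0x80]))); // true  (U+0080)
console.log(isValidUtf8(Buffer.from([0x80])));       // false (lone continuation byte)
```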

Junkern · 2015-11-24T13:39:14Z

Yes, of course. I see my error now.
Thanks for helping, this gave me a good understanding of the workings of UTF-8 (and to some degree UCS-2) in the buffer.

Junkern closed this as completed Nov 24, 2015
