I'd like to point out that it's very easy to end up with a buffer overrun bug with the SSE-optimized UTF-8 converter; the problem is subtle and easy to miss. Given your current interface guarantees, it isn't really a bug yet... :-)
So, if the tail of the string you're converting contains an ASCII character followed by a bunch of multibyte characters, the SSE code will kick in, convert all the multibyte characters as if they were ASCII, and write them into the destination buffer. The converter then notices it has written too many, walks things back, and resumes multibyte conversion at the first multibyte character.
But you've still written all those bytes. Someone may have sized the output buffer with prior knowledge of exactly how many code points will be generated, and a buffer sized that way is too small to handle this case, so there will be a buffer overrun. Even worse, the overrun is subtle, because the final reported output size will look fine and everything will appear to have worked.
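To make the hazard concrete, here's a minimal sketch of the store-then-check pattern I'm describing. This is not the actual code in this repository, and it assumes UTF-8 to UTF-32 output with one 32-bit slot per code point; the only point is that the widened bytes are stored before the ASCII test:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Illustrative only -- NOT the converter's actual code. Decodes ASCII
 * 16 bytes at a time into 32-bit slots, storing the widened bytes
 * BEFORE checking that the chunk really was pure ASCII. A chunk that
 * starts with ASCII but continues with multibyte lead bytes therefore
 * still gets written out in full before the loop bails to the scalar
 * multibyte path. */
static size_t ascii_fast_path(const uint8_t *src, size_t len, uint32_t *dst)
{
    size_t in = 0, out = 0;
    const __m128i zero = _mm_setzero_si128();
    while (in + 16 <= len) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + in));
        /* Zero-extend the 16 bytes to 16 dwords and store them. */
        __m128i lo = _mm_unpacklo_epi8(chunk, zero);
        __m128i hi = _mm_unpackhi_epi8(chunk, zero);
        _mm_storeu_si128((__m128i *)(dst + out +  0), _mm_unpacklo_epi16(lo, zero));
        _mm_storeu_si128((__m128i *)(dst + out +  4), _mm_unpackhi_epi16(lo, zero));
        _mm_storeu_si128((__m128i *)(dst + out +  8), _mm_unpacklo_epi16(hi, zero));
        _mm_storeu_si128((__m128i *)(dst + out + 12), _mm_unpackhi_epi16(hi, zero));
        /* The ASCII check happens only after the store... */
        if (_mm_movemask_epi8(chunk) != 0)
            break;  /* ...so we "walk back": out is not advanced past the stores */
        in += 16;
        out += 16;
    }
    return out;  /* caller resumes scalar conversion at src + in, dst + out */
}
```

With that shape, a destination sized to the exact code-point count can be overrun by the final, partly multibyte chunk, even though the `out` value eventually reported stays in bounds.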
So when implementing proper error handling, or just using this, people really need to keep this in mind and make the output buffer large enough to handle the last 43 bytes of the input buffer being converted as ASCII, even if they aren't.
That window needs to be larger than 16 bytes because someone might have an ASCII byte followed by fourteen 3-byte sequences: that's 43 input bytes but only 15 code points, so an exactly sized buffer has only 15 output slots left, while the SSE code will write to 16 output slots.
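One conservative way to size the destination under that rule, as a sketch (the helper names here are made up, and it again assumes one output slot per code point):

```c
#include <stddef.h>

/* Hypothetical helper, not part of the library: counts code points by
 * counting non-continuation bytes. Assumes the input is valid UTF-8. */
static size_t count_code_points(const unsigned char *p, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (p[i] & 0xC0) != 0x80;
    return count;
}

/* Output capacity (in slots) that stays safe even if the SSE fast path
 * treats the last 43 input bytes as ASCII: exact count for the body,
 * plus one slot per byte for the 43-byte tail. */
static size_t safe_output_capacity(const unsigned char *src, size_t len)
{
    const size_t TAIL = 43;
    if (len <= TAIL)
        return len;  /* every byte might be emitted as its own slot */
    return count_code_points(src, len - TAIL) + TAIL;
}
```

Compared to an exact count this wastes a few dozen slots at most, which seems like a fair trade for not having to reason about how far the fast path can overshoot.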