Come-up with consistent approach for handling invalid UTF-8 strings #283

rajsite · 2017-06-06T15:07:01Z

Currently the function used for reading strings from Vireo's memory assumes a valid UTF-8 format:

VireoSDK/source/core/module_coreHelpers.js, line 60 in 6a94624:

    // TODO mraj assumes valid UTF8 encoding https://github.com/ni/VireoSDK/issues/283

As Vireo / LabVIEW strings are often treated as byte arrays, the possibility of invalid UTF-8 string encodings is high.

I incorrectly believed that the string corruption from assuming valid code points would not cause internal errors: #281 (comment)

However, I ended up finding a case where this does result in an error. For the LabVIEW string \127, when it is read as JSON the string is encoded as "\127"

The algorithm causes the string to result in "\crazystuff, where the closing quote is lost in the corruption, which produces invalid JSON. So while no error occurs in Vireo's space, when the string is parsed in JS land an exception is thrown because it is invalid JSON.
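To make the failure mode concrete, here is a minimal sketch of a decoder that trusts every lead byte. The name `naiveDecodeUtf8` and the sample byte 0xF0 are assumptions chosen for illustration; this is not the code in module_coreHelpers.js and not the exact byte sequence from the case above.

```javascript
// Hypothetical naive decoder: it trusts each lead byte and never checks
// that the bytes it consumes afterwards are continuation bytes (10xxxxxx).
function naiveDecodeUtf8(bytes) {
    let result = '';
    let i = 0;
    while (i < bytes.length) {
        const b = bytes[i];
        let extra;
        let codePoint;
        if (b < 0x80) { extra = 0; codePoint = b; }
        else if (b < 0xE0) { extra = 1; codePoint = b & 0x1F; }
        else if (b < 0xF0) { extra = 2; codePoint = b & 0x0F; }
        else { extra = 3; codePoint = b & 0x07; }
        for (let j = 1; j <= extra; j += 1) {
            // Bytes past the end of the buffer are undefined and treated as 0.
            codePoint = (codePoint << 6) | ((bytes[i + j] || 0) & 0x3F);
        }
        // Clamp so String.fromCodePoint does not throw on out-of-range values.
        result += String.fromCodePoint(Math.min(codePoint, 0x10FFFF));
        i += extra + 1;
    }
    return result;
}

// JSON text for a one-byte string: quote, stray 0xF0 byte, quote.
const corruptJsonBytes = new Uint8Array([0x22, 0xF0, 0x22]);
const corruptJsonText = naiveDecodeUtf8(corruptJsonBytes);
// The 0xF0 lead byte swallowed the closing quote, so JSON.parse(corruptJsonText)
// now throws a SyntaxError for an unterminated string.
```

The stray byte claims three continuation bytes, so the decoder consumes the closing quote as part of a single bogus code point.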

Some discussion on Wikipedia: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

The wiki article describes "popular" replacement options:

1. Use the Unicode replacement character
2. Use the low byte of the invalid code point range U+DC80-U+DCFF
3. Use code points U+0080–U+00FF with the same value as the byte
4. The Unicode code point for the character represented by the byte in CP1252

Some thoughts on the options:

1. After reading through the other popular options I think the replacement character will be the clearest. It makes it very easy to identify corrupt bytes and the number of corrupt bytes. It is also semantically appropriate. The replacement character is intended to replace unknown, unrecognized, or unrepresentable characters.
2. Does not work because we are going from UTF-8 to UTF-16, and the invalid code point range is used for UTF-16 surrogate pairs.
3. This was my first thought for an implementation, but it results in a lot of collisions, i.e. if the user did intend to use code points U+0080-U+00FF in correctly formatted UTF-8, they would be indistinguishable from invalid bytes mapped to that range.

   I think it's also possible to miss the corruption because there are valid and not too uncommon symbols in the range.
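The objections to options 2 and 3 can be reproduced in a few lines. This is a sketch using the standard `TextDecoder` and `TextEncoder` globals (available in browsers and recent Node.js); the variable names and sample bytes are assumptions for illustration.

```javascript
// Option 3: map an invalid byte to the code point with the same value.
// Well-formed UTF-8 for U+00E9 is the two bytes C3 A9.
const validEAcute = new TextDecoder('utf-8').decode(new Uint8Array([0xC3, 0xA9]));
// A lone 0xE9 byte is not valid UTF-8; option 3 would map it to U+00E9.
const mappedInvalidByte = String.fromCharCode(0xE9);
// Both results are the same string, so the corruption is no longer visible.
const collides = validEAcute === mappedInvalidByte;

// Option 1 keeps the two cases distinct: the lone byte becomes U+FFFD.
const replacedInvalidByte = new TextDecoder('utf-8').decode(new Uint8Array([0xE9]));

// Option 2: U+DC80 is a lone low surrogate in a UTF-16 JS string. It cannot be
// encoded back to UTF-8, so TextEncoder substitutes U+FFFD (bytes EF BF BD).
const loneSurrogate = String.fromCharCode(0xDC80);
const reencoded = Array.from(new TextEncoder().encode(loneSurrogate));
```

Here `collides` is `true`, while `replacedInvalidByte` is the single character U+FFFD and stays distinguishable from a real U+00E9.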

The string functions are used to read values from Vireo memory for controls and indicators. If the user (Vireo integrator) believes invalid UTF-8 strings / arbitrary byte arrays may be present, they should not read strings as JSON strings; they should read them as JSON arrays of numbers or use another API function. The benefit of readJSON on a string type is to aid in conversion from UTF-8 encoded strings to UTF-16 encoded strings.
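As a rough illustration of the byte-array route, the sketch below assumes the integrator already has the value as JSON text containing an array of numbers. The helper names `bytesFromJsonArray` and `tryDecodeStrict` are hypothetical, and the JSON text is hand-written rather than produced by any Vireo API call.

```javascript
// Assumption: jsonText is a JSON array of byte values, e.g. "[72,105,255]".
function bytesFromJsonArray(jsonText) {
    return Uint8Array.from(JSON.parse(jsonText));
}

// Decode strictly: with { fatal: true } the standard TextDecoder throws a
// TypeError on malformed UTF-8 instead of silently substituting characters.
function tryDecodeStrict(bytes) {
    try {
        return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    } catch (e) {
        return null; // caller keeps working with the raw bytes
    }
}

const cleanText = tryDecodeStrict(bytesFromJsonArray('[72,105]'));
const corruptText = tryDecodeStrict(bytesFromJsonArray('[72,105,255]'));
```

With this approach the bytes always survive the JSON round trip, and the caller decides explicitly whether to interpret them as text.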

Edit: It turns out Windows-1252 and Unicode intentionally match across most of U+0080-U+00FF (Windows-1252 differs only for bytes 0x80-0x9F).
We are running in a browser, which is a Unicode environment, so this is not relevant.


rajsite · 2017-06-27T00:00:43Z

I am working towards implementing option 1 above. Will add a look ahead to validate UTF-8 bytes and replace invalid bytes with the Unicode Replacement Character U+FFFD.
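A look-ahead decoder along these lines could be sketched as follows. This is an illustration under stated assumptions, not the change that was eventually merged: the function name `decodeUtf8WithReplacement` is hypothetical, and emitting one U+FFFD per invalid byte is one possible policy (the standard `TextDecoder` groups some invalid bytes differently).

```javascript
// Sketch: validate each UTF-8 sequence before accepting it. Every byte that
// is not part of a well-formed sequence becomes one U+FFFD.
function decodeUtf8WithReplacement(bytes) {
    const REPLACEMENT = '\uFFFD';
    let result = '';
    let i = 0;
    while (i < bytes.length) {
        const b0 = bytes[i];
        let length = 0; // stays 0 for bytes that can never start a sequence
        let codePoint = 0;
        let min = 0; // smallest code point allowed for this sequence length
        if (b0 < 0x80) { length = 1; codePoint = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { length = 2; codePoint = b0 & 0x1F; min = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { length = 3; codePoint = b0 & 0x0F; min = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { length = 4; codePoint = b0 & 0x07; min = 0x10000; }

        // Look ahead: every trailing byte must exist and match 10xxxxxx.
        let valid = length > 0 && i + length <= bytes.length;
        for (let j = 1; valid && j < length; j += 1) {
            const next = bytes[i + j];
            valid = (next & 0xC0) === 0x80;
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates, and values above U+10FFFF.
        valid = valid && codePoint >= min && codePoint <= 0x10FFFF &&
            !(codePoint >= 0xD800 && codePoint <= 0xDFFF);

        if (valid) {
            result += String.fromCodePoint(codePoint);
            i += length;
        } else {
            result += REPLACEMENT;
            i += 1; // resynchronize on the very next byte
        }
    }
    return result;
}

// The closing quote survives a stray lead byte, so the JSON stays parseable.
const safeJsonText = decodeUtf8WithReplacement(new Uint8Array([0x22, 0xF0, 0x22]));
const parsedValue = JSON.parse(safeJsonText);
```

Advancing by a single byte after a failure keeps later valid bytes, such as a closing quote, from being consumed, and the number of U+FFFD characters matches the number of corrupt bytes.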

rajsite · 2017-06-27T02:15:17Z

Looks like desktop came to the same decision

rajsite · 2017-06-27T22:41:26Z

After some more thought I am realizing that the EggShell_ReadValueString function should always be returning a byte buffer that represents valid UTF-8 encoded JSON.

The implementation above, which modifies core_helpers.js, handles the case where the byte buffer contains invalid UTF-8 bytes, so the burden is being placed on JS to handle that. I'll create a separate issue for improving the output of EggShell_ReadValueString.

rajsite mentioned this issue Jun 6, 2017

Fix how UTF8 strings are read from memory #281

Merged

rajsite self-assigned this Jun 26, 2017

rajsite mentioned this issue Jul 10, 2017

HTTP Binary support and safe string reads from Vireo #292

Merged

rajsite closed this as completed Jul 10, 2017

This was referenced Jul 10, 2017

eggShell.readJSON & Flatten / Unflatten JSON behavior #294

Open

Pointer_stringify with UTF-8 string and byte length copies entire heap buffer emscripten-core/emscripten#4693

Closed

rajsite mentioned this issue Jan 2, 2018

readJSON should always return valid, safe UTF-8 strings #366

Open
