Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Come-up with consistent approach for handling invalid UTF-8 strings #283

Closed
rajsite opened this issue Jun 6, 2017 · 3 comments
Closed
Assignees

Comments

@rajsite
Copy link
Member

rajsite commented Jun 6, 2017

Currently the function used for reading strings from Vireo's memory assumes a valid UTF-8 format:

// TODO mraj assumes valid UTF8 encoding https://github.com/ni/VireoSDK/issues/283

As Vireo / LabVIEW strings are often treated as byte arrays the possibility for invalid UTF-8 string encodings is high.

I incorrectly believed that the string corruption from assuming valid code points would not cause internal errors: #281 (comment)

However I ended up finding a case where this does result in an error. For the LabVIEW string \127 when it is read as JSON the string is encoded as "\127"

The algorithm causes the string to results in "\crazystuff causing the closing quote to be lost in corruption and resulting in invalid JSON. So while no error occurs in Vireos space, when the string is parsed in JS land an exception is thrown because it is invalid JSON.

Some discussion on Wikipedia: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

The wiki article describes "popular" replacement options:

  1. Use the unicode replacement character
  2. Use the low byte of the invalid code point range U+DC80-U+DCFF
  3. Use code points U+0080–U+00FF with the same value as the byte
  4. The Unicode code point for the character represented by the byte in CP1252

Some thoughts on the options:

  1. After reading through the other popular options I think this will be the clearest. It makes it very easy to identify corrupt bytes and the number of corrupt bytes. It is also semantically appropriate. The replacement character is intended to replace unknown, unrecognized, or unrepresentable characters.

  2. Does not work because we are going from UTF-8 to UTF-16 and the invalid code point range is utilized for UTF-16 surrogate pairs

  3. This was my first thought at implementation but it results in a lot of collisions. ie if the user did intend to use codepoints U+0080-U+00FF in correctly formatted UTF-8 they would be indistinguishable from invalid bytes mapped to that range.

    I think it's also possible to miss the corruption because there are valid and not too uncommon symbols in the range.

    As the string functions are used to read values from Vireo memory for controls and indicators, if the user (vireo integrator) believes invalid UTF-8 strings / arbitrary byte arrays may be present they should not try to read strings as JSON strings but as JSON arrays of numbers or using another API function. The benefit of readJSON on a string type is to aid in conversion from UTF-8 encoded strings to UTF-16 encoded strings.

    Edit: So turns out Windows-1252 and Unicode are intentionally matching from U+0080-U+00FF

  4. Running in Browser environment Unicode environment so not relevant

@rajsite
Copy link
Member Author

rajsite commented Jun 27, 2017

I am working towards implementing option 1 above. Will add a look ahead to validate UTF-8 bytes and replace invalid bytes with the Unicode Replacement Character U+FFFD.

@rajsite
Copy link
Member Author

rajsite commented Jun 27, 2017

Looks like desktop came to the same decision

replace character

@rajsite
Copy link
Member Author

rajsite commented Jun 27, 2017

After some more thought I am realizing that the EggShell_ReadValueString function should always be returning a byte buffer that represents valid UTF-8 encoded JSON.

The implementation above which modifies core_helpers.js handles the case where the bytebuffer represents invalid UTF-8 bytes and the burden is being placed on JS to handle that. I'll create a separate issue for improving the output of readvaluestring

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant