You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
// TODO mraj assumes valid UTF8 encoding https://github.com/ni/VireoSDK/issues/283
As Vireo / LabVIEW strings are often treated as byte arrays the possibility for invalid UTF-8 string encodings is high.
I incorrectly believed that the string corruption from assuming valid code points would not cause internal errors: #281 (comment)
However I ended up finding a case where this does result in an error. For the LabVIEW string \127 when it is read as JSON the string is encoded as "\127"
The algorithm causes the string to results in "\crazystuff causing the closing quote to be lost in corruption and resulting in invalid JSON. So while no error occurs in Vireos space, when the string is parsed in JS land an exception is thrown because it is invalid JSON.
The wiki article describes "popular" replacement options:
Use the unicode replacement character
Use the low byte of the invalid code point range U+DC80-U+DCFF
Use code points U+0080–U+00FF with the same value as the byte
The Unicode code point for the character represented by the byte in CP1252
Some thoughts on the options:
After reading through the other popular options I think this will be the clearest. It makes it very easy to identify corrupt bytes and the number of corrupt bytes. It is also semantically appropriate. The replacement character is intended to replace unknown, unrecognized, or unrepresentable characters.
Does not work because we are going from UTF-8 to UTF-16 and the invalid code point range is utilized for UTF-16 surrogate pairs
This was my first thought at implementation but it results in a lot of collisions. ie if the user did intend to use codepoints U+0080-U+00FF in correctly formatted UTF-8 they would be indistinguishable from invalid bytes mapped to that range.
I think it's also possible to miss the corruption because there are valid and not too uncommon symbols in the range.
As the string functions are used to read values from Vireo memory for controls and indicators, if the user (vireo integrator) believes invalid UTF-8 strings / arbitrary byte arrays may be present they should not try to read strings as JSON strings but as JSON arrays of numbers or using another API function. The benefit of readJSON on a string type is to aid in conversion from UTF-8 encoded strings to UTF-16 encoded strings.
Edit: So turns out Windows-1252 and Unicode are intentionally matching from U+0080-U+00FF
Running in Browser environment Unicode environment so not relevant
The text was updated successfully, but these errors were encountered:
I am working towards implementing option 1 above. Will add a look ahead to validate UTF-8 bytes and replace invalid bytes with the Unicode Replacement Character U+FFFD.
After some more thought I am realizing that the EggShell_ReadValueString function should always be returning a byte buffer that represents valid UTF-8 encoded JSON.
The implementation above which modifies core_helpers.js handles the case where the bytebuffer represents invalid UTF-8 bytes and the burden is being placed on JS to handle that. I'll create a separate issue for improving the output of readvaluestring
Currently the function used for reading strings from Vireo's memory assumes a valid UTF-8 format:
VireoSDK/source/core/module_coreHelpers.js
Line 60 in 6a94624
As Vireo / LabVIEW strings are often treated as byte arrays the possibility for invalid UTF-8 string encodings is high.
I incorrectly believed that the string corruption from assuming valid code points would not cause internal errors: #281 (comment)
However I ended up finding a case where this does result in an error. For the LabVIEW string
\127
when it is read as JSON the string is encoded as"\127"
The algorithm causes the string to results in
"\crazystuff
causing the closing quote to be lost in corruption and resulting in invalid JSON. So while no error occurs in Vireos space, when the string is parsed in JS land an exception is thrown because it is invalid JSON.Some discussion on Wikipedia: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
The wiki article describes "popular" replacement options:
Some thoughts on the options:
After reading through the other popular options I think this will be the clearest. It makes it very easy to identify corrupt bytes and the number of corrupt bytes. It is also semantically appropriate. The replacement character is intended to replace unknown, unrecognized, or unrepresentable characters.
Does not work because we are going from UTF-8 to UTF-16 and the invalid code point range is utilized for UTF-16 surrogate pairs
This was my first thought at implementation but it results in a lot of collisions. ie if the user did intend to use codepoints U+0080-U+00FF in correctly formatted UTF-8 they would be indistinguishable from invalid bytes mapped to that range.
I think it's also possible to miss the corruption because there are valid and not too uncommon symbols in the range.
As the string functions are used to read values from Vireo memory for controls and indicators, if the user (vireo integrator) believes invalid UTF-8 strings / arbitrary byte arrays may be present they should not try to read strings as JSON strings but as JSON arrays of numbers or using another API function. The benefit of readJSON on a string type is to aid in conversion from UTF-8 encoded strings to UTF-16 encoded strings.
Edit: So turns out Windows-1252 and Unicode are intentionally matching from U+0080-U+00FF
Running in Browser environment Unicode environment so not relevant
The text was updated successfully, but these errors were encountered: