Buggy support for binary string data #587

Closed
zvrba opened this issue May 16, 2017 · 19 comments

@zvrba

zvrba commented May 16, 2017

I wanted to check whether I could use this library to store binary data w/o prior encoding to base64. Therefore I wrote the following simple function

#include <iostream>
#include <string>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

static void TestBinDump()
{
    // 6 bytes, including 0x8B (not valid UTF-8 here) and an embedded NUL
    const char* data = "A\x13""B\x8B\0C";
    std::string s(data, data + 6);

    json js;
    js["key"] = s;

    auto dump = js.dump(4);
    std::cout << s.size() << " " << dump << std::endl;

    auto jd = json::parse(dump.begin(), dump.end());
    std::string v = jd.at("key");
    std::cout << v.size();
}

This correctly outputs 6 { "key": "A\u0013Bï\u0000C" }. The first surprise is that the byte data has been encoded using 16-bit Unicode escapes, so I wanted to see whether the "8-bit" Unicode characters would get translated back to 8-bit values. I didn't get that far, because parsing threw an invalid_argument exception with the message "parse error - unexpected '\"'".

I'm using version 2.1.1.

@nlohmann
Owner

I realized that on my machine, the key is serialized to "A\u0013B\213\u0000C". That byte \213 made me think: \213 or \x8B is 10001011 in binary, which is invalid UTF-8: bytes starting with 10 must follow bytes starting with 11, which is not the case here, because the preceding B is 01000010. That said, the given byte sequence in data is not valid UTF-8, and therefore its serialization cannot be expressed by this library (or maybe even in JSON at all).

Your output shows ï, which is Unicode U+00EF and would need the string literal \u00ef rather than \213. That said, the input A\x13""B\u00ef\0C yields the serialization A\u0013Bï\u0000C, which properly round-trips (note that the size of the input string is then 7 instead of 6 bytes).

I'm not entirely sure whether my take on this is right, so I would be happy for feedback on this.
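
For illustration of the lead/continuation-byte rule mentioned above, here is a minimal standalone check (not code from this library, and not a full UTF-8 validator; it only inspects the top two bits of each byte):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// A continuation byte has the bit pattern 10xxxxxx.
static bool is_continuation(std::uint8_t b) { return (b & 0xC0) == 0x80; }

// A lead byte of a multi-byte sequence has the bit pattern 11xxxxxx.
static bool is_lead(std::uint8_t b) { return (b & 0xC0) == 0xC0; }

int main()
{
    const std::uint8_t data[] = { 0x41, 0x13, 0x42, 0x8B, 0x00, 0x43 };
    for (std::size_t i = 0; i < sizeof(data); ++i) {
        if (is_continuation(data[i]) && (i == 0 || !is_lead(data[i - 1]))) {
            // 0x8B trips this check: it is a continuation byte, but the byte
            // before it is 0x42 ('B'), plain ASCII rather than a lead byte.
            std::printf("invalid continuation byte 0x%02X at index %zu\n",
                        static_cast<unsigned>(data[i]), i);
        }
    }
}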

@jaredgrubb
Contributor

FWIW, Python rejects serializing it (throws "UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 3: invalid start byte").

@zvrba
Author

zvrba commented May 18, 2017

I don't understand why you have to assume that the input is encoded as UTF-8. ECMA-404 has the following to say about strings:

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.

The first two sentences seem to imply that you can dump any binary data verbatim by treating pairs of bytes as Unicode characters, except where such a pair of bytes would represent a character that MUST be escaped. If you end up with a single-byte remainder of the string, represent it as \u00XX. Decoding would reverse the process.

If you put in meaningful Unicode, ASCII or UTF-8 data, the data is still meaningful (i.e., "readable text") after encoding. (You'd need to adopt a "big-endian" convention for representing the wide characters; the spec doesn't say anything about this, and there's even a special Unicode code point, the byte order mark, intended for that purpose.) If you put in binary data, you get "garbage", but valid garbage, after encoding. In both cases, decoding can reconstruct both the length and the data.

I.e., my proposal is to treat the input as 16-bit wide characters. With some care around escaping special characters, this can work nicely with ASCII/UTF-8, wide characters and binary data, while also preserving the length and not violating the spec in any way.

@jaredgrubb
Contributor

As you quote, a JSON string is a sequence of "Unicode code points" -- but the technical term "code point" has NOTHING to do with bits and bytes. Bits and bytes happen once you start talking about encodings, like UTF-8 or UTF-16.

For example, if you have the Unicode code point U+0058 (the letter X), then we still have no idea what that looks like in memory. It depends on what encoding you attach to your in-memory string. It could look like "58" (UTF-8), or "0058" or "5800" (UTF16-BE, -LE), etc.

JSON cannot encode arbitrary binary data, just sequences of code points -- which, again, have nothing to do with binary representations :)

The nlohmann library has chosen that strings be encoded in UTF-8 (see the README); he could have chosen any of the Unicode encoding formats (or supported them all!), but he had to pick one of them, and UTF-8 is a great one (see http://utf8everywhere.org/ for a great read on Unicode and encodings; it's a fascinating topic).

Your input is not valid UTF-8, and therefore it should have been rejected by the encoder.

And, in case you still don't believe me, do a quick search for "JSON encoding binary" and you'll see lots of people trying to figure out good ways to do this -- which should be a hint that it's not natively supported by JSON. (For example, this popular StackOverflow question: http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64)

@zvrba
Author

zvrba commented May 18, 2017

What I'm saying is that you can pretend that your input data is big-endian UTF-16. With minimal munging of the data (escaping what must be escaped), you can serialize binary data w/o loss of information. At the least, the serializer/deserializer could have an option to switch from assuming UTF-8 to this munging of binary strings.

@jaredgrubb
Contributor

You're right that there are ways of encoding illegal bytes (for example, like Python does with surrogate escapes). Another great way to do it is to encode the binary in Base64 or Base96, or as a string of hex digits [A-F0-9]. There are infinite ways to put arbitrary binary into a string, but:

  • none of these bin->str protocols are specified by the JSON spec
  • they're not universal or portable (for example, if you put "A\x13B\x8B\0C" through the encoder you imagine, and then load that file.json through any other JSON parser, like Python or Ruby or whatever, then I guarantee that you will NOT get back the bytes "A\x13B\x8B\0C" from any of them)

Look, I know you're trying to solve a real problem, and what you're proposing is reasonable on its face. However, the decoding/encoding you're proposing has to take place at a layer higher than the JSON parser/encoder, and I don't think that layer belongs in this library -- but, in the end, that's not my call, so you don't have to convince me :)
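
To make "a layer higher than the JSON parser/encoder" concrete, here is a rough sketch with hypothetical to_hex/from_hex helpers (not part of nlohmann/json) that hex-encode the bytes before they reach the library and decode them again after parsing:

#include <cstddef>
#include <string>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

// Hypothetical helpers: hex-encode a byte string so that it is plain ASCII and
// therefore always valid UTF-8 from the library's point of view.
static std::string to_hex(const std::string& bytes)
{
    static const char* digits = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char c : bytes) {
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}

static std::string from_hex(const std::string& hex) // expects lowercase hex
{
    auto nibble = [](char c) { return (c >= 'a') ? c - 'a' + 10 : c - '0'; };
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2) {
        out.push_back(static_cast<char>((nibble(hex[i]) << 4) | nibble(hex[i + 1])));
    }
    return out;
}

int main()
{
    const char* data = "A\x13""B\x8B\0C";
    std::string bytes(data, data + 6);

    json j;
    j["key"] = to_hex(bytes); // encoded before it ever reaches the library

    json parsed = json::parse(j.dump());
    std::string roundtrip = from_hex(parsed.at("key").get<std::string>());
    // roundtrip now equals the original 6 bytes, including the embedded NUL.
}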

@nlohmann added the "state: please discuss" label (please discuss the issue or vote for your favorite option) on May 18, 2017
@Type1J
Contributor

Type1J commented May 18, 2017

JSON is encoded uniformly. If you want to encode binary data, then you need to use CBOR, MessagePack, or BSON. This library includes 2 of those options. If you encode a std::vector as JSON, you might want to use an array, not a string, and then use CBOR or MessagePack on the wire. (I forget what the function names are, but they're in the README.md.)
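
A minimal sketch of that route, assuming the payload is kept as a std::vector of bytes (json::to_cbor and json::from_cbor are the functions referred to above; the array-of-numbers representation is just one possible choice):

#include <cstdint>
#include <vector>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

int main()
{
    // Keep the payload as numbers, so no UTF-8 assumption is involved.
    std::vector<std::uint8_t> payload = { 0x41, 0x13, 0x42, 0x8B, 0x00, 0x43 };

    json j;
    j["key"] = payload; // stored as the array [65, 19, 66, 139, 0, 67]

    // Binary wire format instead of JSON text.
    std::vector<std::uint8_t> wire = json::to_cbor(j);

    // Round trip: parse the CBOR and recover the original bytes.
    json back = json::from_cbor(wire);
    auto recovered = back.at("key").get<std::vector<std::uint8_t>>();
}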

@nlohmann
Owner

I think I need to better understand the expectation for this.

In C++, a string is just a sequence of bytes, and their meaning depends on the encoding. The example "A\x13B\x8B\0C" is a sequence of the bytes 0x41 (A), 0x13 (\x13), 0x42 (B), 0x8b (\213), 0x00 (\0), 0x43 (C).

When we want to create an nlohmann::json value from this byte sequence, we assume a UTF-8 encoding (see the notes). The simple assignment from std::string to nlohmann::json does not implement a check that throws an exception when the string isn't properly encoded (as is the case here). Whether such a check should be implemented should be discussed, but that does not seem to be the subject of this issue.

However, the byte sequence is not valid UTF-8, UTF-16, or UTF-32. Therefore, I fail to understand what we should do with the input in the first place.

@zvrba
Author

zvrba commented May 19, 2017

I think I need to better understand the expectation for this. [...] However, the byte sequence is not valid UTF-8, UTF-16, or UTF-32. Therefore, I fail to understand what we should do with the input in the first place.

I can describe the expectation simply. Given a byte sequence (which may or may not be a valid UTF-8 string), apply a minimal munging transformation that introduces UTF-16 escapes to encode groups of bytes with the high bit set. The rest is passed through verbatim.

Concrete algorithm: when the current and next characters are both printable ASCII, pass both through unchanged. When either of them is a control character or has the high bit set, convert the pair to a 16-bit Unicode escape. The example would thus become "\u4113\u428B\u0043". Given ordinary ASCII text like "ABCDE", the output would be "ABCDE". A mixture like "AB\xA9CEFGH" would become "AB\uA943EFGH".

Now any decoder which is told that the input should be decoded as big-endian UTF-16 would reconstruct the same binary string. Yes, it would treat "AB" as a single UTF-16 character, but that is of no consequence, since we're interested in the underlying binary string.

Does this seem like a reasonable expectation?

The proposal is controversial, so this should be enabled only when a flag is set on the encoder/decoder.
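
For concreteness, here is a rough sketch of the encoding half of this proposal (a hypothetical helper, not part of the library; escaping of '"' and '\' and the decoding side are omitted):

#include <cstddef>
#include <cstdio>
#include <string>

// Rough sketch of the pairing scheme described above (hypothetical helper, not
// part of nlohmann/json): bytes are taken two at a time; pairs of printable
// ASCII pass through, anything else becomes a big-endian \uXXXX escape.
static std::string encode_pairs(const std::string& bytes)
{
    auto printable = [](unsigned char c) { return c >= 0x20 && c < 0x7F; };
    std::string out;
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned char hi = bytes[i], lo = bytes[i + 1];
        if (printable(hi) && printable(lo)) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            char buf[8];
            std::snprintf(buf, sizeof(buf), "\\u%02X%02X",
                          static_cast<unsigned>(hi), static_cast<unsigned>(lo));
            out += buf;
        }
    }
    if (i < bytes.size()) { // single-byte remainder becomes \u00XX
        char buf[8];
        std::snprintf(buf, sizeof(buf), "\\u00%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        out += buf;
    }
    return out;
}

int main()
{
    std::string s("A\x13""B\x8B\0C", 6);
    std::printf("%s\n", encode_pairs(s).c_str()); // prints \u4113\u428B\u0043
}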

@nlohmann
Owner

Now I understand. Indeed, the required "algorithm" would be similar to the one required in #330. However, I am not yet convinced that this should be done in an assignment from std::string to json - maybe an additional function would be more suited here.

@zvrba
Author

zvrba commented May 19, 2017

I have no preference either way. I would maybe prefer an additional function instead of a flag, to make it obvious what's going on. But what about decoding/extraction? The parser has to be told in some way not to convert Unicode escapes to UTF-8.

@nlohmann
Owner

I don't understand. I only support UTF-8.

@zvrba
Author

zvrba commented May 19, 2017

I assume that when your parser encounters an escape like \uA933, it will convert it to its UTF-8 representation, right? For the above transformation to work, the parser would have to output the two bytes verbatim.

@jaredgrubb
Contributor

When this JSON parser sees the JSON string "\uA933", it will return the UTF-8 string with 3 bytes: 0xEA, 0xA4, 0xB3. (See: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=43312&view=3.)
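
This is easy to check against the library itself; a minimal sketch:

#include <cstdio>
#include <string>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

int main()
{
    // Parse the JSON document "\uA933", i.e. a string holding one code point.
    json j = json::parse("\"\\uA933\"");
    std::string s = j.get<std::string>();

    // Prints "3 bytes: ea a4 b3" -- the UTF-8 encoding of U+A933.
    std::printf("%zu bytes:", s.size());
    for (unsigned char c : s)
        std::printf(" %02x", static_cast<unsigned>(c));
    std::printf("\n");
}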

@nlohmann
Owner

The parser has to be told in some way to not convert Unicode escapes to UTF8.

No, this does not make sense. It is the duty of the parser to convert JSON strings into the target encoding, which is UTF-8 in this library. If you want the strings to remain the same, why use a JSON parser in the first place?

@zvrba
Author

zvrba commented May 19, 2017

Yes, I'm well aware of that, hence my comment. Regardless of whether this transformation is implemented automatically in assignment from string to json, or as a separate function, the parser has to be told to not convert Unicode escapes to UTF8.

@nlohmann
Owner

To be clear: the assignment from string to json has nothing to do with the parser, and I shall not change the parser in this regard. Parsing escaped Unicode shall always return a UTF-8 encoded string.

@zvrba
Author

zvrba commented May 19, 2017

Implementing this as a parser mode flag costs little, but would greatly increase the utility of the library, because it would spare all users from implementing their own binary-to-string encodings, yet you dogmatically reject it. (Even in C++ it's usual to use std::string instead of std::vector<char> for binary blobs.) I guess it's fork time.

@nlohmann
Owner

I won't change the parser so that it no longer assumes UTF-8. This would be out of scope for this library.

Good luck forking!
