Buggy support for binary string data #587

Closed
zvrba opened this issue May 16, 2017 · 19 comments

@zvrba

zvrba commented May 16, 2017

I wanted to check whether I could use this library to store binary data w/o prior encoding to base64. Therefore I wrote the following simple function

#include <iostream>
#include <string>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

static void TestBinDump()
{
    // 6 bytes, including 0x8B (not valid UTF-8 here) and an embedded NUL
    const char* data = "A\x13""B\x8B\0C";
    std::string s(data, data + 6);

    json js;
    js["key"] = s;

    auto dump = js.dump(4);
    std::cout << s.size() << " " << dump << std::endl;

    auto jd = json::parse(dump.begin(), dump.end());
    std::string v = jd.at("key");
    std::cout << v.size();
}

This correctly outputs 6 { "key": "A\u0013Bï\u0000C" }. The first surprise is that the byte data has been encoded using 16-bit Unicode escapes, so I wanted to see whether the "8-bit" Unicode characters would get translated back to 8-bit values. I didn't get that far, because parsing threw an invalid_argument exception with the message "parse error - unexpected '\"'".

I'm using version 2.1.1.

@nlohmann
Owner

I realized that on my machine, the key is serialized to "A\u0013B\213\u0000C". That byte \213 made me think: \213 or \x8B is 10001011 in binary, which is invalid UTF-8: bytes starting with 10 must follow bytes starting with 11, which is not the case here, because the preceding B is 01000010. That said, the given byte sequence in data is not valid UTF-8, and therefore its serialization cannot be expressed by this library (or maybe even in JSON at all).

Your output shows ï, which is Unicode U+00EF and would need the string literal \u00ef rather than \213. That said, the input A\x13""B\u00ef\0C yields the serialization A\u0013Bï\u0000C, which properly round-trips (note that the size of the input string is then 7 instead of 6 bytes).

I'm not entirely sure whether my take on this is right, so I would be happy for feedback on this.
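
For illustration of the lead/continuation-byte rule mentioned above, here is a minimal standalone check (not code from this library, and not a full UTF-8 validator; it only inspects the top two bits of each byte):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// A continuation byte has the bit pattern 10xxxxxx.
static bool is_continuation(std::uint8_t b) { return (b & 0xC0) == 0x80; }

// A lead byte of a multi-byte sequence has the bit pattern 11xxxxxx.
static bool is_lead(std::uint8_t b) { return (b & 0xC0) == 0xC0; }

int main()
{
    const std::uint8_t data[] = { 0x41, 0x13, 0x42, 0x8B, 0x00, 0x43 };
    for (std::size_t i = 0; i < sizeof(data); ++i) {
        if (is_continuation(data[i]) && (i == 0 || !is_lead(data[i - 1]))) {
            // 0x8B trips this check: it is a continuation byte, but the byte
            // before it is 0x42 ('B'), plain ASCII rather than a lead byte.
            std::printf("invalid continuation byte 0x%02X at index %zu\n",
                        static_cast<unsigned>(data[i]), i);
        }
    }
}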

@jaredgrubb
Contributor

FWIW, Python rejects serializing it (throws "UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 3: invalid start byte").

@zvrba
Author

zvrba commented May 18, 2017

I don't understand why you have to assume that the input is encoded as UTF-8. ECMA-404 has the following to say about strings:

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.

The first two sentences seem to imply that you can dump any binary data verbatim by treating pairs of bytes as Unicode characters, except where such a pair of bytes would represent a character that MUST be escaped. If you end up with a single-byte remainder of the string, represent it as \u00XX. Decoding would reverse the process.

If you put in meaningful Unicode, ASCII or UTF-8 data, the data is still meaningful (i.e., "readable text") after encoding. (You'd need to adopt a "big-endian" convention for representing the wide characters; the spec doesn't say anything about this, and there's even a special Unicode code point, the byte order mark, intended for that purpose.) If you put in binary data, you get "garbage", but valid garbage, after encoding. In both cases, decoding can reconstruct both the length and the data.

I.e., my proposal is to treat the input as 16-bit wide characters. With some care around escaping special characters, this can work nicely with ASCII/UTF-8, wide characters and binary data, while also preserving the length and not violating the spec in any way.

@jaredgrubb
Contributor

As you quote, a JSON string is a sequence of "Unicode code points" -- but the technical term "code point" has NOTHING to do with bits and bytes. Bits and bytes happen once you start talking about encodings, like UTF-8 or UTF-16.

For example, if you have the Unicode code point U+0058 (the letter X), then we still have no idea what that looks like in memory. It depends on what encoding you attach to your in-memory string. It could look like "58" (UTF-8), or "0058" or "5800" (UTF16-BE, -LE), etc.

JSON cannot encode arbitrary binary data, just sequences of code points -- which, again, have nothing to do with binary representations :)

The nlohmann library has chosen that strings be encoded in UTF-8 (see the README); he could have chosen any of the Unicode encoding formats (or supported them all!), but he had to pick one of them, and UTF-8 is a great one (see http://utf8everywhere.org/ for a great read on Unicode and encodings; it's a fascinating topic).

Your input is not valid UTF-8, and therefore it should have been rejected by the encoder.

And, in case you still don't believe me, do a quick search for "JSON encoding binary" and you'll see lots of people trying to figure out good ways to do this -- which should be a hint that it's not natively supported by JSON. (For example, this popular StackOverflow question: http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64)

@zvrba
Author

zvrba commented May 18, 2017

What I'm saying is that you can pretend that your input data is big-endian UTF-16. With minimal munging of the data (escaping what must be escaped), you can serialize binary data w/o loss of information. At the least, the serializer/deserializer could have an option to switch from assuming UTF-8 to this munging of binary strings.

@jaredgrubb
Contributor

You're right that there are ways of encoding illegal bytes (for example, like Python does with surrogate escapes). Another great way to do it is to encode the binary in Base64 or Base96, or as a string of hex digits [A-F0-9]. There are infinite ways to put arbitrary binary into a string, but:

  • none of these bin->str protocols are specified by the JSON spec
  • they're not universal or portable (for example, if you put "A\x13B\x8B\0C" through the encoder you imagine, and then load that file.json through any other JSON parser, like Python or Ruby or whatever, then I guarantee that you will NOT get back the bytes "A\x13B\x8B\0C" from any of them)

Look, I know you're trying to solve a real problem, and what you're proposing is reasonable on its face. However, the decoding/encoding you're proposing has to take place at a layer higher than the JSON parser/encoder, and I don't think that layer belongs in this library -- but, in the end, that's not my call, so you don't have to convince me :)
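
To make "a layer higher than the JSON parser/encoder" concrete, here is a rough sketch with hypothetical to_hex/from_hex helpers (not part of nlohmann/json) that hex-encode the bytes before they reach the library and decode them again after parsing:

#include <cstddef>
#include <string>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

// Hypothetical helpers: hex-encode a byte string so that it is plain ASCII and
// therefore always valid UTF-8 from the library's point of view.
static std::string to_hex(const std::string& bytes)
{
    static const char* digits = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char c : bytes) {
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}

static std::string from_hex(const std::string& hex) // expects lowercase hex
{
    auto nibble = [](char c) { return (c >= 'a') ? c - 'a' + 10 : c - '0'; };
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2) {
        out.push_back(static_cast<char>((nibble(hex[i]) << 4) | nibble(hex[i + 1])));
    }
    return out;
}

int main()
{
    const char* data = "A\x13""B\x8B\0C";
    std::string bytes(data, data + 6);

    json j;
    j["key"] = to_hex(bytes); // encoded before it ever reaches the library

    json parsed = json::parse(j.dump());
    std::string roundtrip = from_hex(parsed.at("key").get<std::string>());
    // roundtrip now equals the original 6 bytes, including the embedded NUL.
}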

@nlohmann added the "state: please discuss" label (please discuss the issue or vote for your favorite option) on May 18, 2017
@Type1J
Contributor

Type1J commented May 18, 2017

JSON is encoded uniformly. If you want to encode binary data, then you need to use CBOR, MessagePack, or BSON. This library includes 2 of those options. If you encode a std::vector as JSON, you might want to use an array, not a string, and then use CBOR or MessagePack on the wire. (I forget what the function names are, but they're in the README.md.)
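
A minimal sketch of that route, assuming the payload is kept as a std::vector of bytes (json::to_cbor and json::from_cbor are the functions referred to above; the array-of-numbers representation is just one possible choice):

#include <cstdint>
#include <vector>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

int main()
{
    // Keep the payload as numbers, so no UTF-8 assumption is involved.
    std::vector<std::uint8_t> payload = { 0x41, 0x13, 0x42, 0x8B, 0x00, 0x43 };

    json j;
    j["key"] = payload; // stored as the array [65, 19, 66, 139, 0, 67]

    // Binary wire format instead of JSON text.
    std::vector<std::uint8_t> wire = json::to_cbor(j);

    // Round trip: parse the CBOR and recover the original bytes.
    json back = json::from_cbor(wire);
    auto recovered = back.at("key").get<std::vector<std::uint8_t>>();
}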

@nlohmann
Owner

I think I need to better understand the expectation for this.

In C++, a string is just a sequence of bytes, and their meaning depends on the encoding. The example "A\x13B\x8B\0C" is a sequence of the bytes 0x41 (A), 0x13 (\x13), 0x42 (B), 0x8b (\213), 0x00 (\0), 0x43 (C).

When we want to create an nlohmann::json value from this byte sequence, we assume a UTF-8 encoding (see the notes). The simple assignment from std::string to nlohmann::json does not implement a check that throws an exception when the string isn't properly encoded (as is the case here). Whether such a check should be implemented should be discussed, but that does not seem to be the subject of this issue.

However, the byte sequence is not valid UTF-8, UTF-16, or UTF-32. Therefore, I fail to understand what we should do with the input in the first place.

@zvrba
Author

zvrba commented May 19, 2017

I think I need to better understand the expectation for this. [...] However, the byte sequence is not valid UTF-8, UTF-16, or UTF-32. Therefore, I fail to understand what we should do with the input in the first place.

I can describe the expectation simply. Given a byte sequence (which may or may not be a valid UTF-8 string), apply a minimal munging transformation that introduces UTF-16 escapes to encode groups of bytes with the high bit set. The rest is passed through verbatim.

Concrete algorithm: when the current and next characters are both printable ASCII, pass both through unchanged. When either of them is a control character or has the high bit set, convert the pair to a 16-bit Unicode escape. The example would thus become "\u4113\u428B\u0043". Given ordinary ASCII text like "ABCDE", the output would be "ABCDE". A mixture like "AB\xA9CEFGH" would become "AB\uA943EFGH".

Now any decoder which is told that the input should be decoded as big-endian UTF-16 would reconstruct the same binary string. Yes, it would treat "AB" as a single UTF-16 character, but that is of no consequence, since we're interested in the underlying binary string.

Does this seem like a reasonable expectation?

The proposal is controversial, so this should be enabled only when a flag is set on the encoder/decoder.
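
For concreteness, here is a rough sketch of the encoding half of this proposal (a hypothetical helper, not part of the library; escaping of '"' and '\' and the decoding side are omitted):

#include <cstddef>
#include <cstdio>
#include <string>

// Rough sketch of the pairing scheme described above (hypothetical helper, not
// part of nlohmann/json): bytes are taken two at a time; pairs of printable
// ASCII pass through, anything else becomes a big-endian \uXXXX escape.
static std::string encode_pairs(const std::string& bytes)
{
    auto printable = [](unsigned char c) { return c >= 0x20 && c < 0x7F; };
    std::string out;
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned char hi = bytes[i], lo = bytes[i + 1];
        if (printable(hi) && printable(lo)) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            char buf[8];
            std::snprintf(buf, sizeof(buf), "\\u%02X%02X",
                          static_cast<unsigned>(hi), static_cast<unsigned>(lo));
            out += buf;
        }
    }
    if (i < bytes.size()) { // single-byte remainder becomes \u00XX
        char buf[8];
        std::snprintf(buf, sizeof(buf), "\\u00%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        out += buf;
    }
    return out;
}

int main()
{
    std::string s("A\x13""B\x8B\0C", 6);
    std::printf("%s\n", encode_pairs(s).c_str()); // prints \u4113\u428B\u0043
}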

@nlohmann
Owner

Now I understand. Indeed, the required "algorithm" would be similar to the one required in #330. However, I am not yet convinced that this should be done in an assignment from std::string to json - maybe an additional function would be more suited here.

@zvrba
Author

zvrba commented May 19, 2017

I have no preference either way. I would maybe prefer an additional function instead of a flag, to make it obvious what's going on. But what about decoding/extraction? The parser has to be told in some way not to convert Unicode escapes to UTF-8.

@nlohmann
Owner

I don't understand. I only support UTF-8.

@zvrba
Author

zvrba commented May 19, 2017

I assume that when your parser encounters an escape like \uA933, it will convert it to its UTF-8 representation, right? For the above transformation to work, the parser would have to output the two bytes verbatim.

@jaredgrubb
Contributor

When this JSON parser sees the JSON string "\uA933", it will return the UTF-8 string with 3 bytes: 0xEA, 0xA4, 0xB3. (See: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=43312&view=3.)
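
This is easy to check against the library itself; a minimal sketch:

#include <cstdio>
#include <string>
#include "json.hpp" // the 2.x single header; <nlohmann/json.hpp> in later releases

using json = nlohmann::json;

int main()
{
    // Parse the JSON document "\uA933", i.e. a string holding one code point.
    json j = json::parse("\"\\uA933\"");
    std::string s = j.get<std::string>();

    // Prints "3 bytes: ea a4 b3" -- the UTF-8 encoding of U+A933.
    std::printf("%zu bytes:", s.size());
    for (unsigned char c : s)
        std::printf(" %02x", static_cast<unsigned>(c));
    std::printf("\n");
}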

@nlohmann
Owner

The parser has to be told in some way to not convert Unicode escapes to UTF8.

No, this does not make sense. It is the duty of the parser to convert JSON strings into the target encoding, which is UTF-8 in this library. If you want the strings to remain the same, why use a JSON parser in the first place?

@zvrba
Author

zvrba commented May 19, 2017

Yes, I'm well aware of that, hence my comment. Regardless of whether this transformation is implemented automatically in assignment from string to json, or as a separate function, the parser has to be told to not convert Unicode escapes to UTF8.

@nlohmann
Owner

To be clear: the assignment from string to json has nothing to do with the parser, and I shall not change the parser in this regard. Parsing escaped Unicode shall always return a UTF-8 encoded string.

@zvrba
Author

zvrba commented May 19, 2017

Implementing this as a parser mode flag costs little, but would greatly increase the utility of the library, because it would spare all users from implementing their own binary-to-string encodings, yet you dogmatically reject it. (Even in C++ it's usual to use std::string instead of std::vector<char> for binary blobs.) I guess it's fork time.

@nlohmann
Owner

I won't change the parser so that it no longer assumes UTF-8. This would be out of scope for this library.

Good luck forking!
