Buggy support for binary string data #587
Comments
I realized that on my machine, the key is serialized to […]. Your output shows […]. I'm not entirely sure whether my take on this is right, so I would be happy for feedback on this.
FWIW, Python rejects serializing it (throws "UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 3: invalid start byte").
I don't understand why you have to assume that the input is encoded as UTF-8. ECMA-404 has the following to say about strings:

> […]
The first two sentences seem to imply that you can dump any binary data verbatim by treating pairs of bytes as Unicode characters, except when such a pair of bytes would represent a character that MUST be quoted. If you end up with a single-byte remainder of the string, represent it as […]. If you put in meaningful Unicode, ASCII, or UTF-8 data, the data is still meaningful (i.e., "readable text") after encoding. (Thus, you'd need to adopt a "big-endian" convention for representing wide characters, but the spec doesn't say anything about this, and there's even a special Unicode code point used for that purpose.) If you put in binary data, you get "garbage", but valid garbage, after encoding. In both cases, decoding can reconstruct both length and data.

In other words, my proposal is to treat the input as being 16-bit wide characters. With some care around quoting special characters, this can work nicely with ASCII/UTF-8, wide characters, and binary data, while also preserving length and not violating the spec in any way.
As you quote, a JSON string is a sequence of "Unicode code points" -- but the technical term "code point" has NOTHING to do with bits and bytes. Bits and bytes happen once you start talking about encodings, like UTF-8 or UTF-16. For example, if you have the "Unicode code point U+0058" (the letter X), we still have no idea what that looks like in memory. It depends on what encoding you attach to your in-memory string. It could look like "58" (UTF-8), "0058" (UTF-16BE), or "5800" (UTF-16LE), etc. JSON cannot encode arbitrary binary, just sequences of code points -- which, again, have nothing to do with binary representations :)

The nlohmann library has chosen that strings be encoded in UTF-8 (see the README); he could have chosen any of the Unicode encoding formats (or supported them all!), but he had to pick one of them, and UTF-8 is a great one (see http://utf8everywhere.org/ for a great read on Unicode and encodings; it's a fascinating topic). Therefore, your input is not valid UTF-8 and should have been rejected by the encoder.

And, in case you still don't believe me, do a quick search for "JSON encoding binary" and you'll see lots of people trying to figure out good ways to do this -- which should be a hint that it's not natively supported by JSON. (For example, this popular StackOverflow question: http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64)
What I'm saying is that you can pretend that your input data is big-endian UTF-16. With minimal munging of data (escaping what must be escaped), you can serialize binary data w/o loss of information. At least, the serializer/deserializer could have an option to switch from assuming UTF-8 to this munging of binary strings.
You're right that there are ways of encoding illegal bytes (for example, like Python does with surrogate escapes). Another great way to do it is to encode the binary in Base64 or Base96, or as a string of hex digits [A-F0-9]. There are infinite ways to put arbitrary binary into a string, but: […]
Look, I know you're trying to solve a real problem, and what you're proposing is reasonable on its face. However, the decoding/encoding you're proposing has to take place at a layer higher than the JSON parser/encoder, and I don't think that layer belongs in this library -- but, in the end, that's not my call, so you don't have to convince me :)
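To illustrate what "a layer higher" could look like, here is a minimal hex-encoding sketch (`to_hex`/`from_hex` are made-up helper names, not part of the library):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Encode arbitrary bytes as a hex string before putting them into a JSON
// value, and decode them again after parsing; the JSON stays plain ASCII.
std::string to_hex(const std::string& bytes)
{
    std::string out;
    for (unsigned char c : bytes)
    {
        char buf[3];
        std::snprintf(buf, sizeof(buf), "%02X", c);
        out += buf;
    }
    return out;
}

std::string from_hex(const std::string& hex)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out += static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16));
    return out;
}

// Usage: j["key"] = to_hex(binary);   binary = from_hex(j["key"].get<std::string>());
```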
JSON is encoded uniformly. If you want to encode binary, then you need to use CBOR, Message Pack, or BSON. This library includes two of those options. If you encode a std::vector as JSON, you might want to use an array, not a string, and then use CBOR or Message Pack on the wire. (I forget what the function names are, but they're in the […].)
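A minimal sketch of that approach -- store the bytes as a JSON array of numbers and move them over the wire in a binary format. I believe the helpers are named `json::to_cbor`/`json::from_cbor`, but that is worth double-checking against the README:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>
#include <nlohmann/json.hpp>  // "json.hpp" for the 2.x single-header releases

using nlohmann::json;

int main()
{
    // Arbitrary bytes, including ones that are not valid UTF-8.
    std::vector<std::uint8_t> blob = {0x41, 0x13, 0x42, 0x8B, 0x00, 0x43};

    // Store the bytes as a JSON array of numbers, not as a string.
    json j;
    j["key"] = blob;

    // Serialize to CBOR; the result is a byte vector, not JSON text.
    std::vector<std::uint8_t> wire = json::to_cbor(j);

    // Round-trip and recover the original bytes.
    json parsed = json::from_cbor(wire);
    auto restored = parsed["key"].get<std::vector<std::uint8_t>>();

    std::cout << (restored == blob ? "round-trip ok" : "mismatch") << '\n';
}
```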
I think I need to better understand the expectation for this. In C++, a string is just a sequence of bytes, and their meaning depends on the encoding. The example […]. When we want to create an […]. However, the byte sequence is not valid UTF-8, UTF-16, or UTF-32. Therefore, I fail to understand what we should do with the input in the first place.
I can describe the expectation simply. Given a byte sequence (which may or may not be a valid UTF-8 string), apply a minimally munging transformation which introduces UTF-16 characters to encode groups of bytes with the high bit set; the rest is passed through verbatim. Concrete algorithm: when the current and the next character are both printable ASCII, pass both through unchanged. When either of them is a control character or has the high bit set, convert the pair to a 16-bit Unicode value. The example would thus become `"\u4113\u428B\u0043"`. Given ordinary ASCII text like `"ABCDE"`, the output would be `"ABCDE"`. A mixture like `"AB\xA9CEFGH"` would become `"AB\uA943EFGH"`. Now any decoder which is told that the input should be decoded as big-endian UTF-16 would reconstruct the same binary string. Yes, it would treat "AB" as a single UTF-16 character, but that is of no consequence, since we're interested in the underlying binary string. Does this seem like a reasonable expectation? The proposal is controversial, so this should be enabled only when a flag is set on the encoder/decoder.
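For concreteness, here is a rough sketch of that transformation (not part of the library; `escape_binary` is a made-up name, and the handling of a single trailing byte is a guess, since that part of the proposal above was elided):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Sketch of the proposed escaping: printable-ASCII pairs pass through
// unchanged, any other pair is emitted as one big-endian \uXXXX escape.
// A real implementation would also need to escape '"' and '\\'.
std::string escape_binary(const std::string& input)
{
    auto printable = [](unsigned char c) { return c >= 0x20 && c < 0x7F; };

    std::string out;
    for (std::size_t i = 0; i < input.size(); i += 2)
    {
        const unsigned char hi = static_cast<unsigned char>(input[i]);
        char buf[8];

        if (i + 1 == input.size())
        {
            // Single-byte remainder; \u00XX is one plausible reading of the proposal.
            std::snprintf(buf, sizeof(buf), "\\u00%02X", hi);
            out += buf;
            break;
        }

        const unsigned char lo = static_cast<unsigned char>(input[i + 1]);
        if (printable(hi) && printable(lo))
        {
            out += static_cast<char>(hi);
            out += static_cast<char>(lo);
        }
        else
        {
            // Encode the pair as a single 16-bit value, high byte first.
            std::snprintf(buf, sizeof(buf), "\\u%02X%02X", hi, lo);
            out += buf;
        }
    }
    return out;
}

// escape_binary(std::string("A\x13" "B\x8B" "\x00" "C", 6)) == "\\u4113\\u428B\\u0043"
```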
Now I understand. Indeed, the required "algorithm" would be similar to the one required in #330. However, I am not yet convinced that this should be done in an assignment from […].
I have no preference either way. I would maybe prefer an additional function instead of a flag, to make it obvious what's going on. But what about decoding/extraction? The parser has to be told in some way not to convert Unicode escapes to UTF-8.
I don't understand. I only support UTF-8.
I assume that when your parser encounters an escape like […]
When this JSON parser sees the JSON string […]
No, this does not make sense. It is the duty of the parser to convert JSON strings into the target encoding, which is UTF-8 in this library. If you want the strings to remain the same, why use a JSON parser in the first place?
Yes, I'm well aware of that, hence my comment. Regardless of whether this transformation is implemented automatically in an assignment from […]
To make this clear: the assignment from string to json has nothing to do with the parser, and I shall not change the parser in this regard. Parsing escaped Unicode shall always return a UTF-8 encoded string.
Implementing this as a parser mode flag costs little, but it greatly increases the utility of the library, because it spares all users from implementing their own binary-to-string encodings, yet you dogmatically reject this. (Even in C++ it's usual to use […])
I won't touch the parser to make it not cope with UTF-8. This would be out of the scope of this library. Good luck forking!
I wanted to check whether I could use this library to store binary data w/o prior encoding to base64. Therefore I wrote the following simple function:
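(The snippet itself did not survive in this copy of the thread; the following is a minimal sketch of what the test may have looked like, reconstructed from the byte values and output discussed below -- the function name and exact code are assumptions.)

```cpp
#include <iostream>
#include <string>
#include "json.hpp"  // nlohmann/json 2.1.1 single header

using nlohmann::json;

// Hypothetical reconstruction: store 6 raw bytes (a control byte, a byte
// with the high bit set, and an embedded NUL among them) in a JSON string,
// dump the document, and try to parse the dump again.
static void test_binary_roundtrip()
{
    const std::string binary("A\x13" "B\x8B" "\x00" "C", 6);

    json j;
    j["key"] = binary;

    std::cout << binary.size() << " " << j << std::endl;

    // In the report below, this parse step threw invalid_argument.
    json parsed = json::parse(j.dump());
    std::cout << parsed["key"].get<std::string>().size() << std::endl;
}
```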
This correctly outputs `6 { "key": "A\u0013Bï\u0000C" }`. The first surprise is that the byte data has been encoded to 16-bit Unicode escapes, so I wanted to see whether such "8-bit" Unicode characters would get translated back to 8-bit values. I didn't get that far, because parsing threw an `invalid_argument` exception with the message `"parse error - unexpected '\"'"`. I'm using version 2.1.1.