Binary strings #2736
Conversation
I'm concerned that the former approach (making blobs behave like strings) will be troublesome or confusing, basically because "strings" are already troublesome enough. (JSON? Raw? Sequences of codepoints? Valid UTF-8? Invalid?)

Consider for example [...] Well, maybe that wouldn't be so bad, but consider another example: [...]

If the main goal of supporting blobs is to have a highly compact way of storing large arrays of small integers in a way that allows for various operations and transformations to be implemented efficiently, then your original intuition seems to me correct.

No doubt I'm missing something important, but this does seem like a fine opportunity to plug for string[i] as shorthand for string[i: i+1] :-)
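To illustrate the shorthand being plugged here (hypothetical; plain numeric indexing of a string is an error in jq 1.6):

```jq
"hello"[1:2]   # => "e"   (string slicing works in jq today)
"hello"[1]     # would also yield "e" under the proposed shorthand
```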
As a possible alternative to the hex notation, I think supporting "\xHH" notation in string literals would be useful (I didn't want to add it in my PR because I wanted to avoid adding new features). I suspect it shouldn't be allowed in actual JSON string literals (since the notation is not allowed in JSON [0]), but it could be allowed in jq string literals.

[0] Though I have wondered if it could make sense to have a flag to allow it and also to emit it, instead of emitting the illegal UTF-8 bytes. This is actually my biggest gripe against JSON: the fact that it has a notation for representing arbitrary sequences of UTF-16 code units but not arbitrary sequences of UTF-8 code units. Perhaps with some hindsight, there could have been an expectation for UTF-16 systems to interpret "\xHH" sequences as UTF-8 just as UTF-8 systems interpret "\uHHHH" sequences as UTF-16.
While:

```jq
("💩" | length) == 1               # like in Python
("💩" | tobinary8 | length) == 4   # like in C
("💩" | tobinary16 | length) == 2  # like in JavaScript
```
I'm not following.

Yes, [...]

The [...]

Certainly there's nothing unnatural about representing blobs as arrays of bytes. But I think there's nothing unnatural about representing them as non-UTF-8 strings too, and in terms of what I would have to do to [...]

Ah, that's another thing: we currently have string slice syntax [...] If we had [...]

So I think simply adding iteration and indexing support for strings and blobs is enough to get the semantics I'd originally had in mind for binary as arrays of small integers, but with the benefit that there would be no concerns like "what happens if you have a binary (array of bytes) and try to append or set a value that is not a byte value?".

Also, thinking about it, representing binary blobs as arrays of bytes would have presented difficulties w.r.t. path expressions. Since string iteration/indexing wouldn't contribute to path expressions, I now think it's more natural to represent binary as a sub-type of strings. Also, in other languages binary is typically string-like, at least as to literal value syntax.
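A sketch of those semantics, assuming this PR's iteration and indexing on strings and binaries (the outputs are what the description above implies, not verified behavior):

```jq
"💩" | [.[]]              # codepoints of a string: [128169]
"💩" | .[0]               # 0th codepoint: 128169
"💩" | tobinary | [.[]]   # bytes of a binary: [240, 159, 146, 169]
"💩" | tobinary | .[0]    # 0th byte: 240
```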
Yes! I was missing that. I'll add it.
UTF-16 is proof that -or at least very strongly suggestive of- time machines don't exist, and never will exist, or are/will be too expensive to use, or that fear of paradoxes will limit their use to just observation. UTF-16 needs to die in a fire, and if jq not supporting it helps it die, so much the better!

Now, more seriously, if we had a byteblob binary type, we would also then be able to write jq code that uses that to implement UTF-16. Having a string sub-type that is UTF-16 might have some value, but I would like first to get experience with byte blobs before we add UTF-16 support.
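For instance, a minimal sketch of jq-coded UTF-16LE decoding over a byte blob, assuming (per this PR's design) that indexing a binary yields numeric byte values and that `length` counts bytes; the helper name is made up:

```jq
# Collect 16-bit UTF-16LE code units from a byte blob.
def utf16le_units:
  . as $b
  | [range(0; ($b | length); 2) | $b[.] + 256 * $b[. + 1]];
```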
Indeed, I'm not interested in innovating in JSON. Having participated in IETF threads with thousands of posts about publishing RFC 7159, I'm not inclined to believe that we could alter JSON to support binary, and I do not relish the thought of repeating that experience.
With string sub-types indicating output options we could certainly allow oddball, not-quite-JSON formats like JSON w/ WTF-8, but for true binary I am only interested in either emitting errors or auto-base64-encoding for now. Eventually something like WTF-8b would indeed allow encoding of binary as something very close to UTF-8, if not actually UTF-8 (like, if we used private use codepoints to represent the WTF-8 encoding of broken surrogates then the result could be true UTF-8 rather than WTF-8). But even here we'd be stepping on the Unicode Consortium's toes -- it would be much much better, but also much much harder, to get the UC to allocate 128 codepoints for this purpose and then define something like a Unicode encoding of binary data.

So you can see I'm reluctant to innovate on the JSON side and the Unicode fronts. I'm not resolutely opposed to it though: we could have command-line options to enable these for input/output, and we could label them experimental. But I'd like to get something a bit more standards-compliant done first.
(Force-pushed a36f347 to 42837a1.)
I now see that this approach and my old "binary as array of small integers" idea are... remarkably similar. The differences are: [...]

As long as we add [...]
Hmmm. That's largely what I was trying to say :-) But let me outline two radical variations of the "blob as array of bytes" idea. For brevity, I'll use $ab to signify a JSON array of integers in range(0;256). The two variants are: [...]

Of course, both techniques can be used if one wants to support [...]
It's not really possible to correctly do it this way. The only correct way to encode invalid Unicode in such a way that valid Unicode is passed through unchanged (ie, all valid UTF-8 strings have the same meaning in WTF-8) is to encode the ill-formed Unicode sequences into ill-formed Unicode sequences.

WTF-8 works by encoding ill-formed UTF-16 (unpaired surrogates) into invalid[0] UTF-8 (invalid "generalised UTF-8" encodings of UTF-16 surrogate code points). Any valid UTF-16 already has a corresponding valid UTF-8 encoding, and vice versa—these encodings can't be reused. The "WTF-8b" extension additionally encodes ill-formed UTF-8 bytes as other invalid UTF-8 bytes. This includes all WTF-8-specific and WTF-8b-specific sequences (it's fundamentally not possible for this process to be idempotent, since it should not be possible to generate encoded UTF-16 errors from UTF-8 binary data).

If ill-formed Unicode is encoded as valid Unicode, it won't be distinguishable from previously valid Unicode. It would be particularly incorrect to emit invalid Unicode in response to certain valid Unicode (eg, text that happens to contain these 128 hypothetical code points—they would still be Unicode scalar values, so they can appear in valid UTF-8 or UTF-16 text ... or binary data that just happens to look like such UTF-8 text).

[0] I'm distinguishing here between "ill-formed" and "invalid", where "invalid" bytes would be an ill-formed sequence that never occurs as a substring of valid Unicode text—these forms are particularly useful in WTF-8/WTF-8b since they cannot be generated accidentally through string concatenation.
That works provided other systems understand it. Encoding non-UTF-8 as valid UTF-8 with special codepoints also only works if other systems understand how to decode that, but it has the advantage that other systems that do not know how to decode it will pass it through unmolested.
I think the purpose of any such encoding should only be for internal use. Except for debugging purposes, it should preferably not be possible to observe the internal byte representation of these strings. I think a reasonable way of exposing the error bytes/surrogates would be as negative code points when iterating, eg:
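(The original example here was lost; the following is a hypothetical illustration of the suggestion, not behavior of either PR.)

```jq
# Hypothetically, iterating a string containing the stray invalid byte
# 0xFF could yield a negative value alongside ordinary code points:
# "a<0xFF>b" | [.[]]   # => [97, -255, 98]
```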
@leonid-s-usov I'm trying to understand the jvp flags thing you added. What is the intent regarding adding new flags? Why not use the [...]
(Force-pushed 42837a1 to fe7661f.)
jq will only ever support JSON types -- new value types can't be added because they can't be represented in JSON. jq could add a typing mechanism that amounts to JSON schemas for data, and maybe typing for jq functions too (so we could do typechecking), but this is all way beyond the scope of this PR, and I don't think binary data support should wait for any of that.

If we had to have a notion of "class" I'd do it a bit like Perl 5: provide a way to bless a JSON object (and maybe arrays too) with a "class" and add a way to find out the class of one, but with the addition of a JSON schema and validation. Obviously a lot can be debated there, but definitely jq cannot add new value types.
String or array makes little difference now, but I much prefer string now. Again, the only real difference now would be what [...]

Maybe someone can make a convincing argument that allowing [...]
There's no reason that 16-bit word strings couldn't be a flavor of [...]

So the only arguments I see here are about a) which is more natural for [...]
(Force-pushed fe7661f to e015388.)
@nicowilliams wrote:
Precisely. That's the whole point of my two variations. You used the term "flavor", so by all means go with that if you prefer.

To summarize: The first variation basically involves new filters and a convention about JSON objects. These can both be ignored entirely by the user; and if the user ignores them, there will be no impact on the user. The second variation is even less visible, as there is no convention, just some new filters and some behind-the-scenes stuff.
(Force-pushed a056ceb to c081f05.)
@pkoppstein you might want to kick the tires on this. It's starting to be usable!
Conversions to base64, byte array, or UTF-8 (w/ bad character mapping) happen on output or on [...]
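For instance (the shapes here are taken from the manual examples discussed later in this thread):

```jq
"foo" | tobinary | [type, stringtype, encoding]
# => ["string","binary","base64"]
"foo" | tobinary | encodeas("bytearray") | [type, stringtype, encoding]
# => ["string","binary","bytearray"]
```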
(Force-pushed b78a40b to 7db1557.)
I might punt on WTF-8 and let @Maxdamantus implement that on top of this when this is done :)
(Force-pushed 6923da2 to 7aecd7c.)
I'm getting three compilation errors: [...]
Try again now?
Yay!

[Unless you strenuously object, I propose deleting completely useless and ephemeral messages in this thread (and potentially others, too).]

I noticed that you're proposing to extend + to allow both: [...] and [...]

The other day, you were warning about the perils of polymorphism, so [...] or at least: [...]

More importantly, #2 seems quite wrong from a jq-perspective: since a [...]
Kicking the tires... What might be done about the proliferation of unwieldy names? Since you've introduced [...]

Agreed, "tobytearray" and "toutf8" are unreadable at best and [...]
@nicowilliams wrote:
Yes, please, even if it's only one such mode (for now)! It would fill a really big gap.
@itchyny of course makes important points about this expression, and I would agree that allowing [...]

First, the [...]

Second, it should be remembered that in jq 1.6: [...] raises an error. So the new behavior makes a feature out of an error, [...]

Third, being able to write $string[$i] rather than $string[$i:$i+1] is really [...]

Fourth, there is a huge tension within the current version of the binary_strings branch: [...]

Of course, if we do "revert" to allowing $string[$i], then the change [...]
In real use cases, users may want to get the first element from response fields whose type is [...]
I think you're very confused. Form-insensitive comparison means that every string which is distinct under [...]

The only good argument I can see against form-insensitive string comparisons (and normalizing-to-hash for object key hashing) is a performance argument. By and large all the Unicode strings you'll see are NFC, except that there are things like OS X's HFS+ which decompose into NFD (or something very close to it), so in fact there are ways to get equivalent-but-[...]
See above.
Got it. Yes, well, I wouldn't use NFKC or NFKD for string comparison. I would use NFD because if you're normalizing character-by-character as needed for string comparison, then why canonically-decompose-then-canonically-compose -as NFC requires- when you can just canonically decompose and stop there?
I'm quite aware.
And what are Python's operators for comparison w/ normalization? I would consider [...]
I've a feeling that anything other than slurping all binary input will be a fair bit of work, both for the author and the reviewers.

That's a good idea. The [...]

(Before looking at how that [...]

Hmm, I want to rewrite the section of the manual on destructuring.
This commit adds `tobinary/0`, which converts a normal string (or an array of bytes, flattened) to a binary string that will be base64-encoded on output. The string will remain unencoded during the execution of the jq program -- encoding is not applied until the string is to be output, or until `tostring` is applied.

Also added is an `encodeas/1` that takes a string argument of `"base64"`, `"hex"`, `"bytearray"`, or `"UTF-8"` and outputs its input altered so that on output the string will be encoded in base64, hex, as an array of bytes, or be converted to UTF-8 by applying UTF-8 validation and bad character mapping.

As well, there is a `tobinary/1` that converts a stream of strings, numeric unsigned byte values, and arrays of bytes, to a binary. There is also an `isbinary/0` which indicates whether the input is binary, a `stringtype/0` which indicates whether a string is binary or UTF-8, and an `encoding/0` which indicates the output encoding of the input.
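A usage sketch for the stream-oriented `tobinary/1` and the predicates described above (argument shapes per the commit message; the output is assumed, not verified):

```jq
# Bytes can come from strings, numbers in 0-255, or arrays of bytes:
tobinary("AB", 67, [68]) | [isbinary, stringtype, encoding]
# => [true, "binary", "base64"]
```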
(Force-pushed d85a8bd to 3590d6c.)
That by itself would be fantastic!
manual.yml -
(1) typo in sentence beginning:
Binary strings are encoded accoring to the encoding selected
(2) The following looks like something went wrong with copy/paste:
Arrays of unsigned byte values and arrays of .. unsigned byte values
(3) "when when"
(4) Change:

Outputs either "UTF-8", "base64", or "bytearray"

to s.t. like:

Outputs the encoding, currently one of "UTF-8", "base64", "bytearray", or "hex",
(5) Please consider breaking up the following "example" into separate, pithier pieces that will avoid unwanted line breaks:

```yaml
- program: '[(tostring,tobinary,(tobinary|encodeas("bytearray")),(tobinary|encodeas("UTF-8")))|[type,stringtype,encoding]]'
  input: '"foo"'
  output: ['[["string","UTF-8","UTF-8"],["string","binary","base64"],["string","binary","bytearray"],["string","binary","UTF-8"]]']
```

For example:

```yaml
- program: '[tostring,tobinary | [type,stringtype,encoding]]'
  input: '"foo"'
  output: ['[["string","UTF-8","UTF-8"],["string","binary","base64"]]']
- program: 'tobinary | encodeas("bytearray") | [type,stringtype,encoding]'
  input: '"foo"'
  output: ['["string","binary","bytearray"]']
```
I think I'm going to make it so [...]

Or maybe not, maybe for type inference someday we just add type assertions, and leave [...]
I feel like I should point out that if the existing encoding issues are fixed with #2314, the [...]

With those changes, I think it would make more sense to simply emit the binary strings in the same way. If someone wants to emit base64, [...]

It also feels a bit strange to me that [...]
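For reference, jq already has explicit base64 filters, so opting in on the producing side is a one-filter change:

```jq
"foo"  | @base64    # => "Zm9v"
"Zm9v" | @base64d   # => "foo"
```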
You still need [as yet non-existent] software to decode the binary-as-WTF-8b encoding. That might be loss-less, but it's not convenient. It's also not standards-compliant in any way. Sure, someone could standardize WTF-8b, but that's far off in the future.

What I'm doing here does not preclude adding WTF-8 and your WTF-8b. We're not in competition. Since WTF-8b is not useful to me, I am not interested in pursuing it myself at this time, though your PR rebased onto this one (if we merge this one!) would be acceptable. Your comment isn't addressing my question about whether having [...]

Using WTF-8b to represent arbitrary binary wouldn't be easy either, since jq programs would have to implement that decoding themselves or use a decoder provided by someone else (maybe a builtin). Whereas using binary as in this PR is not unlike binary in any other language: it's just bytes. Or did I misunderstand your PR completely?
One can always explicitly pick an encoding, that's true. But I want to be able to work with binary and not have to update the place where the final value will be placed wherever it goes to encode it. I believe that having a "lazy" output encoding option as part of the [...]

Yes, I could get rid of [...]
Short of having schema that can tell us that values at certain paths are X-encoded binary values, that can't be helped. And that's true even if I remove all of the lazy output encoding functionality. At least with lazy output encoding I can make the output side easier to code in jq programs, even though I can't do anything about the input side without adding a scope-engiantening schema support -- practical considerations dictate that I cannot implement schema-awareness into jq, but nothing precludes that being implemented by someone eventually.

Indeed, one could jq-code a schema-aware utility that applies [...]
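A minimal sketch of such a utility, assuming only this PR's `encodeas/1`; the helper name and the path-list shape are made up here:

```jq
# Apply an output encoding at each schema-specified path.
def encode_at($paths; $enc):
  reduce $paths[] as $p (.; setpath($p; getpath($p) | encodeas($enc)));

# e.g. {"id": 1, "blob": $bin} | encode_at([["blob"]]; "base64")
```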
(Force-pushed 0bbc021 to 8880187.)
Though getting rid of the bytearray encoding is certainly one good answer to my question about [...]

Choices I have: [...]

I may be happy with all three of these. Maybe I'm missing other options?
Sorry, I meant to imply that my preference would be for it to emit UTF-8 as proposed above (I'm assuming your proposal above means to make [...]
I think there might be a misunderstanding. The use of WTF-8/WTF-8b is really an implementation detail, and the reason it uses those encodings instead of plain bytes is to additionally support encoding UTF-16 errors (so JSON such as [...]).

Personally, I feel like preserving 8-bit binary data (invalid UTF-8) is more useful than preserving 16-bit binary data (invalid UTF-16), so the use of WTF-8 isn't that important to me (my PR could be modified to simply store the 8-bit data as is), but since JSON itself allows encoding arbitrary UTF-16 data it seems more faithful to JSON to preserve those errors.
In this PR I'm only interested in binary data -as in protocol buffers, DER, XDR, etc-, not WTF-8 or any flavor of "broken UTF-8". Binary is just: an array of 8-bit bytes. And I'm only interested in binary with an efficient implementation, meaning: in-memory literally an array of bytes, with only O(1) overhead, and definitely no additional encoding (as that would not be O(1)), and with an interface that allows jq code to observe binary data as just a sequence or array of numeric byte values in the range 0..255 (again, efficiently).

WTF-8 is a completely different thing, and it is not intended to represent binary data. WTF-8b would be just an encoding for binary that I would only be interested in using at the edges (input and output), not in memory, and only if there was interest in it elsewhere too. WTF-8b is simply not appropriate as an internal representation for binary data, and it is especially not appropriate to ask jq programmers to expect binary data to be represented as WTF-8b.

The last sentence in the above quote is the reason that WTF-8b cannot satisfy my requirements for binary data. You can argue that my requirements are wrong, but most programming languages that support binary data do it in an efficient manner w/o additional encoding, and they do so for the obvious reason that it is most ergonomic and most efficient.
I think the "but since JSON itself..." part is a misconception. JSON uses UTF-16-style surrogate pair encoding of escaped codepoints only (and only for codepoints outside the BMP), but only if the sender chooses to escape them (they are NOT required to). JSON is UTF-8 for interchange, and there is no need to escape codepoints outside the BMP. JSON uses UTF-16 surrogate pairs for escaped non-BMP codepoints because ECMAScript does that, and ECMAScript does that for implementation reasons that jq does not suffer from. What's important is that JSON strings are always UTF-8 with some required escapes (e.g., all ASCII control characters, double quotes, etc), but most codepoints do not require escaping. See RFC 8259, section 7. WTF-8 exists to deal with truncated JSON that can leave you with truncated UTF-16 surrogate pairs when non-BMP codepoints were escaped in the original -- WTF-8 is not a generic binary data interchange encoding, let alone a useful internal representation. I'm not opposed to eventually supporting WTF-8 in some way, but it's just not the point of this PR. My only interest in this PR relating to WTF-8 is that this PR not preclude use of WTF-8 for external purposes, and also not to preclude its use internally to represent WTF-8 strings that have been input as WTF-8 from external sources. I believe this PR indeed does not preclude any of that. If you believe this PR does preclude that, please let me know how it does that. |
@nicowilliams - I'm looking at the revised documentation for [...]

This is a bit too terse. First, it makes no mention that the input is expected to be a string; more important, it suggests that e.g. [...]

So perhaps the text should read s.t. like: [...]
I don't think this is correct.
(EDIT: Oops, I misinterpreted the context of the above statement; they are using "character" consistently there, but I think the overall point still stands.)

The same section also explains that any "character" in the BMP range[0] can be represented using the [...]
Implementations of JSON that use UTF-16, such as in JavaScript, normally accept unpaired surrogate escapes as code units without replacement: [...]
JavaScript also emits unpaired surrogate escapes, though admittedly it has only done this relatively recently (within the last 6 years), and [...]

Just to be clear, I'm not suggesting that JSON implementations need to handle unpaired surrogates, but they are technically valid JSON, just as [...]

[0] As an aside, UTF-16 surrogates are technically part of the BMP (https://unicode.org/roadmaps/bmp/), since Unicode allocates code points (but not abstract characters) for them, and like every other plane, the BMP contains exactly 65,536 code points. This point is probably not particularly relevant, since I'm not sure the RFC authors were concerned with this level of precision of Unicode terminology.
I'm not interested in further debating on this PR whether RFC 8259 or ECMA 404 allow unpaired surrogates in any way. That topic is just not germane to this PR's topic: binary strings. |
In the past I've wanted to support binary blobs by treating them as arrays of small integers. I started a small experiment today and it looks to me like adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array, especially if we were to have the ability to use `.[]` to iterate (which would give us a streaming version of `explode`).

The goal is to be able to work with a) binary, non-text data, and b) mangled UTF-8, such as WTF-8. For example of (a), one could try to write a codec for CBOR and other binary JSON formats, or ASN.1 DER, or protocol buffers, or flat buffers, etc.
I'd like to add the fewest possible command-line options, possibly none.
So here's the rough idea here, which this PR right now barely sketches:

- represent bytes `0x80`-`0xff` as overlong UTF-8 sequences
- `tobinary/1`, which makes a binary out of a stream of bytes that will be an error to output if it's not valid UTF-8
- `tobinary/0`, which makes a binary out of a string (this may seem silly, but `.[]` on strings should output a stream of Unicode codepoints, while `.[]` on binaries should output a stream of bytes)
- `encodeas/1`, which sets the encoding for the given value (currently only for strings and binary strings) to one of `"UTF-8"`, `"base64"`, or `"bytearray"`
- `encoding/0`, which outputs the output encoding of its input string/binary value
- make `tostring/0` work with binary strings of all types doing the usual bad codepoint replacement thing
- functions like `input` and `inputs`, but which let one read raw inputs, JSON w/ WTF-8, etc.

The current state of this PR is pretty poor -- just a sketch, really. Here's the TODO list:
- [ ] meld with the `JVP_FLAGS` thing done for numbers?
  - [ ] make string kinds (UTF-8, binary) and output encoding flags (base64, array of bytes, ...) `JVP_FLAGS`, or
  - [ ] move `JVP_FLAGS` to the `pad_` char field of `jv` that would now be called flags or subkind
- [x] `jv_binary_*()` functions
- [x] `jv_get_string_kind()`
- [x] make `jv_string_concat()` and others work with binary
- [x] let `.[]` iterate the codepoints in a string
- [x] let `.[]` iterate the bytes in a binary string
- [x] let `.[$index]` address the `$index`th codepoint in `.` if it's a string (see commentary below)
- [x] let `.[$index]` address the `$index`th byte in `.` if it's a binary blob
- [x] base64 output encoding (`encodeas("base64")`, the default)
- [x] hex output encoding (`encodeas("hex")`)
- [x] byte array output encoding (`encodeas("bytearray")`)
- [ ] `--raw-output-binary` mode
- [ ] WTF-8? (punt for now)
- [ ] WTF-8b? (punt for now)
- [ ] other encodings? (punt for now)
- [ ] flatten arrays like `[0,[[[1,2],3],4],5]` in converting to binary
- [x] `stringtype/0`
- [x] `tobinary/0`
- [x] `encodeas/1`
- [x] `encoding/0`
- [x] `tobinary/1`
- [ ] add `towtf8/1` (punt for now)
- [ ] add `towtf8/0` (punt for now)
- [ ] add `tobase64/0`
- [ ] the `@base64d` base64 decoder should produce binary as if by `tobinary_utf8`
- [ ] add a `frombase64/0` that only produces binary to avoid having to check if the result is valid UTF-8
- [x] make `tostring/0` accept binary strings and do bad codepoint replacement as usual
- [ ] add a family of functions like `input` and `inputs`, but w/ caller control over the input formats (this is pretty ambitious, possibly not possible) (let's leave this for later)
- [ ] add binary literal forms (`b"<hex-encoded>"`, `b64"<base64-encoded>"`)? (not strictly needed, since one could use `"<base64-encoded>"|frombase64` or some such, and we could even make the compiler constant-fold that) (let's leave this for later)
- [ ] `--raw-output-binary` mode
- [ ] `--raw-input-mode BLOCK-SIZE` mode that produces binary `input` and `inputs`, with the default output encoding, reading raw binary strings of up to `BLOCK-SIZE` bytes (and if `--slurp` is given, concatenate all the blocks and run the jq program on the one slurped input)
- [ ] [...] `tobinary` and `encodeas`
Questions:

- Is `.[]` for strings a bad idea? (A: Apparently yes. See commentary below.)

EDIT: We already have string slicing. Adding string indexing and iteration seems to complete the picture.