Binary strings #2736

Open: wants to merge 5 commits into base: master

nicowilliams
Contributor

@nicowilliams nicowilliams commented Jul 20, 2023

In the past I've wanted to support binary blobs by treating them as arrays of small integers. I started a small experiment today and it looks to me like adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array, especially if we were to have the ability to use .[] to iterate (which would give us a streaming version of explode).

The goal is to be able to work with a) binary, non-text data, and b) mangled UTF-8, such as WTF-8. As an example of (a), one could try to write a codec for CBOR and other binary JSON formats, or ASN.1 DER, or protocol buffers, or flat buffers, etc.

I'd like to add the fewest possible command-line options, possibly none.

So here's the rough idea here, which this PR right now barely sketches:

  • binary data should be a sub-type of string
  • maybe there should be multiple sub-types of string where the sub-type denotes a) the kind of content, b) what to do on output:
    • we could have binary that is output in base64
    • we could have binary that is an error to output
    • we could have WTF-8 that is output as-is
    • we could have @Maxdamantus' WTF-8b that attempts to encode 0x80-0xff as overlong UTF-8 sequences
  • add tobinary/1 which makes a binary out of a stream of bytes that will be an error to output if it's not valid UTF-8
  • add tobinary/0 which makes a binary out of a string (this may seem silly, but .[] on strings should output a stream of Unicode codepoints, while .[] on binary strings should output a stream of bytes)
  • add an encodeas/1 which sets the encoding for the given value (currently only for strings and binary strings) to one of "UTF-8", "base64", or "bytearray"
  • add encoding/0 which outputs the output encoding of its input string/binary value
  • make tostring/0 work with binary strings of all types doing the usual bad codepoint replacement thing
  • add a family of builtins that are like input and inputs, but which let one read raw inputs, JSON w/ WTF-8, etc.
  • add a command-line option(s) for input forms
    • one option to read raw input as binary
    • one option to read raw input as binary delimited by some byte value

The current state of this PR is pretty poor -- just a sketch, really. Here's the TODO list:

  • [ ] meld with the JVP_FLAGS thing done for numbers?
    • [ ] make string kinds (UTF-8, binary) and output encoding flags (base64, array of bytes, ...) JVP_FLAGS, or
    • [ ] move JVP_FLAGS to the pad_ char field of jv that would now be called flags or subkind
  • add jv_binary_*() functions
  • add a jv_get_string_kind()
  • let jv_string_concat() and others work with binary
  • let .[] iterate the codepoints in a string
  • let .[] iterate the bytes in a binary string
  • [x] let .[$index] address the $indexth codepoint in . if it's a string (see commentary below)
  • let .[$index] address the $indexth byte in . if it's a binary blob
  • implement JSON encoder options
    • base64 (encodeas("base64"), the default)
    • hex (encodeas("hex"))
    • array of byte values (encodeas("bytearray"))
      • properly indent these arrays (currently they're always compact)
    • convert to UTF-8 with bad character mappings
    • no encoding in --raw-output-binary mode
    • [ ] WTF-8? (punt for now)
    • [ ] WTF-8b? (punt for now)
    • [ ] other encodings? (punt for now)
  • support flattening of arrays of bytes and arrays of .. bytes (e.g., [0,[[[1,2],3],4],5]) in converting to binary
  • add stringtype/0
  • add tobinary/0
  • add encodeas/1
  • add encoding/0
  • add tobinary/1
  • [ ] add towtf8/1
  • [ ] add towtf8/0 (punt for now)
  • [ ] add tobase64/0 (punt for now)
  • @base64d base64 decoder should produce binary as if by tobinary_utf8
  • [ ] add a frombase64/0 that only produces binary to avoid having to check if the result is valid UTF-8
  • make tostring/0 accept binary strings and do bad codepoint replacement as usual
  • [ ] add a family of functions like input and inputs, but w/ caller control over the input formats (this is pretty ambitious, possibly not possible) (let's leave this for later)
  • [ ] add binary literal forms (b"<hex-encoded>", b64"<base64-encoded>")? (not strictly needed, since one could use "<base64-encoded>"|frombase64 or some such, and we could even make the compiler constant-fold that) (let's leave this for later)
  • --raw-output-binary mode
  • --raw-input-mode BLOCK-SIZE mode that produces binary input and inputs, with the default output encoding, reading raw binary strings of up to BLOCK-SIZE bytes (and if --slurp is given, concatenate all the blocks and run the jq program on the one slurped input)
  • add docs
    • add mantests of tobinary and encodeas
    • add mantests of string codepoint iteration and indexing, and binary string byte iteration and indexing
  • add shtests
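
The TODO item above about flattening nested arrays of bytes when converting to binary can be illustrated with a small Python model. This is only a sketch of the intended semantics, not the jv implementation; the function name `flatten_to_bytes` is hypothetical, and rejecting anything outside 0..255 is an assumption based on the "not a valid byte value" error discussed in this thread.

```python
def flatten_to_bytes(value) -> bytes:
    """Flatten arbitrarily nested lists of byte values into one bytes object."""
    out = bytearray()

    def walk(item):
        if isinstance(item, list):
            for child in item:
                walk(child)
        elif isinstance(item, int) and not isinstance(item, bool) and 0 <= item <= 255:
            out.append(item)
        else:
            # Assumption: anything that is not an integer in 0..255 is an error.
            raise ValueError(f"{item!r} is not a valid byte value")

    walk(value)
    return bytes(out)


# The nested example from the TODO list flattens to six bytes.
print(list(flatten_to_bytes([0, [[[1, 2], 3], 4], 5])))
```

With the example from the list, `[0,[[[1,2],3],4],5]` becomes the byte sequence 0 through 5 in order.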

Questions:

  • is this a bad idea?
  • is .[] for strings a bad idea? (A: Apparently yes. See commentary below.)
    EDIT: We already have string slicing. Adding string indexing and iteration seems to complete the picture.
  • what's missing?
    • A binary input mode.

@pkoppstein
Contributor

adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array

I'm concerned that the former approach (making blobs behave like strings) will be troublesome or confusing, basically because "strings" are already troublesome enough. (JSON? Raw? Sequences of codepoints? Valid UTF8? Invalid?)

Consider for example .[]. You have envisioned string[] as yielding a stream of strings, which is useful and intuitive, but if binary data is represented in a way that makes it "behave like a string", wouldn't that mean that however blob[] is defined, it will be problematic in one way or another? If it iterates bytes as you envision, then string[] would have to iterate the codepoints.

Well, maybe that wouldn't be so bad, but consider another example: length. The length of a blob would surely just be the length of the corresponding array.

If the main goal of supporting blobs is to have a highly compact way of storing large arrays of small integers in a way that allows for various operations and transformations to be implemented efficiently, then your original intuition seems to me correct.

No doubt I'm missing something important, but this does seem like a fine opportunity to plug for string[i] as shorthand for string[i: i+1] :-)

@Maxdamantus

add binary literal forms (b"", b64"")? (not strictly needed, since one could use ""|frombase64 or some such, and we could even make the compiler constant-fold that)

As a possible alternative to the hex notation, I think supporting "\xHH" notation in string literals would be useful (I didn't want to add it in my PR because I wanted to avoid adding new features). I suspect it shouldn't be allowed in actual JSON string literals (since the notation is not allowed in JSON [0]), but it could be allowed in jq string literals.

[0] Though I have wondered if it could make sense to have a flag to allow it and also to emit it, instead of emitting the illegal UTF-8 bytes. This is actually my biggest gripe against JSON: the fact that it has a notation for representing arbitrary sequences of UTF-16 code units but not arbitrary sequences of UTF-8 code units. Perhaps with some hindsight, there could have been an expectation for UTF-16 systems to interpret "\xHH" sequences as UTF-8 just as UTF-8 systems interpret "\uHHHH" sequences as UTF-16.

@Maxdamantus

Maxdamantus commented Jul 20, 2023

what's missing?

While tobinary would be catering to 8-bit string processing (where the indexing operations work as in C, Go, Rust), it might also be worth adding something that caters to 16-bit string processing (where the indexing operations work as in JavaScript or Java). This would probably be a matter of adding tobinary16 in parallel (tobinary could actually be tobinary8 if we want to be extremely clear).

("💩" | length) == 1 # like in Python
("💩" | tobinary8 | length) == 4 # like in C
("💩" | tobinary16 | length) == 2 # like in JavaScript
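
For readers who want to check the three counts, here is a minimal Python sketch. It only reproduces the arithmetic; `tobinary8` and `tobinary16` are names proposed in this comment, not existing jq builtins, and the helper name `lengths` is made up for illustration.

```python
def lengths(text: str) -> dict:
    """Count the same text in codepoints, UTF-8 bytes, and UTF-16 code units."""
    return {
        "codepoints": len(text),                            # like `length` on a jq string
        "utf8_bytes": len(text.encode("utf-8")),            # what tobinary8 would count
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # what tobinary16 would count
    }


print(lengths("\U0001F4A9"))  # U+1F4A9, the emoji used in the example above
```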

@nicowilliams
Contributor Author

adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array

I'm concerned that the former approach (making blobs behave like strings) will be troublesome or confusing, basically because "strings" are already troublesome enough. (JSON? Raw? Sequences of codepoints? Valid UTF8? Invalid?)

I'm not following.

Consider for example .[]. You have envisioned string[] as yielding a stream of strings, which is useful and intuitive, but if binary data is represented in a way that makes it "behave like a string", wouldn't that mean that however blob[] is defined, it will be problematic in one way or another? If it iterates bytes as you envision, then string[] would have to iterate the codepoints.

Yes, $string[] should, will, and in this draft PR as it stands now does indeed iterate codepoints -- just like explode, but streaming.

$blob[] would iterate bytes.

[$blob[]] and [$string[]] would be like explode.

Well, maybe that wouldn't be so bad, but consider another example: length. The length of a blob would surely just be the length of the corresponding array.

The length of a blob would be the number of bytes in it, not codepoints or anything else. A binary datum being binary, colloquially meaning an array of bytes, this is the only natural thing to do.

If the main goal of supporting blobs is to have a highly compact way of storing large arrays of small integers in a way that allows for various operations and transformations to be implemented efficiently, then your original intuition seems to me correct.

Certainly there's nothing unnatural about representing blobs as arrays of bytes. But I think there's nothing unnatural about representing them as non-UTF-8 strings too, and in terms of what I would have to do to src/jv.c, I think the latter is better than the former.

Ah, that's another thing, we currently have string slice syntax "foo"[0:1] ("f"), but we have neither string iteration ("foo"[]) nor string indexing ("foo"[1]). Indexing of strings would have to be as for string slices: by codepoint number, not by byte number. Indexing of binary blobs would have to be by byte number.

If we had .[], .[$index], and .[$start:$end] for strings and blobs then they would feel a lot like arrays. The only thing is that iterating/indexing strings cannot be a path expression (and in the current state of this PR it happens to not be a path expression, so that's good and done).

So I think simply adding iteration and indexing support for strings and blobs is enough to get the semantics I'd originally had in mind for binary as arrays of small integers, but with the benefit that there would be no concerns like "what happens if you have a binary (array of bytes) and try to append or set a value that is not a byte value?".

Also, thinking about it, representing binary blobs as arrays of bytes would have presented difficulties w.r.t. path expressions. Since string iteration/indexing wouldn't contribute to path expressions, I now think it's more natural to represent binary as a sub-type of strings. Also, in other languages binary is typically string-like, at least as to literal value syntax.

No doubt I'm missing something important, but this does seem like a fine opportunity to plug for string[i] as shorthand for string[i: i+1] :-)

Yes! I was missing that. I'll add it.

@nicowilliams
Contributor Author

what's missing?

While tobinary would be catering to 8-bit string processing (where the indexing operations work as in C, Go, Rust), it might also be worth adding something that caters to 16-bit string processing (where the indexing operations work as in JavaScript or Java). This would probably be a matter of adding tobinary16 in parallel (tobinary could actually be tobinary8 if we want to be extremely clear).

("💩" | length) == 1 # like in Python
("💩" | tobinary8 | length) == 4 # like in C
("💩" | tobinary16 | length) == 2 # like in JavaScript

UTF-16 is proof that -or at least very strongly suggestive of- time machines don't exist, and never will exist, or are/will be too expensive to use, or that fear of paradoxes will limit their use to just observation.

UTF-16 needs to die in a fire, and if jq not supporting it helps it die, so much the better!

Now, more seriously, if we had a byteblob binary type, we would also then be able to write jq code that uses that to implement UTF-16. Having a string sub-type that is UTF-16 might have some value, but I would like first to get experience with byte blobs before we add UTF-16 support.

@nicowilliams
Contributor Author

As a possible alternative to the hex notation, I think supporting "\xHH" notation in string literals would be useful (I didn't want to add it in my PR because I wanted to avoid adding new features). I suspect it shouldn't be allowed in actual JSON string literals (since the notation is not allowed in JSON [0]), but it could be allowed in jq string literals.

Indeed, I'm not interested in innovating in JSON. Having participated in IETF threads with thousands of posts about publishing RFC 7259, I'm not inclined to believe that we could alter JSON to support binary, and I do not relish the thought of repeating that experience.

[0] Though I have wondered if it could make sense to have a flag to allow it and also to emit it, instead of emitting the illegal UTF-8 bytes. This is actually my biggest gripe against JSON: the fact that it has a notation for representing arbitrary sequences of UTF-16 code units but not arbitrary sequences of UTF-8 code units. Perhaps with some hindsight, there could have been an expectation for UTF-16 systems to interpret "\xHH" sequences as UTF-8 just as UTF-8 systems interpret "\uHHHH" sequences as UTF-16.

With string sub-types indicating output options we could certainly allow oddball, not-quite-JSON formats like JSON w/ WTF-8, but for true binary I am only interested in either emitting errors or auto-base64-encoding for now. Eventually something like WTF-8b would indeed allow encoding of binary as something very close to UTF-8, if not actually UTF-8 (like, if we used private use codepoints to represent the WTF-8 encoding of broken surrogates then the result could be true UTF-8 rather than WTF-8). But even here we'd be stepping on the Unicode Consortium's toes -- it would be much much better, but also much much harder, to get the UC to allocate 128 codepoints for this purpose and then define something like a Unicode encoding of binary data.

So you can see I'm reluctant to innovate on the JSON side and the Unicode fronts. I'm not resolutely opposed to it though: we could have command-line options to enable these for input/output, and we could label them experimental. But I'd like to get something a bit more standards-compliant done first.

@nicowilliams
Contributor Author

I now see that this approach and my old "binary as array of small integers" idea are... remarkably similar. The differences are:

  • what type is reported by type ("array" vs "string")
  • how binary blobs are encoded on output (array of bytes vs several options possibly including array of bytes)

As long as we add .[] and .[$index] for both, UTF-8 strings and binary strings, binary blobs as array of bytes or binary blobs as strings will work very much the same way.

@pkoppstein
Contributor

... remarkably similar.

Hmmm. That's largely what I was trying to say :-)

But let me outline two radical variations of the "blob as array of bytes" idea.

For brevity, I'll use $ab to signify a JSON array of integers in range(0;256).

The two variants are:

  1. jq adopts a convention such as identifying JSON objects having the form {"class": "blob", "value": $ab} with elements of what we can think of as "class blob". This would allow for efficient handling of blobs, and provide a model for handling of non-JSON "types" in future.
  2. jq manages a quasi-hidden "is-a-blob" flag on arrays, and provides a bunch of new filters, e.g. for reporting whether an array is an $ab, and for transforming an $ab to other representations. (Many of these new filters would raise an error, e.g. if the input is expected to be an $ab but isn't, or if there's something about the $ab that prevents the requested transformation.)

Of course, both techniques can be used if one wants to support both binary8 and binary16.
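
To make the first variant concrete, here is a hedged Python sketch of how a value could be recognized as belonging to "class blob". The helper names are hypothetical, and requiring exactly the two keys "class" and "value" is an assumption; the comment above only gives the general form of the object.

```python
def is_byte_array(value) -> bool:
    """True for a list of integers in range(0; 256), i.e. an $ab."""
    return isinstance(value, list) and all(
        isinstance(b, int) and not isinstance(b, bool) and 0 <= b <= 255
        for b in value
    )


def is_blob_object(value) -> bool:
    """True for objects of the form {"class": "blob", "value": $ab}."""
    return (
        isinstance(value, dict)
        and set(value) == {"class", "value"}  # assumption: no extra keys allowed
        and value["class"] == "blob"
        and is_byte_array(value["value"])
    )


print(is_blob_object({"class": "blob", "value": [102, 111, 111]}))
```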

@Maxdamantus

Maxdamantus commented Jul 21, 2023

But even here we'd be stepping on the Unicode Consortium's toes -- it would be much much better, but also much much harder, to get the UC to allocate 128 codepoints for this purpose and then define something like a Unicode encoding of binary data.

It's not really possible to correctly do it this way. The only correct way to encode invalid Unicode in such a way that valid Unicode is passed through unchanged (ie, all valid UTF-8 strings have the same meaning in WTF-8) is to encode the ill-formed Unicode sequences into ill-formed Unicode sequences.

WTF-8 works by encoding ill-formed UTF-16 (unpaired surrogates) into invalid[0] UTF-8 (invalid "generalised UTF-8" encodings of UTF-16 surrogate code points). Any valid UTF-16 already has a corresponding valid UTF-8 encoding, and vice versa—these encodings can't be reused.

The "WTF-8b" extension additionally encodes ill-formed UTF-8 bytes as other invalid UTF-8 bytes. This includes all WTF-8-specific and WTF-8b-specific sequences (it's fundamentally not possible for this process to be idempotent, since it should not be possible to generate encoded UTF-16 errors from UTF-8 binary data).

If ill-formed Unicode is encoded as valid Unicode, it won't be distinguishable from previously valid Unicode. It would be particularly incorrect to emit invalid Unicode in response to certain valid Unicode (eg, text that happens to contain these 128 hypothetical code points—they would still be Unicode scalar values, so they can appear in valid UTF-8 or UTF-16 text ... or binary data that just happens to look like such UTF-8 text).

[0] I'm distinguishing here between "ill-formed" and "invalid", where "invalid" bytes would be an ill-formed sequence that never occurs as a substring of valid Unicode text—these forms are particularly useful in WTF-8/WTF-8b since they can not be generated accidentally through string concatenation
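
The lone-surrogate case can be checked in Python, whose `surrogatepass` error handler emits the same "generalised UTF-8" bytes for an unpaired surrogate. This is a sketch for illustration only: it is not a full WTF-8 encoder (it does not merge surrogate pairs into one scalar value) and it says nothing about WTF-8b. The helper names are made up.

```python
def encode_generalized_utf8(text: str) -> bytes:
    """Encode text as UTF-8, letting lone surrogates through as 3-byte sequences."""
    return text.encode("utf-8", "surrogatepass")


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes under strict UTF-8 rules."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone = encode_generalized_utf8("\ud800")       # unpaired high surrogate
valid = encode_generalized_utf8("\U0001F4A9")  # ordinary scalar value

print(lone.hex(), is_valid_utf8(lone))
print(valid == "\U0001F4A9".encode("utf-8"), is_valid_utf8(valid))
```

Valid text encodes exactly as it does in UTF-8, while the unpaired surrogate comes out as bytes that a strict UTF-8 decoder rejects, which is the property described above.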

@nicowilliams
Contributor Author

It's not really possible to correctly do it this way. The only correct way to encode invalid Unicode in such a way that valid Unicode is passed through unchanged (ie, all valid UTF-8 strings have the same meaning in WTF-8) is to encode the ill-formed Unicode sequences into ill-formed Unicode sequences.

That works provided other systems understand it. Encoding non-UTF-8 as valid UTF-8 with special codepoints also only works if other systems understand how to decode that, but it has the advantage that other systems that do not know how to decode it will pass it through unmolested.

@Maxdamantus

Maxdamantus commented Jul 21, 2023

That works provided other systems understand it

I think the purpose of any such encoding should only be for internal use. Except for debugging purposes, it should preferably not be possible to observe the internal byte representation of these strings.

I think a reasonable way of exposing the error bytes/surrogates would be as negative code points when iterating, eg:
("foo\xFF\uD800💩" | explode_raw) == [102, 111, 111, -255, -55296, 128169]
This way the errors can be detected with a simple . < 0 check, and they can also be passed as-is to an inverse implode_raw operation. Come to think of it, maybe my PR should also be doing this using the internal WTF-8b iteration function (it currently only uses negative code points for denoting the UTF-8 errors, not the UTF-16 errors).
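
A rough Python model of the negative-code-point idea follows. It is a sketch under assumptions: `explode_raw` is only the name proposed in this comment, the input is taken to be raw bytes in which unpaired surrogates are stored as 3-byte generalised UTF-8, and the decoder is a simple trial loop rather than the WTF-8b iteration function mentioned above.

```python
def explode_raw(data: bytes) -> list:
    """Decode bytes to code points, reporting errors as negative numbers."""
    out = []
    i = 0
    while i < len(data):
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                ch = chunk.decode("utf-8", "surrogatepass")
            except UnicodeDecodeError:
                continue  # not a complete sequence at this width, try a longer one
            cp = ord(ch)
            # Surrogates are UTF-16 errors: report them negated.
            out.append(-cp if 0xD800 <= cp <= 0xDFFF else cp)
            i += width
            break
        else:
            # No width decoded: this byte is a UTF-8 error, report it negated.
            out.append(-data[i])
            i += 1
    return out


raw = b"foo" + b"\xff" + b"\xed\xa0\x80" + "\U0001F4A9".encode("utf-8")
print(explode_raw(raw))
```

For the input from the example this yields three ordinary code points, then -255 for the stray byte, -55296 for the unpaired surrogate U+D800, and 128169 for the emoji, so a `. < 0` test is enough to find the errors.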

@nicowilliams
Contributor Author

@leonid-s-usov I'm trying to understand the jvp flags thing you added. What is the intent regarding adding new flags? Why not use the pad_ field for flags and leave the kind field alone?

@nicowilliams
Contributor Author

1. jq adopts a convention such as identifying JSON objects having the form {"class": "blob", "value": $ab} with elements of what we can think of as "class blob".  This would allow for efficient handling of blobs, and provide a model for handling of non-JSON "types" in future.

jq will only ever support JSON types -- new value types can't be added because they can't be represented in JSON. jq could add a typing mechanism that amounts to JSON schemas for data, and maybe typing for jq functions too (so we could do typechecking), but this is all way beyond the scope of this PR, and I don't think binary data support should wait for any of that.

If we had to have a notion of "class" I'd do it a bit like Perl5: provide a way to bless a JSON object (and maybe arrays too) with a "class" and add a way to find out the class of one, but with the addition of a JSON schema and validation. Obviously a lot can be debated there, but definitely jq cannot add new value types.

2. jq manages a quasi-hidden "is-a-blob" flag on arrays, and provides a bunch of new filters, e.g. for reporting whether an array is an $ab, and for transforming an $ab to other representations. (Many of these new filters would raise an error, e.g. if the input is expected to be an $ab but isn't, or if there's something about the $ab that prevents the requested transformation.)

String or array makes little difference now, but I much prefer string now. Again, the only real difference now would be what type reports. Internally (i.e., in the jv API, and in the implementation of the EACH, EACH_OPT, INDEX, and INDEX_OPT instructions) though I am now convinced that binary should be a flavor of string not of array.

Maybe someone can make a convincing argument that allowing .[] and .[idx] for strings breaks backwards compatibility seriously enough that we shouldn't have type for binary return "string". Certainly one could make an argument that it breaks backwards compatibility. For example one could use the fact that "string"[] raises an error as a way to check whether the type of a value is an iterable, but I wouldn't find that example convincing because we do provide type.

Of course, both techniques can be used if one wants to support both binary8 and binary16.

There's no reason that 16-bit word strings couldn't be a flavor of "string" too. If it's UTF-16 then it can be converted to UTF-8 on output. If it's not UTF-16 then it can be base64-encoded or encoded as an array of 16-bit unsigned integers just like 8-bit binary.

So the only arguments I see here are about a) which is more natural for type to report for a binary blob ("string" or "array"), and b) whether it's OK to add .[] and .[index] for values whose type is "string". I don't think (a) is very interesting but I now prefer the answer to be "string" if nothing else because .[] and .[index] on strings will not be path expressions, but they are path expressions for array value inputs, and it'd be rather strange to have some arrays for which they are not path expressions. I do think (b) is mildly interesting, but I don't have examples of how adding .[] and .[index] for strings would be an unacceptable change.

@pkoppstein
Contributor

@nicowilliams wrote:

jq will only ever support JSON types

Precisely. That's the whole point of my two variations. You used the term "flavor", so by all means go with that if you prefer.

To summarize: The first variation basically involves new filters and a convention about JSON objects. These can both be ignored entirely by the user; and if the user ignores them, there will be no impact on the user.

The second variation is even less visible, as there is no convention, just some new filters and some behind-the-scenes stuff.

@nicowilliams nicowilliams force-pushed the binary_strings branch 3 times, most recently from a056ceb to c081f05 Compare July 21, 2023 22:02
@nicowilliams
Contributor Author

nicowilliams commented Jul 21, 2023

@pkoppstein you might want to kick the tires on this. It's starting to be usable!

: ; ./jq -cn '"foob"|tobinary|[type,stringtype]'
["string","binary"]
: ; ./jq -cn '"foob"|tobinary|256+.'
jq: error (at <unknown>): number (256) and string ("Zm9vYg==") cannot be added
: ; ./jq -cn '"foob"|tobinary|.+256'
jq: error (at <unknown>): string ("Zm9vYg==") and number (256) cannot be added because the latter is not a valid byte value
: ; ./jq -cnr '"foob"|tobinary|.+255' | base64 -d | od -t x1
0000000 66 6f 6f 62 ff
0000005
: ; ./jq -cnr '"foob"|tobinary_bytearray|.+255'
[102,111,111,98,255]
: ; ./jq -cnr '"foob"|tobinary_utf8|.+255'
foob�
: ; ./jq -cn '"foob"|tobinary_utf8'
"foob"
: ; ./jq -cn '["foob"|tobinary|(.+255)[]]'
["f","o","o","b","ÿ"]
: ; ./jq -cn '["foob"|tobinary|tostring[]]'
["Z","m","9","v","Y","g","=","="]

Conversions to base64, byte array, or UTF-8 (w/ bad character mapping) happen on output or on tostring.
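
The deferred-encoding behaviour in the transcript above can be modelled in a few lines of Python. This is a sketch of the semantics as described in this thread, not the jv implementation: the class `BinaryString` and its method names are hypothetical, and only the four encodings named for `encodeas` are modelled.

```python
import base64
import json


class BinaryString:
    """Bytes plus an output encoding that is applied only when the value is emitted."""

    ENCODINGS = ("base64", "hex", "bytearray", "UTF-8")

    def __init__(self, data: bytes, encoding: str = "base64"):
        if encoding not in self.ENCODINGS:
            raise ValueError(f"unknown encoding {encoding!r}")
        self.data = bytes(data)
        self.encoding = encoding

    def encodeas(self, encoding: str) -> "BinaryString":
        """Change how the value will be output; the bytes are untouched."""
        return BinaryString(self.data, encoding)

    def __add__(self, byte):
        """Model of `. + 255`: append one byte, reject non-byte numbers."""
        if not (isinstance(byte, int) and 0 <= byte <= 255):
            raise ValueError(f"{byte!r} is not a valid byte value")
        return BinaryString(self.data + bytes([byte]), self.encoding)

    def to_json_value(self):
        """Apply the output encoding, as would happen on output or tostring."""
        if self.encoding == "base64":
            return base64.b64encode(self.data).decode("ascii")
        if self.encoding == "hex":
            return self.data.hex()
        if self.encoding == "bytearray":
            return list(self.data)
        # "UTF-8": validate and map bad bytes to U+FFFD.
        return self.data.decode("utf-8", errors="replace")

    def dumps(self) -> str:
        return json.dumps(self.to_json_value(), separators=(",", ":"), ensure_ascii=False)


blob = BinaryString(b"foob")
print(blob.dumps())
print((blob + 255).encodeas("bytearray").dumps())
print((blob + 255).encodeas("UTF-8").dumps())
```

The point of the design is that appending a byte operates on the raw bytes regardless of the chosen encoding; base64, hex, or array rendering is purely an output concern.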

@nicowilliams nicowilliams force-pushed the binary_strings branch 2 times, most recently from b78a40b to 7db1557 Compare July 21, 2023 22:27
@nicowilliams
Contributor Author

I might punt on WTF-8 and let @Maxdamantus implement that on top of this when this is done :)

@nicowilliams nicowilliams force-pushed the binary_strings branch 2 times, most recently from 6923da2 to 7aecd7c Compare July 22, 2023 02:55
@nicowilliams
Contributor Author

nicowilliams commented Jul 22, 2023 via email

@pkoppstein
Contributor

pkoppstein commented Jul 22, 2023

Try again now

Yay! [Unless you strenuously object, I propose deleting completely useless and ephemeral messages in this thread (and potentially others, too).]

I noticed that you're proposing to extend + to allow both:

tobinary|.+255   #1 

and

tobinary_bytearray|.+255 #2

The other day, you were warning about the perils of polymorphism, so
I'm a bit concerned about both for that kind of reason; more particularly,
though, since you want "binary" to be string-like, you'd expect something like:

tobinary | . + ([255]|tobinary)  #1'

or at least:

tobinary | . + (255|tobinary)  #1''

More importantly, #2 seems quite wrong from a jq-perspective: since a
bytearray prints as an array of integers, one would expect to have to write:

tobinary_bytearray|.+[255]   #2'

@pkoppstein
Contributor

Kicking the tires...

What might be done about the proliferation of unwieldy names?

Since you've introduced tobinary and allow tobinary|tostring,
there's also an element of inconsistency with having to write
tobinary_bytearray and tobinary_utf8 (with more to come?).

Agreed, "tobytearray" and "toutf8" are unreadable at best and
unacceptable at worst, so I was wondering what alternatives there
might be. An underscore? camelCase? Or better, something with a tad
more extensibility, such as defining to/1 so we'd write
to("bytearray") or to("utf8"), etc.

@pkoppstein
Contributor

@nicowilliams wrote:

add a raw binary input mode

Yes, please, even if it's only one such mode (for now)! It would fill a really big gap.

@pkoppstein
Contributor

. as [$x] ?// $x | $x

@itchyny of course makes important points about this expression, and I would agree that allowing "abc" as [$x] would be to extend destructuring beyond what was probably originally envisioned, but I think the conclusions that have been drawn should be reconsidered.

First, the E as [$x] construct is advertised as "destructuring", not "pattern matching".

Second, it should be remembered that in jq 1.6:

   "abc" as [$x] | $x

raises an error. So the new behavior makes a feature out of an error,
which is typical of new features. (That's exactly what the new
behavior of $string[] does.) There is not much of a "backwards
compatibility" issue here, especially as expressions such as the
one in question (. as [$x] ?// $x | $x) are somewhat arcane.

Third, being able to write $string[$i] rather than $string[$i:$i+1] is really
useful, especially when the computation of $i is quite complex.
(In fact, if it turns out that $string[$i] will not be supported, then
I would hope we can come up with an alternative syntax for the same thing.)

Fourth, there is a huge tension within the current version of the binary_strings branch:
on the one hand, it supports $string[], but on the other it does not support $string[0]. This is really quite bizarre, not least because it violates the jq identity that otherwise holds:

[.[range(0;length)]] == [.[]] # if the RHS is supported
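
The identity can be stated as a small Python model, assuming that `.[i]` on a string means the one-codepoint slice `.[i:i+1]` and that iterating a binary string yields byte values. Both are assumptions taken from the proposal in this thread; the function names are hypothetical.

```python
def index(value, i):
    """Model of `.[i]` for in-range, non-negative i."""
    if isinstance(value, str):
        return value[i:i + 1]  # one-codepoint string, same as the slice
    return value[i]            # bytes -> int, list -> element


def iterate(value):
    """Model of `.[]`; a list stands in for jq's output stream."""
    return list(value)


def identity_holds(value) -> bool:
    """Check [.[range(0;length)]] == [.[]] for one value."""
    return [index(value, i) for i in range(len(value))] == iterate(value)


for sample in ("a\u00e9\U0001F4A9", b"foob\xff", [1, "two", None]):
    print(type(sample).__name__, identity_holds(sample))
```

If indexing were supported for arrays and blobs but not for strings, the left-hand side would raise for strings while the right-hand side succeeded, which is the inconsistency being pointed out.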

Of course, if we do "revert" to allowing $string[$i], then the change
of behavior w.r.t. destructuring and ?// should all be documented :-)

@itchyny
Contributor

itchyny commented Jul 26, 2023

In real use cases, users may want to get the first element from response fields whose type is string | array[string]. I think we can separate destructuring and indexing operation. At least that's what I do in gojq; use array indexing for destructuring.

@nicowilliams
Contributor Author

I don't think making string comparison operators form-insensitive would break backwards compatibility.

Good. But your earlier remark that:

I think you're very confused. Form-insensitive comparison means that strings which are distinct under memcmp() but memcmp()-equal after normalization compare as equal. I.e., that which you're arguing against.

The only good argument I can see against form-insensitive string comparisons (and normalizing-to-hash for object key hashing) is a performance argument. By and large all the Unicode strings you'll see are NFC, except that there are things like OS X's HFS+ which decompose into NFD (or something very close to it), so in fact there are ways to get equivalent-but-memcmp()-distinct strings into an application, and so there's utility in having software be able to consider them equal.

ideally UTF-8 string equality should implement comparison with Unicode canonical equivalence by normalizing as needed as it goes

was disconcerting, as it seemed you were saying that canonical equivalence of two JSON strings, X and Y, should imply X == Y #=> true in jq.

See above.

Re: sanctioned notions of equivalence

Sorry, I meant "compatibility equivalence" [*1].

Got it. Yes, well, I wouldn't use NFKC or NFKD for string comparison. I would use NFD because if you're normalizing character-by-character as needed for string comparison, then why canonically-decompose-then-canonically-compose -as NFC requires- when you can just canonically decompose and stop there?
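
The three equivalence relations mentioned in this exchange can be compared side by side in Python using the standard `unicodedata` module. This sketch normalizes whole strings up front, which is simpler than the normalize-as-you-go comparison described above; the function names are made up for illustration.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Byte-for-byte comparison of the UTF-8 encodings, like memcmp()."""
    return a.encode("utf-8") == b.encode("utf-8")


def canonical_equal(a: str, b: str) -> bool:
    """Canonical equivalence: compare after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equal(a: str, b: str) -> bool:
    """Compatibility equivalence: compare after NFKD normalization."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


composed = "\u00e9"     # e with acute as one code point
decomposed = "e\u0301"  # e followed by combining acute accent
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

print(binary_equal(composed, decomposed), canonical_equal(composed, decomposed))
print(canonical_equal(ligature, "fi"), compatibility_equal(ligature, "fi"))
```

The composed and decomposed forms differ as bytes but are canonically equivalent, while the ligature matches "fi" only under compatibility equivalence.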

My point was that there are more than two "officially recognized" equivalence relations: "binary comparison" [*2], "canonical equivalence", and "compatibility equivalence".

I'm quite aware.

Note also that in Python and no doubt others, the three relations are dealt with separately:

binary => `==`
canonical => NF (NFC or NFD)
compatibility => NFK (NFKC or NFKD)

And what are Python's operators for comparison w/ normalization?

I would consider === and variations thereof for this. But assuming we can haz a suitable normalization library that is amenable to the optimization I mentioned (or which has it already), then I'm all for having canonical equivalence.

@nicowilliams
Contributor Author

add a raw binary input mode

Yes, please, even if it's only one such mode (for now)! It would fill a really big gap.

I've a feeling that anything other than slurping all binary input will be a fair bit of work, both for the author and the reviewers.

@nicowilliams
Contributor Author

In real use cases, users may want to get the first element from response fields whose type is string | array[string]. I think we can separate destructuring and indexing operation. At least that's what I do in gojq; use array indexing for destructuring.

That's a good idea. The INDEX in this program comes from gen_array_matcher(), so we can add an alternative version of INDEX, one that doesn't index strings, for gen_array_matcher() to use; then we can have string indexing outside this context! Thank you.

@nicowilliams
Contributor Author

nicowilliams commented Jul 26, 2023

I've added a separate INDEX for gen_array_matcher() to use and restored string indexing. EDIT: That fixes the issue!

(Before looking at how that INDEX was being generated I thought this would be harder. Turns out to be easy enough. Thanks again for the idea, @itchyny!)

@nicowilliams
Contributor Author

Hmm, I want to rewrite the section of the manual on destructuring.

This commit adds `tobinary/0`, which converts a normal string (or an
array of bytes, flattened) to a binary string that will be
base64-encoded on output.  The string will remain unencoded during the
execution of the jq program -- encoding is not applied until the string
is to be output, or until `tostring` is applied.

Also added is an `encodeas/1` that takes a string argument of
`"base64"`, `"hex"`, `"bytearray"`, or `"UTF-8"` and outputs its input
altered so that on output the string will be encoded in base64, hex, as
an array of bytes, or be converted to UTF-8 by applying UTF-8 validation
and bad character mapping.

As well, there is a `tobinary/1` that converts a stream of strings,
numeric unsigned byte values, and arrays of bytes, to a binary.

As well there is an `isbinary/0` which indicates whether the input is
binary.

As well there is a `stringtype/0` which indicates whether a string is
binary or UTF-8.

As well there is an `encoding/0` which indicates the output encoding of
the input.

@pkoppstein
Contributor

slurping all binary input

That by itself would be fantastic!

Contributor

@pkoppstein pkoppstein left a comment

manual.yml -

(1) typo in sentence beginning:

Binary strings are encoded accoring to the encoding selected

(2) The following looks like something went wrong with copy/paste:

Arrays of unsigned byte values and arrays of .. unsigned byte values

(3) "when when"

(4) Change:

Outputs either "UTF-8", "base64", or "bytearray"

to something like:

Outputs the encoding, currently one of "UTF-8", "base64", "bytearray", or "hex",

(5) Please consider breaking up the following "example" into separate, pithier pieces that will avoid unwanted line breaks:

          - program: '[(tostring,tobinary,(tobinary|encodeas("bytearray")),(tobinary|encodeas("UTF-8")))|[type,stringtype,encoding]]'
            input: '"foo"'
            output: ['[["string","UTF-8","UTF-8"],["string","binary","base64"],["string","binary","bytearray"],["string","binary","UTF-8"]]']

For example:

          - program: '[tostring,tobinary | [type,stringtype,encoding]]'
            input: '"foo"'
            output: ['[["string","UTF-8","UTF-8"],["string","binary","base64"]]']


          - program: 'tobinary | encodeas("bytearray") | [type,stringtype,encoding]'
            input: '"foo"'
            output: ['["string","binary","bytearray"]']

@nicowilliams
Contributor Author

I think I'm going to make it so tobinary|encodeas("bytearray")|tostring outputs UTF-8 because I'd like to make sure that tostring always outputs... strings so that someday we can possibly do some type inference. This would necessitate some other builtin to encode as bytearrays, which would be obnoxious, so...

Or maybe not, maybe for type inference someday we just add type assertions, and leave tostring as it is right now in this PR.

@Maxdamantus

Maxdamantus commented Jul 26, 2023

I think I'm going to make it so tobinary|encodeas("bytearray")|tostring outputs UTF-8 because I'd like to make sure that tostring always outputs... strings

I feel like I should point out that if the existing encoding issues are fixed with #2314, the encodeas behaviour might overall be unnecessary. In particular, that PR makes it so regular strings are able to hold arbitrary bytes (treated as invalid UTF-8), and they are printed losslessly as regular JSON string literals (assuming --ascii-output is not used).

With those changes, I think it would make more sense to simply emit the binary strings in the same way. If someone wants to emit base64, @base64 can be used, either on a binary string or a regular string (the main difference between a binary string and a regular string would essentially be what happens when indexing, iterating and checking length), or they want a byte array, tobinary | [.[]] can be used (if the string is already a binary string, tobinary is not needed).

It also feels a bit strange to me that tojson | fromjson will produce a substantially different value: particularly, if the input contains a binary string with the default encodeas("base64"), that string will get turned into base64 in tojson which is interpreted as a substantially different string (of base64 characters, rather than the original bytes) by fromjson. Again with the #2314 fixes, tojson would preserve the bytes in the binary string when emitted as JSON, so it gets interpreted again as substantially the same string (though it will be tagged as a regular string instead of a binary string).
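The round-trip concern can be illustrated without jq. This Python sketch assumes a binary value is emitted as base64 text at output time, as with the default encoding in this PR; the variable names are illustrative.

```python
import base64
import json

original = b"\x00\xffbinary"  # arbitrary bytes, not valid UTF-8

# "tojson": the binary value is base64-encoded when it is serialized.
emitted = json.dumps(base64.b64encode(original).decode("ascii"))

# "fromjson": the parser sees an ordinary string of base64 characters.
reparsed = json.loads(emitted)

print(emitted)                                  # "AP9iaW5hcnk="
print(reparsed == "AP9iaW5hcnk=")               # True
print(base64.b64decode(reparsed) == original)   # True
```

The original bytes come back only if the consumer knows that this particular string holds base64 and decodes it explicitly.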

@nicowilliams
Contributor Author

nicowilliams commented Jul 26, 2023

I feel like I should point out that if the existing encoding issues are fixed with #2314, the encodeas behaviour might overall be unnecessary. In particular, that PR makes it so regular strings are able to hold arbitrary bytes (treated as invalid UTF-8), and they are printed losslessly as regular JSON string literals (assuming --ascii-output is not used).

losslessly

You still need [as yet non-existent] software to decode the binary-as-WTF-8b encoding. That might be loss-less, but it's not convenient. It's also not standards-compliant in any way. Sure, someone could standardize WTF-8b, but that's far off in the future.

What I'm doing here does not preclude adding WTF-8 and your WTF-8b. We're not in competition. Since WTF-8b is not useful to me, I am not interested in pursuing it myself at this time, though your PR rebased onto this one (if we merge this one!) would be acceptable.

Your comment isn't addressing my question about whether having tobinary|encodeas("bytearray")|tostring output not-a-string is a good idea.

Using WTF-8b to represent arbitrary binary wouldn't be easy either, since jq programs would have to implement that decoding themselves or use a decoder provided by someone else (maybe a builtin). Whereas using binary as in this PR is not unlike binary in any other language: it's just bytes.

Or did I misunderstand your PR completely?

It also feels a bit strange to me that tojson | fromjson will produce a substantially different value: particularly, if the input contains a binary string with the default encodeas("base64"), [...]

One can always explicitly pick an encoding, that's true. But I want to be able to work with binary and not have to update the place where the final value will be placed wherever it goes to encode it. I believe that having a "lazy" output encoding option as part of the jv is a useful feature.
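The "lazy" output encoding described here can be modeled outside jq. The following Python sketch is only an illustration under assumptions: the class `BinaryString`, the helper `tobinary`, and the method names are hypothetical, the encoding names are taken from the commit message above, and "bad character mapping" is assumed to mean replacement with U+FFFD.

```python
import base64
from dataclasses import dataclass

ENCODINGS = ("base64", "hex", "bytearray", "UTF-8")


@dataclass(frozen=True)
class BinaryString:
    data: bytes               # raw bytes, untouched while the program runs
    encoding: str = "base64"  # applied only when the value is output

    def encodeas(self, encoding: str) -> "BinaryString":
        # Changes the output tag only; the bytes are not re-encoded here.
        if encoding not in ENCODINGS:
            raise ValueError(f"unknown encoding: {encoding}")
        return BinaryString(self.data, encoding)

    def to_output(self):
        # The one place where encoding actually happens.
        if self.encoding == "base64":
            return base64.b64encode(self.data).decode("ascii")
        if self.encoding == "hex":
            return self.data.hex()
        if self.encoding == "bytearray":
            return list(self.data)
        # "UTF-8": validate, mapping invalid sequences to U+FFFD (assumption).
        return self.data.decode("utf-8", errors="replace")


def tobinary(text: str) -> BinaryString:
    # Hypothetical analogue of tobinary/0 for a normal string input.
    return BinaryString(text.encode("utf-8"))


b = tobinary("foo")
print(b.to_output())                        # Zm9v
print(b.encodeas("hex").to_output())        # 666f6f
print(b.encodeas("bytearray").to_output())  # [102, 111, 111]
```

The design point is that intermediate code passes `BinaryString` values around unchanged, and only the final output step needs to know the encoding.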

tobinary | [.[]]

Yes, I could get rid of bytearray as an encoding because [.[]] applied to a binary value will do just that, but again, a "lazy" output encoding option is a useful feature.

It also feels a bit strange to me that tojson | fromjson will produce a substantially different value: [...]

Short of having a schema that can tell us that values at certain paths are X-encoded binary values, that can't be helped. And that's true even if I remove all of the lazy output encoding functionality. At least with lazy output encoding I can make the output side easier to code in jq programs, even though I can't do anything about the input side without adding scope-engiantening schema support -- practical considerations dictate that I cannot implement schema-awareness into jq, but nothing precludes that being implemented by someone eventually. Indeed, one could jq-code a schema-aware utility that applies @base64dbinary, @hexd, or tobinary/1 to values encoded as base64, hex, or bytearray respectively, and such a utility would also not preclude decoding of WTF-8b.

@nicowilliams
Contributor Author

nicowilliams commented Jul 26, 2023

tobinary | [.[]]

Yes, I could get rid of bytearray as an encoding because [.[]] applied to a binary value will do just that, but again, a "lazy" output encoding option is a useful feature.

Though getting rid of the bytearray encoding is certainly one good answer to my question about tostring, losing the bytearray lazy output encoding is not as convenient as I'd like. I might just do that anyway.

Choices I have:

  • leave it as it is, with tostring outputting not-a-string values (arrays) when the input is a binary with bytearray output encoding
  • remove bytearray encoding
  • make tostring applied to bytearray binary apply a different encoding, and if you really want a bytearray then you should use [.[]] (or if encoding == "bytearray" then [.[]] else tostring end)

I may be happy with all three of these. Maybe I'm missing other options?

@Maxdamantus

Your comment isn't addressing my question about whether having tobinary|encodeas("bytearray")|tostring output not-a-string is a good idea.

Sorry, I meant to imply that my preference would be for it to emit UTF-8 as proposed above (I'm assuming your proposal above means to make tostring on a binary behave like encodeas("UTF-8") | tostring), but I wanted to expand this logic into other areas.

Using WTF-8b to represent arbitrary binary wouldn't be easy either, since jq programs would have to implement that decoding themselves or use a decoder provided by someone else (maybe a builtin). Whereas using binary as in this PR is not unlike binary in any other language: it's just bytes.

Or did I misunderstand your PR completely?

I think there might be a misunderstanding. The use of WTF-8/WTF-8b is really an implementation detail, and the reason it uses those encodings instead of plain bytes is to additionally support encoding UTF-16 errors (so JSON such as "\uD800" is preserved). It could equivalently have used an array of 32-bit integers, where each integer represents a code point, an invalid UTF-8 byte, or an invalid UTF-16 code unit. This would just be less efficient for most uses (since it would use more memory, and it wouldn't be able to do a simple "is it valid UTF-8? okay, just memcpy/fwrite it" on conversion). WTF-8 is never consumed from input or written to output.

Personally, I feel like preserving 8-bit binary data (invalid UTF-8) is more useful than preserving 16-bit binary data (invalid UTF-16), so the use of WTF-8 isn't that important to me (my PR could be modified to simply store the 8-bit data as is), but since JSON itself allows encoding arbitrary UTF-16 data it seems more faithful to JSON to preserve those errors.

@nicowilliams
Contributor Author

Your comment isn't addressing my question about whether having tobinary|encodeas("bytearray")|tostring output not-a-string is a good idea.

Sorry, I meant to imply that my preference would be for it to emit UTF-8 as proposed above (I'm assuming your proposal above means to make tostring on a binary behave like encodeas("UTF-8") | tostring), but I wanted to expand this logic into other areas.

tostring applied to tobinary|encodeas("bytearray") would indeed use some encoding other than bytearray, probably base64, but possibly UTF-8 (I've not decided which).

I think there might be a misunderstanding. The use of WTF-8/WTF-8b is really an implementation detail, and the reason it uses those encodings instead of plain bytes is to additionally support encoding UTF-16 errors (so JSON such as "\uD800" is preserved). It could equivalently have used an array of 32-bit integers, where each integer represents a code point, an invalid UTF-8 byte, or an invalid UTF-16 code unit. This would just be less efficient for most uses (since it would use more memory, and it wouldn't be able to do a simple "is it valid UTF-8? okay, just memcpy/fwrite it" on conversion). WTF-8 is never consumed from input or written to output.

In this PR I'm only interested in binary data -as in protocol buffers, DER, XDR, etc-, not WTF-8 or any flavor of "broken UTF-8". Binary is just: an array of 8-bit bytes. And I'm only interested in binary with an efficient implementation, meaning: in-memory literally an array of bytes, with only O(1) overhead, and definitely no additional encoding (as that would not be O(1)), and with an interface that allows jq code to observe binary data as just a sequence or array of numeric byte values in the range 0..255 (again, efficiently). WTF-8 is a completely different thing, and it is not intended to represent binary data. WTF-8b would be just an encoding for binary that I would only be interested in using at the edges (input and output), not in memory, and only if there was interest in it elsewhere too. WTF-8b is simply not appropriate as an internal representation for binary data, and it is especially not appropriate to ask jq programmers to expect binary data to be represented as WTF-8b.
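As a point of comparison, this is roughly what "just bytes" looks like in a language with a native binary type. The Python sketch below uses only the built-in `bytes` and `memoryview` types; the payload is made up for illustration and is not a real protocol message.

```python
# A made-up 5-byte payload: a tag byte, a length byte, then 3 content bytes.
payload = bytes([0x0A, 0x03, 0x66, 0x6F, 0x6F])

# Indexing and iteration yield plain integers in the range 0..255.
tag = payload[0]
length = payload[1]
byte_values = list(payload)

# A memoryview slices without copying the underlying buffer.
content = memoryview(payload)[2:2 + length]

print(tag, length)     # 10 3
print(byte_values)     # [10, 3, 102, 111, 111]
print(bytes(content))  # b'foo'
```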

The last sentence in the above quote is the reason that WTF-8b cannot satisfy my requirements for binary data. You can argue that my requirements are wrong, but most programming languages that support binary data do so in an efficient manner w/o additional encoding, and they do so for the obvious reason that it is most ergonomic and most efficient.

Personally, I feel like preserving 8-bit binary data (invalid UTF-8) is more useful than preserving 16-bit binary data (invalid UTF-16), so the use of WTF-8 isn't that important to me (my PR could be modified to simply store the 8-bit data as is), but since JSON itself allows encoding arbitrary UTF-16 data it seems more faithful to JSON to preserve those errors.

I think the "but since JSON itself..." part is a misconception. JSON uses UTF-16-style surrogate pair encoding of escaped codepoints only (and only for codepoints outside the BMP), but only if the sender chooses to escape them (they are NOT required to). JSON is UTF-8 for interchange, and there is no need to escape codepoints outside the BMP. JSON uses UTF-16 surrogate pairs for escaped non-BMP codepoints because ECMAScript does that, and ECMAScript does that for implementation reasons that jq does not suffer from. What's important is that JSON strings are always UTF-8 with some required escapes (e.g., all ASCII control characters, double quotes, etc), but most codepoints do not require escaping. See RFC 8259, section 7. WTF-8 exists to deal with truncated JSON that can leave you with truncated UTF-16 surrogate pairs when non-BMP codepoints were escaped in the original -- WTF-8 is not a generic binary data interchange encoding, let alone a useful internal representation.
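The escaping rules in RFC 8259, section 7 can be checked with Python's standard `json` module. A small sketch showing that a non-BMP code point may be sent either literally or as an escaped surrogate pair, and that both decode to the same string:

```python
import json

grinning_face = "\U0001F600"  # a code point outside the BMP

escaped_text = '"\\ud83d\\ude00"'       # JSON text using a \u surrogate pair
literal_text = '"' + grinning_face + '"'  # JSON text with the raw character

from_escaped = json.loads(escaped_text)
from_literal = json.loads(literal_text)

print(from_escaped == from_literal == grinning_face)  # True
print(len(from_escaped))                              # 1

# Escaping is the sender's choice: both of these are valid JSON texts.
print(json.dumps(grinning_face))                      # "\ud83d\ude00"
print(json.dumps(grinning_face, ensure_ascii=False) == literal_text)  # True
```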

I'm not opposed to eventually supporting WTF-8 in some way, but it's just not the point of this PR. My only interest in this PR relating to WTF-8 is that this PR not preclude use of WTF-8 for external purposes, and also not to preclude its use internally to represent WTF-8 strings that have been input as WTF-8 from external sources. I believe this PR indeed does not preclude any of that. If you believe this PR does preclude that, please let me know how it does that.

@pkoppstein
Contributor

@nicowilliams - I'm looking at the revised documentation for encoding, which says simply:

Outputs the encoding set with `encodeas`.

This is a bit too terse. First, it makes no mention that the input is expected to be a string; more important, it suggests that e.g. "abc" | encoding would be an error or empty.

So perhaps the text should read something like:

 Assuming the input is a string, it outputs its encoding as set with `encodeas` or else  `"UTF-8"`.

@Maxdamantus

Maxdamantus commented Jul 27, 2023

I think the "but since JSON itself..." part is a misconception. JSON uses UTF-16-style surrogate pair encoding of escaped codepoints only [...]. See RFC 8259, section 7.

I don't think this is correct. The section you're referring to uses the term "Character" (arguably incorrectly) to mean both code units and abstract characters (quote below from RFC).

Alternatively, there are two-character sequence escape representations of some popular characters.

(EDIT: Oops, I misinterpreted the context of the above statement; they are using "character" consistently there, but I think the overall point still stands)

The same section also explains that any "character" in the BMP range[0] can be represented using the \uXXXX notation, which will include UTF-16 surrogates. It doesn't say that surrogates must be paired. A later section, 8.2 clarifies that they are allowed, but suggests they are not necessarily interoperable (the RFC uses "interoperable" in various places as a goal, not as a restriction on implementations):

However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). [...] The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

Implementations of JSON that use UTF-16, such as JavaScript's, normally accept unpaired surrogate escapes as code units without replacement:

$ node -e 'console.log(JSON.parse("\"\\uDEAD\"").charCodeAt(0).toString(16));'
dead

JavaScript also emits unpaired surrogate escapes, though admittedly it has only done this relatively recently (within the last 6 years), and I might have had some part in suggesting this behaviour:

$ node -e 'console.log(JSON.stringify(String.fromCharCode(0xDEAD)));'
"\udead"

Just to be clear, I'm not suggesting that JSON implementations need to handle unpaired surrogates, but they are technically valid JSON, just as 3.141592653589793238462643383279 is valid JSON (this is another example from the RFC of something that is allowed but might cause interoperability issues—though jq goes out of its way to pass this value through without precision loss).

[0] As an aside, UTF-16 surrogates are technically part of the BMP (https://unicode.org/roadmaps/bmp/), since Unicode allocates code points (but not abstract characters) for them, and like every other plane, the BMP contains exactly 65,536 code points. This point is probably not particularly relevant, since I'm not sure the RFC authors were concerned with this level of precision of Unicode terminology.
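For comparison with the Node.js output above, a sketch of how Python's standard `json` module treats the same inputs. This shows one implementation's behaviour only and says nothing about what jq does or should do.

```python
import json

# An unpaired surrogate escape is accepted and kept as a single code point.
lone = json.loads('"\\udead"')
print(len(lone), hex(ord(lone)))  # 1 0xdead

# It serializes back to the same escape.
print(json.dumps(lone))           # "\udead"

# It cannot be written out as UTF-8, though.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                  # False

# The long decimal from the RFC parses, but as a double it loses digits.
pi = json.loads("3.141592653589793238462643383279")
print(repr(pi))                   # 3.141592653589793
```

Both inputs are accepted as JSON, while the string cannot be re-encoded as UTF-8 and the number is rounded to the nearest double.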

@nicowilliams
Contributor Author

@Maxdamantus

I'm not interested in further debating on this PR whether RFC 8259 or ECMA 404 allow unpaired surrogates in any way. That topic is just not germane to this PR's topic: binary strings.

8 participants