Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cksum: failing GNU tests #2930

Closed
Tracked by #3336
jtracey opened this issue Jan 27, 2022 · 13 comments
Closed
Tracked by #3336

cksum: failing GNU tests #2930

jtracey opened this issue Jan 27, 2022 · 13 comments

Comments

@jtracey
Copy link
Contributor

jtracey commented Jan 27, 2022

The latest version of GNU's tests add tests/misc/cksum, tests/misc/cksum-a, tests/misc/cksum-c, and tests/misc/sm3sum, all of which currently fail. This will track the respective problems from each.

@str4d
Copy link
Contributor

str4d commented Jan 29, 2022

uu_cksum correctly implements cksum 8.30, which only has --help and --version. However, the GNU Coreutils 9.0 docs for cksum list the following options:

  • -a/--algorithm
  • --debug
  • --untagged
  • All options supported by b2sum:
    • -l/--length
    • All options supported by md5sum:
      • -b/--binary (unsupported for cksum, always on)
      • -c/--check
        • cksum auto-detects the algorithm if given data in the --tag format.
        • cksum does not support --check with -a sysv, -a bsd, or -a crc.
      • --ignore-missing
      • --quiet
      • --status
      • --tag
        • cksum uses this as its default output format (use --untagged for old format).
      • -t/--text (unsupported for cksum, always off)
      • -w/--warn
      • --strict
      • -z/--zero

So uu_cksum is correct for Coreutils 8.30 compatibility, but for Coreutils 9.0 compatibility, we need to flesh it out.

@str4d
Copy link
Contributor

str4d commented Jan 29, 2022

Investigating the codebase a bit further, I see that both md5sum and b2sum are implemented via hashsum. This suggests we should make cksum also be implemented by hashsum. I've checked off above all the options currently implemented by hashsum.

@str4d
Copy link
Contributor

str4d commented Jan 29, 2022

Looks like test/misc/b2sum is also failing, both because hashsum doesn't support -l/--length, and cksum doesn't support -a blake2b.

I'm going to try implementing this.

@str4d
Copy link
Contributor

str4d commented Jun 4, 2022

I made a bunch of progress on this and then ran out of time to finish the PRs. Picking this back up now; I'll rebase my current local work and get some initial PRs up.

str4d added a commit to str4d/coreutils that referenced this issue Jun 4, 2022
Several binaries have been added to `hashsum` that have never been part
of GNU Coreutils:

- `sha3*sum` (uutils#869)
- `shake*sum` (uutils#987)
- `b3sum` (uutils#3108 and uutils#3164)

In particular, the `--bits` option, and the `--no-names` option added in
uutils#3361, are not valid for any GNU Coreutils `*sum` binary
(as of Coreutils 9.0).

This commit refactors the argument parsing so that `--bits` and
`--no-names` become invalid options for the binaries intended to match
the GNU Coreutils API, instead of being ignored options. It also
refactors the custom binary name handling to distinguish between
binaries intended to match the GNU Coreutils API, and binaries that
don't have that constraint.

Part of uutils#2930.
str4d added a commit to str4d/coreutils that referenced this issue Jun 4, 2022
Several binaries have been added to `hashsum` that have never been part
of GNU Coreutils:

- `sha3*sum` (uutils#869)
- `shake*sum` (uutils#987)
- `b3sum` (uutils#3108 and uutils#3164)

In particular, the `--bits` option, and the `--no-names` option added in
uutils#3361, are not valid for any GNU Coreutils `*sum` binary
(as of Coreutils 9.0).

This commit refactors the argument parsing so that `--bits` and
`--no-names` become invalid options for the binaries intended to match
the GNU Coreutils API, instead of being ignored options. It also
refactors the custom binary name handling to distinguish between
binaries intended to match the GNU Coreutils API, and binaries that
don't have that constraint.

Part of uutils#2930.
@str4d
Copy link
Contributor

str4d commented Jun 5, 2022

Ugh, this is annoying: b3sum was added in #3108, and its corresponding upstream binary defines -l/--length in bytes, whereas GNU Coreutils b2sum defines -l/--length in bits.

There are three options here:

  • Define -l/--length as bytes to match upstream b3sum. Given that compatibility with GNU Coreutils is being prioritised (per this issue), I think that would not be considered acceptable as b2sum would then never pass GNU tests.
  • Define -l/--length as bits to match GNU Coreutils. I really dislike this option, for two reasons:
    • --length being bits means it needs to be a multiple of 8, so 7/8 of values that upstream b3sum accepts will be rejected here. That's not so bad, it makes the difference discoverable.
    • In the cases where an upstream length is a multiple of 8, b3sum here would output 1/8 of the length, e.g. instead of a 16-byte hash the user gets a 16-bit hash (2 bytes). This is a TERRIBLE silent security issue.
  • Define -l/--length as bits when used with b2sum or cksum, and bytes when used with b3sum. This is awful for UX; in particular, it is undefined how hashsum should treat it. The least-bad option might be to just special-case the byte interpretation to b3sum, and not allow it to be used with -a blake3 on hashsum (otherwise hashsum would almost certainly end up conflicting with cksum if the latter adds BLAKE3 support).

We have a somewhat-similar issue in that hashsum itself defines a separate --bits flag for its SHA3 and SHAKE support, which is effectively duplicative of GNU Coreutils -l/--length. I'm not going to try solve that here.

@str4d
Copy link
Contributor

str4d commented Jun 5, 2022

cc @oconnor663 who wrote b3sum

@sylvestre
Copy link
Contributor

While I really want GNU's compatibility, we workaround with upstream testsuite from time to time. See
https://github.com/uutils/coreutils/blob/main/util/build-gnu.sh#L121=
and below

@tertsdiepraam
Copy link
Member

Note that the suprising behaviour you mentioned in option 2 not only happens between b3sum implementations but also between different hashing algorithms. I.e. if I use the external b3sum first and then (GNU) b2sum, then the length does not match either. Hence, I think we bring back the consistency and should build what a hypothetical GNU b3sum would look like, not what another b3sum already does, to ensure compatibility/consistency with GNU and with the other utils in uutils.

I like the first option too, but it would indeed break GNU compatibility, so like Sylvestre mentioned, we'd have to work around the tests. Not to mention that it could be surprising for GNU users (although we'd err on the secure side).

str4d added a commit to str4d/coreutils that referenced this issue Jun 5, 2022
Several binaries have been added to `hashsum` that have never been part
of GNU Coreutils:

- `sha3*sum` (uutils#869)
- `shake*sum` (uutils#987)
- `b3sum` (uutils#3108 and uutils#3164)

In particular, the `--bits` option, and the `--no-names` option added in
uutils#3361, are not valid for any GNU Coreutils `*sum` binary
(as of Coreutils 9.0).

This commit refactors the argument parsing so that `--bits` and
`--no-names` become invalid options for the binaries intended to match
the GNU Coreutils API, instead of being ignored options. It also
refactors the custom binary name handling to distinguish between
binaries intended to match the GNU Coreutils API, and binaries that
don't have that constraint.

Part of uutils#2930.
@oconnor663
Copy link
Contributor

oconnor663 commented Jun 5, 2022

Yeah, the inconsistency blows. Here are some scattered thoughts about why it ended up the way it is:

  • Unfortunately, different BLAKE2 interfaces have been inconsistent for a long time. CLIs and written naming conventions tend to use bits, but Python's hashlib.blake2b API takes the digest_size parameter in bytes. Similarly, C APIs usually take a size_t somewhere that counts bytes.

  • Measuring BLAKE2 output in bits makes sense from a security perspective. The security properties of BLAKE2b-512 and BLAKE2b-256 mirror those of SHA-512 and SHA-256. However, this relationship relies somewhat on BLAKE2b not allowing output sizes larger than its internal state size. For example, b2sum -l 1024 will just error out. The same doesn't apply to BLAKE3, though, which has a built-in XOF that'll give you as many output bytes as you like. "BLAKE3-512" (if we were to call it that) does not provide more security than the default 256-bit BLAKE3. I find this is a common source of confusion, so I discourage naming BLAKE3 instances in terms of bits, even though it can make sense on the small end.

  • Another way BLAKE2 is similar to SHA-2 is that BLAKE2b-512 and BLAKE2b-256 are independent ("domain separated") hash functions. Changing the output length gives you a completely unrecognizable hash. This is a natural design for those functions, since they can mix the output length into state initialization one way or another. But for an XOF like BLAKE3, it's pretty common for the output length to be "who knows, we're gonna use this as a stream cipher", so domain-separating outputs of different lengths isn't always possible. Rather than having two modes for this, as BLAKE2X did, in BLAKE3 shorter outputs are always prefixes of longer ones. This is another reason why I discourage bit-counting names like "BLAKE3-512", to avoid implying that variants with different names are independent of each other.

Which is all to say, that when there's a choice between counting bits and counting bytes, BLAKE3 pretty strongly prefers to count bytes. It keeps CLIs and library APIs consistent with each other, and it avoids some common points of confusion. But sometimes there isn't a choice, like our situation here, and then we suffer some inconsistency.

@tertsdiepraam
Copy link
Member

tertsdiepraam commented Jun 5, 2022

Interesting! That does make it clearer as to why b3sum is different. Maybe a solution could be to make a split API. So b2sum (and the other ones GNU supports) would accept a --bits option and -l would work but be deprecated/hidden. b3sum would accept a --bytes option and using the wrong argument would result in an error. However, a downside to this is that we break compat with both GNU and b3sum, although we could still accept the length argument.

@oconnor663
Copy link
Contributor

If you all decide to do something like --bytes, it might be reasonable to add a similar option upstream on b3sum, so that there's at least something we could tell people to do that'll work (or give a clear error message) everywhere?

sylvestre pushed a commit to str4d/coreutils that referenced this issue Jun 10, 2022
Several binaries have been added to `hashsum` that have never been part
of GNU Coreutils:

- `sha3*sum` (uutils#869)
- `shake*sum` (uutils#987)
- `b3sum` (uutils#3108 and uutils#3164)

In particular, the `--bits` option, and the `--no-names` option added in
uutils#3361, are not valid for any GNU Coreutils `*sum` binary
(as of Coreutils 9.0).

This commit refactors the argument parsing so that `--bits` and
`--no-names` become invalid options for the binaries intended to match
the GNU Coreutils API, instead of being ignored options. It also
refactors the custom binary name handling to distinguish between
binaries intended to match the GNU Coreutils API, and binaries that
don't have that constraint.

Part of uutils#2930.
@sylvestre sylvestre moved this to Todo in GNU Compatibility Aug 18, 2022
@sergeevabc
Copy link

sergeevabc commented Jan 4, 2024

Current coreutils b3sum generates hash *filename, but the original implementation by @oconnor663 uses a second space instead of an asterisk and refuses to check hashes if such character is missing as follows

$ type B3SUMS
79b4c2c09d6f3ac0beee530cb69262f81ce303144620728e8c98044dd543ceb2 *out1.png
7febfd4fd14642e0f70c50dd096ec684f6d05afce5d606aeb157ec93a96890ed *out2.png
17a7f68466e2b437042a5b2ebad2b3abde27f2b83d0c57358a2cbfbb31cf05ed *out3.png

$ b3sum -c B3SUMS
b3sum: Invalid space
b3sum: WARNING: 3 computed checksums did NOT match

We need to either discuss it or bring this part of coreutils into line.

P.S. I already have a dozen BLAKE3 implementations, mostly downloaded from Github, and the output format is different between them. The original one also refuses to hash lists with CRLF line breaks, typical for Windows (such lists are generated, for example, by RapidCRC). Obviously, this UX mess creates a difficulty for replacing SHA256.

Click to broaden your horizons even further (and to see a SFW picture!)…

2024-0105-0309 crayfish swan pike

When, among partners, concord there is not.
Successful issues scarce are got
And the result is loss, disaster and repining.

A crayfish, swan and pike combining.
Resolve to draw a cart and freight;
In harness soon, their efforts ne'er abate.
However much they work, the load to stir refuses.
It seems to be perverse with selfwill vast endowed;
The swan makes upward for a cloud.
The crayfish falls behind, the pike the river uses;
To judge of each one's merit lies beyond my will;
I know the cart remains there, still.

— Ivan Krylov, “Crayfish, Swan and Pike” (1814),
translated from Russian by Charles Fillingham Coxwell.

@jtracey
Copy link
Contributor Author

jtracey commented Sep 28, 2024

It would appear this was fixed via #6499.

@jtracey jtracey closed this as completed Sep 28, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in GNU Compatibility Sep 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

6 participants