b3sum can't read hashes from file with non-unix endings #222

Closed
megapro17 opened this issue Jan 19, 2022 · 5 comments

Comments

@megapro17

I'm using PowerShell:

b3sum * > hash.txt
b3sum -c hash.txt

Output for each file is:
: FAILED (Syntax error in file name, folder name or volume label. (Os error 123))

But after running busybox dos2unix hash.txt, it works correctly.

Similar to #108.

@oconnor663
Member

I'm not very familiar with PS, but I can repro this bug on my Windows box. It looks like it's not (only) a line endings issue, but actually a Unicode encoding issue, UTF-8 vs UTF-16. Here's how you can see it:

# I've prepared a "test" directory with two files, "a" and "b"
PS C:\Users\oconn\tmp> b3sum test\*
5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a
5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b

# We can see that b3sum's output is UTF-8, using Python.
PS C:\Users\oconn\tmp> python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.run("b3sum test\*", shell=True, stdout=subprocess.PIPE).stdout
b'5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a\n5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b\n'
>>> _.decode("utf-8")
'5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a\n5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b\n'

# However, that's not what we get if we redirect the output in PowerShell.
# It looks like PowerShell reencodes the output as UTF-16.
PS C:\Users\oconn\tmp> b3sum test\* > out.txt
PS C:\Users\oconn\tmp> python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("out.txt", "rb").read()
b'\xff\xfe5\x00d\x003\x00b\x004\x001\x001\x004\x003\x00c\x00f\x007\x003\x00b\x005\x005\x00e\x00c\x00f\x004\x00d\x009\x000\x00f\x002\x00c\x00e\x000\x00b\x00f\x001\x00d\x003\x00f\x008\x00e\x004\x00b\x003\x000\x005\x00d\x004\x005\x00d\x009\x00f\x009\x003\x006\x001\x00a\x007\x004\x006\x00e\x00b\x000\x009\x000\x004\x000\x00f\x000\x00 \x00 \x00t\x00e\x00s\x00t\x00/\x00a\x00\r\x00\n\x005\x00d\x003\x00b\x004\x001\x001\x004\x003\x00c\x00f\x007\x003\x00b\x005\x005\x00e\x00c\x00f\x004\x00d\x009\x000\x00f\x002\x00c\x00e\x000\x00b\x00f\x001\x00d\x003\x00f\x008\x00e\x004\x00b\x003\x000\x005\x00d\x004\x005\x00d\x009\x00f\x009\x003\x006\x001\x00a\x007\x004\x006\x00e\x00b\x000\x009\x000\x004\x000\x00f\x000\x00 \x00 \x00t\x00e\x00s\x00t\x00/\x00b\x00\r\x00\n\x00'
>>> _.decode("utf-16")
'5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/a\r\n5d3b41143cf73b55ecf4d90f2ce0bf1d3f8e4b305d45d9f9361a746eb09040f0  test/b\r\n'

We can see at the end there that there are Windows-style newlines, and I expect those would cause a problem in b3sum. But we never get to that problem, because we try to decode the checkfile as UTF-8 and fail immediately.

I'm not sure what the best workaround is for this. Is there a way to tell PowerShell to redirect raw bytes?
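One possible workaround, sketched in Python since that is what the session above uses, is to re-encode the redirected file before handing it to b3sum -c. This is only an illustration and not a b3sum feature: the helper name normalize_checkfile is hypothetical, and the sample bytes use a shortened placeholder hash.

```python
import codecs


def normalize_checkfile(raw: bytes) -> bytes:
    """Return checkfile bytes as BOM-free UTF-8 with LF line endings.

    Hypothetical helper, not part of b3sum.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        # "utf-8-sig" also strips a UTF-8 BOM if one is present.
        text = raw.decode("utf-8-sig")
    # Caveat: a file name that really ends in "\r" loses that character here.
    return text.replace("\r\n", "\n").encode("utf-8")


# Bytes shaped like out.txt above: UTF-16 with a BOM and CRLF line endings.
sample = "5d3b  test/a\r\n".encode("utf-16")
print(normalize_checkfile(sample))
```

Reading out.txt in binary mode, passing the bytes through this function, and writing the result back in binary mode should give a file in the same shape as what b3sum prints when its output is not re-encoded.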

@megapro17
Author

Which version of PowerShell are you using? I'm running the latest PowerShell 7.2.1, and it seems to output UTF-8 correctly:

b3sum *.txt > sum.hash
b3sum -c sum.hash
: FAILED (Syntax error in file name, folder name or volume label. (Os error 123))
: FAILED (Syntax error in file name, folder name or volume label. (Os error 123))
python
Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("sum.hash", "rb").read()
b'd63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\r\n813e9b729141e7f385afa0a2d0df3e6c3789e427ffe4aeef566a565bc8f2fe3d  2.txt\r\n'
>>> _.decode("utf-16")
'㙤戳㥤㡡㘲晡ㄹㅣ敦㍡ㄷ㘹愵㐶ㅥ攱㉥昰㌱㑥戶昵㈵㕣㤹\u3130㌱㘶㔰㍢㑡㜸†⸱硴൴㠊㌱㥥㝢㤲㐱攱昷㠳愵慦愰搲搰㍦㙥㍣㠷改㈴昷敦愴敥㕦㘶㕡㔶换昸昲㍥\u2064㈠琮瑸\u0a0d'
...
>>> _.decode("utf-8")
'd63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\r\n813e9b729141e7f385afa0a2d0df3e6c3789e427ffe4aeef566a565bc8f2fe3d  2.txt\r\n'

But with cmd, it outputs a correct file:

 ERROR megapro17@megapro17-pc  R:  test  cmd
Microsoft Windows [Version 10.0.22000.466]
(c) Microsoft Corporation. All rights reserved.

R:\test>b3sum *.txt > sum.hash

R:\test>b3sum -c sum.hash
1.txt: OK
2.txt: OK

R:\test>python
Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("sum.hash", "rb").read()
b'd63bd9a826af91c1fea371965a64e11ee20f13e46b5f52c59901136605b3a487  1.txt\n813e9b729141e7f385afa0a2d0df3e6c3789e427ffe4aeef566a565bc8f2fe3d  2.txt\n'
>>>

> I'm not sure what the best workaround is for this. Is there a way to tell PowerShell to redirect raw bytes?

Just adding ability to read any file endings. It's possible to create a txt file in Notepad, paste the hashes there, and it will still not work.
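To illustrate why CRLF endings alone can break verification, here is a simplified Python model of a reader that treats only "\n" as the line terminator. It is an assumption about the general mechanism, not b3sum's actual parser; the function name file_names is hypothetical and the hashes are shortened placeholders.

```python
def file_names(content: str) -> list[str]:
    """Extract the NAME field from each 'HASH  NAME' line, splitting on LF only."""
    names = []
    for line in content.split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        _digest, name = line.split("  ", 1)
        names.append(name)
    return names


# The same two entries with CRLF endings (PowerShell) and LF endings (cmd).
crlf = "d63b  1.txt\r\n813e  2.txt\r\n"
lf = "d63b  1.txt\n813e  2.txt\n"

print(file_names(crlf))
print(file_names(lf))
```

In the CRLF case each name keeps a trailing carriage return, which Windows does not accept in a file name. That would be consistent with OS error 123, and with the file name appearing blank in the FAILED lines if the terminal returns to the start of the line when it prints the "\r".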

@oconnor663
Member

> Just adding ability to read any file endings.

This is a reasonable idea, but I want to clarify that it's backwards-incompatible with what b3sum currently does. File names are allowed to contain the \r carriage return character, and b3sum (like md5sum) prints that character without escaping. Because of that, stripping a trailing \r\n from each line could change the meaning of a currently valid checkfile. To do this properly, we'd probably need to start escaping \r just like we currently escape \n.
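A rough Python sketch of what escaping \r alongside \n could look like, modeled on the md5sum convention of flagging an escaped line with a leading backslash. Whether b3sum would adopt exactly this form is an assumption; format_line, parse_line, and UNESCAPES are hypothetical names, and the hash is a shortened placeholder.

```python
UNESCAPES = {"n": "\n", "r": "\r", "\\": "\\"}


def format_line(digest: str, name: str) -> str:
    """Build one checkfile line, escaping backslash, LF and CR in the name."""
    escaped = (
        name.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace("\n", "\\n")
        .replace("\r", "\\r")  # the proposed addition
    )
    # A leading backslash marks the whole line as escaped.
    prefix = "\\" if escaped != name else ""
    return f"{prefix}{digest}  {escaped}"


def parse_line(line: str) -> tuple[str, str]:
    """Invert format_line, returning (digest, name)."""
    is_escaped = line.startswith("\\")
    body = line[1:] if is_escaped else line
    digest, name = body.split("  ", 1)
    if not is_escaped:
        return digest, name
    out = []
    chars = iter(name)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, "")
        if code not in UNESCAPES:
            raise ValueError(f"bad escape in checkfile line: {line!r}")
        out.append(UNESCAPES[code])
    return digest, "".join(out)


line = format_line("d63b", "odd\rname.txt")
print(line)
print(parse_line(line) == ("d63b", "odd\rname.txt"))
```

Under a scheme like this, a raw "\r" could never be part of a name in a newly written checkfile, so a reader could drop a trailing "\r" from each line without ambiguity. Checkfiles written before such a change would still carry unescaped "\r" characters, which is the compatibility concern described above.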

@n8w8

n8w8 commented Sep 27, 2023

@oconnor663,

> [...] the \r carriage return character, and b3sum (like md5sum) prints that character without escaping

On Debian 12, I find that GNU coreutils (md5sum/sha*sum/b2sum) do escape CR to "\r".
However, I also find that busybox *sum and perl shasum cannot verify those coreutils checksums.

I found a relevant commit here, with some reasoning about it, also involving Windows.

Anyway, it seemed relevant to mention here.

@AnselmD

AnselmD commented Nov 26, 2023

I would like to cross reference:

(1) blake3 / incompatibilities with cli (original) b3sum implementation - Total Commander
https://ghisler.ch/board/viewtopic.php?t=80593
