syscall: Windows filenames with unpaired surrogates are not handled correctly #32334
Comments
/cc @alexbrainman |
@hundt thank you very much for creating this issue. I can reproduce your problem here on my Windows 10. I also tried opening the file with the 'corrupted' file name in Notepad++, and Notepad++ can read and write the file. Mind you, Notepad++ uses the standard system 'Open' dialogue to get the filename. So, I agree. Go should probably deal with these file names somehow. Unfortunately I don't have time to deal with this issue, so I'm leaving it for others. Alex |
I lack the ability to tag issues, but I wonder if this issue has security implications. I am thinking of things like:
I am not a security expert so maybe these scenarios are far-fetched or not that important. |
@gopherbot add "help wanted" |
This happens because Go converts invalid sequences to 0xFFFD with
https://play.golang.org/p/IonQ_Sk2U8n This is the right behavior in UTF-8. 0xED indicates that the next byte must be in the range 0x80-0x9F, but 0xB0 is outside that range. If utf16.Decode accepted this invalid range for the second byte (by any chance?), then, since 0x80 is in the range 0x80-0xBF, the code point would be:
|
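The byte-range argument in the comment above can be checked with a small sketch. It assumes the bytes under discussion are 0xED 0xB0 0x80; `lenientDecode3` is a hypothetical helper written for this illustration, not a standard library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lenientDecode3 is a hypothetical helper (not part of the standard library).
// It unpacks a three-byte UTF-8-style sequence without rejecting surrogates.
func lenientDecode3(b [3]byte) rune {
	return rune(b[0]&0x0F)<<12 | rune(b[1]&0x3F)<<6 | rune(b[2]&0x3F)
}

func main() {
	seq := [3]byte{0xED, 0xB0, 0x80}

	// Strict decoding: after 0xED the second byte must be 0x80-0x9F,
	// so 0xB0 makes the sequence invalid and RuneError (U+FFFD) is returned.
	r, size := utf8.DecodeRune(seq[:])
	fmt.Printf("strict:  %U (size %d)\n", r, size)

	// Lenient decoding: the same bytes unpack to the low surrogate U+DC00.
	fmt.Printf("lenient: %U\n", lenientDecode3(seq))
}
```

The strict path is what valid UTF-8 requires; the lenient path only shows which surrogate the bytes would represent if that restriction were lifted.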
Using U+FFFD in this context will break the universal assumption that a given directory cannot contain two files with the same name. If you have files whose names are 0xD800 and 0xD801, they will both be returned as "\uFFFD". |
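The collision described in this comment can be shown without touching the filesystem. This is a minimal sketch that assumes the name conversion behaves like `utf16.Decode`; `decodeName` is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeName is a hypothetical helper: it converts a UTF-16 filename to a
// Go string via utf16.Decode, which replaces unpaired surrogates by U+FFFD.
func decodeName(name []uint16) string {
	return string(utf16.Decode(name))
}

func main() {
	a := decodeName([]uint16{0xD800}) // lone high surrogate
	b := decodeName([]uint16{0xD801}) // a different lone high surrogate

	// Two distinct on-disk names become indistinguishable Go strings.
	fmt.Printf("%q %q equal=%v\n", a, b, a == b)
}
```

Both names decode to the same one-character string, so a caller listing the directory cannot tell the two files apart.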
There is a well-known issue with Windows/NTFS (see rust-lang/rust#12056 and https://lwn.net/Articles/684181/) where filenames are treated as UTF-16 but are allowed to contain unpaired surrogates. But syscall_windows.go assumes that the input and output to the Windows syscalls is valid UTF-16. This breaks some of the high-level APIs; for example, File.Readdir on a directory containing files with unpaired surrogates in the names will return FileInfo results with incorrect names (valid filenames but referring to different or nonexistent files). A demonstration is included below.
I'm not sure what a reasonable solution would be. I guess essentially something like WTF-8 where the strings that come back from these syscalls on Windows are generally valid UTF-8 but might not be?
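To make the WTF-8 idea concrete, here is a rough sketch of a lossless conversion. It is not what Go's syscall package does; `encodeWTF8` and `decodeWTF8` are hypothetical names, and the decoder assumes its input was produced by the encoder.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodeWTF8 (hypothetical) converts UTF-16 to bytes. Valid pairs become
// ordinary UTF-8; an unpaired surrogate is kept as a three-byte sequence.
func encodeWTF8(u []uint16) []byte {
	var out []byte
	for i := 0; i < len(u); i++ {
		r := rune(u[i])
		if !utf16.IsSurrogate(r) {
			out = append(out, string(r)...)
			continue
		}
		// High surrogate followed by a low surrogate: a valid pair.
		if r < 0xDC00 && i+1 < len(u) && u[i+1] >= 0xDC00 && u[i+1] <= 0xDFFF {
			out = append(out, string(utf16.DecodeRune(r, rune(u[i+1])))...)
			i++
			continue
		}
		// Unpaired surrogate: encode its 16 bits directly.
		out = append(out, 0xE0|byte(r>>12), 0x80|(byte(r>>6)&0x3F), 0x80|(byte(r)&0x3F))
	}
	return out
}

// decodeWTF8 (hypothetical) reverses encodeWTF8.
func decodeWTF8(b []byte) []uint16 {
	var out []uint16
	for len(b) > 0 {
		// 0xED followed by 0xA0-0xBF marks an encoded lone surrogate.
		if len(b) >= 3 && b[0] == 0xED && b[1] >= 0xA0 && b[1] <= 0xBF && b[2]&0xC0 == 0x80 {
			out = append(out, uint16(b[0]&0x0F)<<12|uint16(b[1]&0x3F)<<6|uint16(b[2]&0x3F))
			b = b[3:]
			continue
		}
		r, size := utf8.DecodeRune(b)
		out = append(out, utf16.Encode([]rune{r})...)
		b = b[size:]
	}
	return out
}

func main() {
	name := []uint16{0xdcc0, 0x2e, 0x74, 0x78, 0x74} // <unpaired surrogate>.txt
	enc := encodeWTF8(name)
	fmt.Printf("% x\n", enc)
	fmt.Println(decodeWTF8(enc))
}
```

With an encoding like this, the resulting string is not always valid UTF-8, but the original UTF-16 name can be recovered exactly, which is the property the replacement-character approach loses.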
I'm not a Windows developer so I'm not sure how often this issue comes up in real life, but I happened to notice it so I thought I'd flag it in case anyone finds it worth taking action, or so people can find this documentation of the issue if they encounter it.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Created a file named <unpaired surrogate>.txt ([]uint16=[0xdcc0 0x2e 0x74 0x78 0x74]) and attempted to read it by calling ioutil.ReadDir and reading all the files that come back.
Code snippet
What did you expect to see?
The code successfully opens the file.
What did you see instead?
The code attempts to open a file with name [0xfffd 0x2e 0x74 0x78 0x74] instead.
With the test cases above it fails with an error:
If you add 0xfffd to testCases (to create the "replacement character" file that it is looking for) it will actually open that same file twice:
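The name substitution reported under "What did you see instead?" can be illustrated in isolation. The sketch below assumes the conversions on the way out of and back into the Windows syscalls behave like `utf16.Decode` and `utf16.Encode`; `roundTrip` is a hypothetical helper and no files are created.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// roundTrip is a hypothetical helper: it decodes a UTF-16 name to a Go
// string and encodes that string back to UTF-16, as a caller would when
// passing a listed name to a later open call.
func roundTrip(name []uint16) []uint16 {
	s := string(utf16.Decode(name))
	return utf16.Encode([]rune(s))
}

func main() {
	original := []uint16{0xdcc0, 0x2e, 0x74, 0x78, 0x74}
	reopened := roundTrip(original)

	// The first code unit is no longer 0xdcc0, so the name now refers to
	// a different (possibly nonexistent) file.
	fmt.Printf("original: %#x\n", original)
	fmt.Printf("reopened: %#x\n", reopened)
}
```

The first code unit comes back as 0xfffd, matching the name the issue reports the code trying to open.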