Windows paths allow unpaired surrogates #2565

Closed
daurnimator opened this issue May 27, 2019 · 7 comments · Fixed by #19005
Labels
bug Observed behavior contradicts documented or intended behavior os-windows standard library This issue involves writing Zig code for the standard library.

@daurnimator
Contributor

Windows allows unpaired surrogates in paths, which are not valid UTF-16LE.

Originally posted by @daurnimator in #2527
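
To make the problem concrete, here is a minimal sketch (in Python, not Zig) of what "unpaired surrogate" means at the level of UTF-16 code units. The function name and the sample file names are hypothetical and only for illustration.

```python
def find_unpaired_surrogates(units):
    """Return the indices of UTF-16 code units that are surrogates without a partner."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows immediately.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as the second half of a pair.
            bad.append(i)
        i += 1
    return bad


# Hypothetical file names, written as UTF-16 code units.
well_formed = [0x0061, 0xD83D, 0xDE00, 0x002E, 0x0074, 0x0078, 0x0074]  # 'a', U+1F600, ".txt"
ill_formed = [0x0061, 0xD83D, 0x002E, 0x0074, 0x0078, 0x0074]           # high surrogate, no low
```

Windows accepts both sequences as file names; only the first is valid UTF-16.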

@andrewrk andrewrk added bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library. labels May 27, 2019
@andrewrk andrewrk added this to the 0.6.0 milestone May 27, 2019
@emekoi
Contributor

emekoi commented May 27, 2019

see here.

@tgschultz
Contributor

The only way I can think to deal with this universally would be to consider all file paths to be raw bytes and ignore encoding. I'm not sure that's viable in a cross-platform api.

@shawnl
Contributor

shawnl commented May 27, 2019

@tgschultz these can be encoded with WTF-8, if someone cares enough to do the work. I haven't even gotten my UTF-8 improvements merged yet. The problem is that a whole bunch of APIs then have to accept WTF-8 instead of UTF-8, and need flags to tell them apart, so it's a huge maintenance burden to support junk. As Windows already has a bunch of nutty path-name restrictions, I think erroring out on invalid Unicode is fine.
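
For readers unfamiliar with WTF-8, the idea can be sketched roughly as follows: well-formed surrogate pairs are combined and encoded as ordinary 4-byte UTF-8, while a lone surrogate is encoded as the 3-byte sequence that UTF-8 would otherwise forbid. This is an illustrative Python sketch under that reading of the encoding, not Zig standard library code; the function name is made up.

```python
def wtf16_to_wtf8(units):
    """Encode a sequence of 16-bit code units (possibly ill-formed UTF-16) as WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        # Combine a high surrogate with the low surrogate that follows it, if any.
        if 0xD800 <= cp <= 0xDBFF and i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # From here on this is the usual UTF-8 bit layout, except that lone
        # surrogates (U+D800..U+DFFF) are allowed through the 3-byte branch.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        else:
            out += bytes([
                0xF0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F),
            ])
    return bytes(out)
```

For well-formed input the result is byte-for-byte identical to UTF-8, which is why the maintenance concern above is mostly about APIs having to document which of the two they accept.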

@mikdusan
Member

Came across this article; it might be relevant: https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html

@tgschultz
Contributor

tgschultz commented May 27, 2019

Actually yeah, I think you're right. Erroring out on the edge case of unpaired surrogates is probably better, and then the programmer can use the OS API directly if they really need to support it.

Does anyone have information about how often such cases occur in reality? My first guess would be in Asian locales where 2-byte character representations were used pre-UCS2.

Another alternative would be having a real string type in std in the form of a tagged byte buffer with the encoding specified. Of course, that also means rewriting a whole lot of APIs.
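
A tagged byte buffer of that kind might look something like the following. This is a purely hypothetical sketch in Python; `Encoding`, `TaggedString`, and `as_utf8` are invented names and do not correspond to anything in Zig's std.

```python
from dataclasses import dataclass
from enum import Enum


class Encoding(Enum):
    UTF8 = "utf-8"
    WTF8 = "wtf-8"
    RAW = "raw"


@dataclass(frozen=True)
class TaggedString:
    """Bytes plus a tag recording which encoding the producer guarantees."""
    data: bytes
    encoding: Encoding

    def as_utf8(self) -> bytes:
        """Return the bytes if they are valid UTF-8, else raise UnicodeDecodeError."""
        if self.encoding is not Encoding.UTF8:
            # The tag makes no UTF-8 promise, so validate before handing the bytes out.
            self.data.decode("utf-8")
        return self.data
```

The cost mentioned above is visible even here: every API that consumes such a value has to decide which tags it accepts and what to do with the rest.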

@daurnimator
Contributor Author

> Actually yeah, I think you're right. Erroring out on the edge case of unpaired surrogates is probably better, and then the programmer can use the OS API directly if they really need to support it.

I think erroring out is the wrong option, converting to/from WTF-8 is the well-established solution in other languages.

> Does anyone have information about how often such cases occur in reality?

They get onto the file system often enough when someone truncates strings in Java or JavaScript (languages whose strings are sequences of 16-bit code units) without realising they may contain astral-plane characters (e.g. emoji).
I would not expect, e.g., directory iteration in Zig to suddenly fail in the presence of such characters, or a path returned from directory iteration to be unusable with delete.
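
The truncation scenario is easy to reproduce. The sketch below (Python, with a hypothetical file name and helper names) cuts a string at a fixed number of UTF-16 code units, the way a length-limited string API in such a language might, and shows that the cut can land in the middle of a surrogate pair.

```python
def truncate_utf16(text: str, max_units: int) -> bytes:
    """Naively keep the first max_units UTF-16 code units of text (little-endian bytes)."""
    return text.encode("utf-16-le")[: max_units * 2]


def is_well_formed_utf16(data: bytes) -> bool:
    """True if data decodes as strict UTF-16LE, i.e. contains no unpaired surrogates."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical name: six BMP characters followed by U+1F600, which needs two code units.
name = "report\U0001F600"
whole = truncate_utf16(name, 8)  # keeps both halves of the surrogate pair
cut = truncate_utf16(name, 7)    # keeps only the high surrogate
```

Here `whole` is still well-formed, while `cut` ends in a lone high surrogate (0xD83D) and is exactly the kind of name that ends up on disk.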

@andrewrk andrewrk modified the milestones: 0.6.0, 0.7.0 Jan 5, 2020
@andrewrk andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 30, 2020
@andrewrk andrewrk modified the milestones: 0.8.0, 0.8.1 Jun 4, 2021
@andrewrk andrewrk modified the milestones: 0.8.1, 0.9.1 Sep 1, 2021
@andrewrk andrewrk modified the milestones: 0.9.1, 0.9.0, 0.10.0 Nov 20, 2021
@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Jun 19, 2023
@jdalton

jdalton commented Feb 9, 2024

Reference for anyone wanting to tackle a support PR: bun handles it here:

https://github.com/oven-sh/bun/blob/f77b217abf2f33fdea4a50298d07cd601a655b0d/src/string_immutable.zig#L1835-L1848

squeek502 added a commit to squeek502/zig that referenced this issue Feb 19, 2024
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.

WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
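
The two properties described in this commit message, a lossless round trip on Windows and a strict UTF-8 requirement on WASI, can be illustrated with a short sketch. It is written in Python rather than Zig and leans on Python's `surrogatepass` error handler as a stand-in for WTF-8; that is an approximation (it also accepts encoded surrogate pairs, which WTF-8 proper forbids), and the function names are hypothetical rather than the names used in the commit.

```python
def wtf16le_to_wtf8(data: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16LE bytes to WTF-8-style bytes."""
    return data.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")


def wtf8_to_wtf16le(data: bytes) -> bytes:
    """Inverse of wtf16le_to_wtf8 for bytes produced by it."""
    return data.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")


def is_valid_utf8(data: bytes) -> bool:
    """Strict check of the kind a UTF-8-only target would need before accepting a path."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical path: 'a', a lone high surrogate (0xD83D), then ".txt".
ill_formed = "a".encode("utf-16-le") + b"\x3d\xd8" + ".txt".encode("utf-16-le")
as_wtf8 = wtf16le_to_wtf8(ill_formed)
```

The round trip `wtf8_to_wtf16le(as_wtf8)` reproduces `ill_formed` exactly, while `is_valid_utf8(as_wtf8)` is false, which is the case a UTF-8-only target has to report as an error.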
@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Feb 25, 2024
Rexicon226 pushed a commit to Rexicon226/zig that referenced this issue Feb 25, 2024
RossComputerGuy pushed a commit to ExpidusOS-archive/zig that referenced this issue Mar 20, 2024
7 participants