Windows paths allow unpaired surrogates #2565

daurnimator · 2019-05-27T07:07:44Z

Windows allows unpaired surrogates in paths, which are not valid UTF-16LE.

Originally posted by @daurnimator in #2527

emekoi · 2019-05-27T14:54:10Z

tgschultz · 2019-05-27T15:34:22Z

The only way I can think of to deal with this universally would be to treat all file paths as raw bytes and ignore encoding. I'm not sure that's viable in a cross-platform API.

shawnl · 2019-05-27T15:57:34Z

@tgschultz these can be encoded with WTF-8, if someone cares enough to do the work. I haven't even gotten my UTF-8 improvements merged yet. The problem is that a whole bunch of APIs then have to accept WTF-8 instead of UTF-8, and have flags to tell them apart, so it's a huge maintenance burden to support junk. As Windows already has a bunch of nutty path-name restrictions, I think erroring out on invalid Unicode is fine.
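
To make the two options concrete, here is a minimal sketch in Python (chosen only for illustration; none of this is Zig's std API). It assumes Python's `surrogatepass` error handler as a stand-in for a WTF-8 encoder, which matches WTF-8 for strings decoded from UTF-16 this way, since real surrogate pairs are already merged into a single code point before encoding.

```python
# A file name as Windows might store it: "a", a lone high surrogate (U+D83D), "b".
raw_utf16le = b"a\x00" + b"\x3d\xd8" + b"b\x00"

# Option 1: treat the name as strict UTF-16 and error out.
try:
    raw_utf16le.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 2: let the lone surrogate through and encode it WTF-8 style.
# "surrogatepass" is an approximation of a WTF-8 encoder (assumption, see above).
name = raw_utf16le.decode("utf-16-le", "surrogatepass")
wtf8 = name.encode("utf-8", "surrogatepass")

print(strict_ok)      # False
print(wtf8.hex(" "))  # 61 ed a0 bd 62
```

The lone surrogate ends up as the three bytes `ed a0 bd`, which a strict UTF-8 decoder rejects. That is why an API accepting WTF-8 needs to be told apart from one accepting UTF-8.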

mikdusan · 2019-05-27T16:13:05Z

Came across this article, which might be relevant: https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html

tgschultz · 2019-05-27T16:20:26Z

Actually yeah, I think you're right. Erroring out on the edge case of unpaired surrogates is probably better, and then the programmer can use the OS API directly if they really need to support it.

Does anyone have information about how often such cases occur in reality? My first guess would be in Asian locales where 2-byte character representations were used pre-UCS2.

Another alternative would be having a real string type in std in the form of a tagged byte buffer with the encoding specified. Of course, that also means rewriting a whole lot of APIs.

daurnimator · 2019-05-28T06:03:41Z

> Actually yeah, I think you're right. Erroring out on the edge case of unpaired surrogates is probably better, and then the programmer can use the OS API directly if they really need to support it.

I think erroring out is the wrong option, converting to/from WTF-8 is the well-established solution in other languages.

> Does anyone have information about how often such cases occur in reality?

They get onto the file system often enough if someone is truncating strings in Java or JavaScript (languages where strings are UCS-2) without realising they could contain astral plane characters (e.g. emojis).
I would not expect e.g. directory iteration in Zig to suddenly fail in the presence of such characters, or a path returned from a directory iteration to be rejected when passed to delete.
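
A small sketch of how such a name can come about, written in Python for illustration. The helper names `truncate_utf16_units` and `ends_with_lone_high_surrogate` are hypothetical; the truncation mimics cutting a string by UTF-16 code-unit count without checking for surrogate pairs.

```python
def truncate_utf16_units(text: str, max_units: int) -> bytes:
    """Cut text to at most max_units UTF-16 code units, ignoring pair boundaries."""
    encoded = text.encode("utf-16-le")
    return encoded[: max_units * 2]  # each code unit is 2 bytes


def ends_with_lone_high_surrogate(utf16le: bytes) -> bool:
    """Report whether the final code unit is a high surrogate with nothing after it."""
    if len(utf16le) < 2:
        return False
    last_unit = int.from_bytes(utf16le[-2:], "little")
    return 0xD800 <= last_unit <= 0xDBFF


# 6 BMP characters plus one astral character (U+1F600) = 8 code units.
name = "report\U0001F600"
cut = truncate_utf16_units(name, 7)

print(ends_with_lone_high_surrogate(cut))  # True
```

Cutting at 7 code units keeps the high half of the emoji's surrogate pair and drops the low half, so the resulting name is exactly the kind of ill-formed UTF-16 discussed here.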

jdalton · 2024-02-09T19:18:39Z

Reference for anyone wanting to tackle a PR adding support: bun handles it here:

https://github.com/oven-sh/bun/blob/f77b217abf2f33fdea4a50298d07cd601a655b0d/src/string_immutable.zig#L1835-L1848

Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior. WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.

Closes ziglang#18694
Closes ziglang#1774
Closes ziglang#2565
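
The "lossless" property can be sketched as a round trip. This is an illustration in Python, not the code merged into Zig: the function names are hypothetical, and `surrogatepass` is again assumed as an approximation of WTF-8 conversion.

```python
def wtf16le_to_wtf8(data: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16LE to WTF-8-style bytes."""
    text = data.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_wtf16le(data: bytes) -> bytes:
    """Convert WTF-8-style bytes back to UTF-16LE code units."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


def is_valid_utf8(data: bytes) -> bool:
    """Strict check, as a platform requiring valid UTF-8 paths would need."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


well_formed = "caf\u00e9\U0001F600".encode("utf-16-le")
ill_formed = b"x\x00" + b"\x00\xde" + b"y\x00"  # lone low surrogate U+DE00

for original in (well_formed, ill_formed):
    # Every input survives the round trip unchanged.
    assert wtf8_to_wtf16le(wtf16le_to_wtf8(original)) == original

print(is_valid_utf8(wtf16le_to_wtf8(well_formed)))  # True
print(is_valid_utf8(wtf16le_to_wtf8(ill_formed)))   # False
```

The second result shows why a target that requires valid UTF-8 still needs a failure path: the ill-formed name converts without loss, but the output is not valid UTF-8.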

daurnimator mentioned this issue May 27, 2019

rework the API layers between the standard library and the operating system #2527

Merged

5 tasks

andrewrk added the bug and standard library labels May 27, 2019

andrewrk added this to the 0.6.0 milestone May 27, 2019

andrewrk modified the milestones: 0.6.0, 0.7.0 Jan 5, 2020

andrewrk added the os-windows label Jan 5, 2020

hryx mentioned this issue Jan 7, 2020

json: disallow overlong and out-of-range UTF-8 #4097

Merged

andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 30, 2020

andrewrk modified the milestones: 0.8.0, 0.8.1 Jun 4, 2021

andrewrk modified the milestones: 0.8.1, 0.9.1 Sep 1, 2021

andrewrk modified the milestones: 0.9.1, 0.9.0, 0.10.0 Nov 20, 2021

andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022

andrewrk modified the milestones: 0.11.0, 0.12.0 Jun 19, 2023

jdalton mentioned this issue Feb 9, 2024

Add win32 path.toNamespacedPath and align rest of node:path with Node oven-sh/bun#8469

Merged

1 task

squeek502 mentioned this issue Feb 19, 2024

Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc #19005

Merged

andrewrk modified the milestones: 0.13.0, 0.12.0 Feb 25, 2024

andrewrk closed this as completed in #19005 Feb 25, 2024

andrewrk closed this as completed in 68b8791 Feb 25, 2024

kiroxas mentioned this issue Nov 29, 2024

Improve parse_utf8 performance godotengine/godot#99826

Open
