Windows paths allow unpaired surrogates #2565
Comments
see here. |
The only way I can think to deal with this universally would be to consider all file paths to be raw bytes and ignore encoding. I'm not sure that's viable in a cross-platform api. |
@tgschultz these can be encoded with WTF-8, if someone cares enough to do the work. I haven't even gotten my UTF-8 improvements merged yet. The problem is that a whole bunch of APIs would then have to accept WTF-8 instead of UTF-8 and have flags to tell them apart, so it's a huge maintenance burden to support junk. As Windows already has a bunch of nutty path-name restrictions, I think erroring out on invalid Unicode is fine. |
Came across this article, which might be relevant: https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html |
Actually yeah, I think you're right. Erroring out on the edge case of unpaired surrogates is probably better, and then the programmer can use the OS API directly if they really need to support it. Does anyone have information about how often such cases occur in reality? My first guess would be in Asian locales where 2-byte character representations were used pre-UCS2. Another alternative would be having a real string type in std in the form of a tagged byte buffer with the encoding specified. Of course, that also means rewriting a whole lot of APIs. |
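For reference, the "error out on unpaired surrogates" option discussed above amounts to a scan over the 16-bit code units of a path. The following is a minimal Python sketch of that check, not code from any standard library; the function name `first_unpaired_surrogate` is hypothetical and the input is assumed to be a plain list of integers.

```python
from typing import Optional, Sequence


def first_unpaired_surrogate(units: Sequence[int]) -> Optional[int]:
    """Return the index of the first unpaired surrogate, or None if well-formed.

    `units` is assumed to be a sequence of 16-bit code units (hypothetical
    input format chosen for this sketch).
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return i
        i += 1
    return None
```

An API taking this route would return an error whenever the result is not `None`, leaving callers who need such paths to use the OS API directly.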
I think erroring out is the wrong option; converting to/from WTF-8 is the well-established solution in other languages.
They get onto the file system often enough if someone is truncating strings in Java or JavaScript (languages where strings are sequences of 16-bit code units) without realising they could contain astral plane characters (e.g. emojis). |
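To illustrate how such names come about, here is a small Python sketch that imitates truncation by 16-bit code units, roughly what a naive substring call does in those languages. The helper `truncate_utf16_units` and the file name are made up for the example.

```python
def truncate_utf16_units(text: str, max_units: int) -> str:
    """Truncate `text` to at most `max_units` UTF-16 code units.

    This deliberately ignores surrogate pairs, imitating a naive
    code-unit-based substring (hypothetical helper for illustration).
    """
    raw = text.encode("utf-16-le")
    cut = raw[: max_units * 2]  # two bytes per code unit
    # "surrogatepass" keeps a lone surrogate instead of raising an error.
    return cut.decode("utf-16-le", errors="surrogatepass")


# "report" is 6 code units; the emoji U+1F600 is the pair D83D DE00.
name = "report\U0001F600.txt"
truncated = truncate_utf16_units(name, 7)


def is_strict_utf8_encodable(s: str) -> bool:
    """True if `s` can be encoded as strict UTF-8 (no lone surrogates)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

Here `truncated` ends in the lone high surrogate U+D83D, so it can no longer be encoded as strict UTF-8, yet Windows would accept the same code units as a file name.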
Reference for anyone wanting to tackle a support PR: Bun handles it here: |
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior. WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8. Closes ziglang#18694 Closes ziglang#1774 Closes ziglang#2565
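As a rough illustration of why the WTF-16 <-> WTF-8 conversion is lossless, the sketch below approximates it in Python using the `surrogatepass` error handler. This is not the Zig implementation; the function names are hypothetical, and Python's decoder is more lenient than WTF-8 proper, since it also accepts a surrogate pair written as two 3-byte sequences, which WTF-8 forbids.

```python
def wtf16le_to_wtf8(raw: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16LE bytes to WTF-8-style bytes.

    Paired surrogates become one 4-byte sequence; a lone surrogate
    becomes a 3-byte sequence (hypothetical helper for illustration).
    """
    text = raw.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_wtf16le(raw: bytes) -> bytes:
    """Inverse of wtf16le_to_wtf8 for bytes that function produced."""
    text = raw.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


def is_strict_utf8(raw: bytes) -> bool:
    """True if `raw` decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Code units: 'a', lone high surrogate 0xD800, 'b'.
path_utf16 = b"a\x00\x00\xd8b\x00"
path_wtf8 = wtf16le_to_wtf8(path_utf16)
round_tripped = wtf8_to_wtf16le(path_wtf8)
```

The lone surrogate survives the round trip unchanged, while `path_wtf8` is rejected by a strict UTF-8 decoder. That is the situation on WASI, where paths must be valid UTF-8 and the conversion can therefore fail.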
Windows allows unpaired surrogates in paths, which are not valid UTF-16LE.
Originally posted by @daurnimator in #2527