-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect decoding from JS strings to USV strings #15324
Comments
Patches more than welcome if you feel like fixing :) I'm not sure what the best approach is but it seems like the intent was to at least issue a warning, so making that work would be the first fix I guess? And testing it. I guess this does happen in practice since nobody is complaining about this in the wild? |
Oh sure, it was just something I saw while doing something else. I'm not claiming the issue but may get to it. I do find it very informative though that because of this bug, nothing that uses emscripten relies on preserving the values of isolated surrogates -- an interesting data point when it comes to designing built-in string interfaces for WebAssembly. |
Yup, maximum value of the computation is 0x10FFFF, so never fired. The computation is correct for surrogate pairs, otherwise it combines a lone surrogate (low or high) with a non-surrogate or |
Greets, @dcodeIO :) Interesting to see that this bug might have propagated farther afield...
I am not sure what the problem here is and would be happy to be corrected if I misunderstand. From what I can tell:
I concluded that Emscripten users don't care about isolated codepoint values, because if they did this bug would have been noticed and fixed. Is there something wrong in my argument, or maybe are those users using another interface or something? |
Historically the JS<->C string marshalling has been implemented foremost from the focus of 1. performance and 2. code size, in that order. This was because a large part of asm.js and wasm history was about proving that Wasm can be a performant platform compared to native, and that it is not bloated compared to native code. The applications that people have developed on top of Emscripten have largely been non-mission-critical: games, visualizations, utilities, tools, etc. As such, from that backstory, Emscripten built-in provided string marshalling has had two design implications: That being said, since the growth of adoption of Wasm, a lot more people are using Emscripten for different purposes, so the original assumptions to optimize performance and code size at the expense of strict spec-compliant conversion is no longer a fit for everyone. We should definitely add more rigorous string conversion - however that should not sacrifice the performance and code size benefits that existing users have come to expect. That all being said,
it should be carefully noted that Emscripten itself does not use/enforce UTF8ToString() when passing strings to C functions. When implementing JS <-> Wasm communication layers, the developer is expected to manually need to marshal JS strings to Wasm before performing the function call. I.e. Emscripten does not automatically do this, and it does not enforce what the conversion algorithm should be. Users are free to write their own methods to convert strings across the boundary. Though there are some built-in libraries, such as the filesystem or the WebGL support library, and embind (a bindings generator), that do make a fixed choice to utilize UTF8ToString() to do the marshalling, that one can't easily change. However, if you build with I think a good way forward would be to expand the JS library with new families of string marshalling functions (some might want the marshalling functions to throw, others would like it fast, and others would like it to be strict, but no-throw?). Then have specific build time settings to allow configuring e.g. embind and other libraries as to which marshalling API they should use. |
@wingo Hey :) I guess one can make this argument, but it is also important to keep in mind that Emscripten already assumes UTF-8 and typically is used to port applications as a whole, which is very different from the requirements of languages and tools that are closely coupled with JS and its semantics, say eventually via ESM integration or similar where code can be a mix of JS and Wasm. As such it would be very sad if this rather specific bug would result in
in general, ultimately leading to the more closely coupled with JS case remaining either unsupported or having to deal with incorrectness or inefficiency cliffs in, say, a |
FWIW I don't think so, as this is more a property of TextEncoder than TextDecoder, and Emscripten itself doesn't use TextEncoder outside library_sockfs.js.
Options are good, but this case above is just a bug that should be fixed (probably U+FFFD would be the change in spirit with the existing code) and the code size concern is minimal. The broader questions of what conclusions can we draw about the ecosystem are a bit of noodling on my part, apologies for the noise! |
Oh right.. yeah, adding a code path that uses TextEncoder would be good approach as well. |
Emscripten implements its own transcoder from JavaScript strings to UTF-8, and uses it when passing strings to C functions. https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L158. This transcoder logically decodes the stream of JS code units as UTF-16, producing an abstract stream of unicode scalar values, which it then encodes to UTF-8.
However if the JS string contains malformed UTF-16, the decode neither throws an error nor produces the replacement character U+FFFD.
If we extract the decoder, it looks like this:
Here are some erroneous examples:
It doesn't seem that the assertion would ever fire for any codepoint values, but I am not sure about that.
Related to WebAssembly/gc#145.
The text was updated successfully, but these errors were encountered: