Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect decoding from JS strings to USV strings #15324

Open
wingo opened this issue Oct 19, 2021 · 8 comments
Open

Incorrect decoding from JS strings to USV strings #15324

wingo opened this issue Oct 19, 2021 · 8 comments

Comments

@wingo
Copy link
Contributor

wingo commented Oct 19, 2021

Emscripten implements its own transcoder from JavaScript strings to UTF-8, and uses it when passing strings to C functions. https://github.com/emscripten-core/emscripten/blob/main/src/runtime_strings.js#L158. This transcoder logically decodes the stream of JS code units as UTF-16, producing an abstract stream of unicode scalar values, which it then encodes to UTF-8.

However if the JS string contains malformed UTF-16, the decode neither throws an error nor produces the replacement character U+FFFD.

If we extract the decoder, it looks like this:

function firstUSV(str) {
  var i = 0
  var u = str.charCodeAt(i); // possibly a lead surrogate
  if (u >= 0xD800 && u <= 0xDFFF) {
    var u1 = str.charCodeAt(++i);
    u = 0x10000 + ((u & 0x3FF) << 10) | (u1 & 0x3FF);
  }
    
  if (u > 0x10FFFF)
    throw new Error('Bad codepoint: 0x' + u.toString(16));

  return '0x' + u.toString(16);
}

Here are some erroneous examples:

> firstUSV('\uD800')
"0x10000"
> firstUSV('\uD800a')
"0x10061"
> firstUSV('\uDC00')
"0x10000"
> firstUSV('\uDC00\uD800')
"0x10000"
> firstUSV('\uDFFF\uDFFF')
"0x10ffff"

It doesn't seem that the assertion would ever fire for any codepoint values, but I am not sure about that.

Related to WebAssembly/gc#145.

@sbc100
Copy link
Collaborator

sbc100 commented Oct 19, 2021

Patches more than welcome if you feel like fixing :) I'm not sure what the best approach is but it seems like the intent was to at least issue a warning, so making that work would be the first fix I guess? And testing it.

I guess this does happen in practice since nobody is complaining about this in the wild?

@wingo
Copy link
Contributor Author

wingo commented Oct 20, 2021

Oh sure, it was just something I saw while doing something else. I'm not claiming the issue but may get to it. I do find it very informative though that because of this bug, nothing that uses emscripten relies on preserving the values of isolated surrogates -- an interesting data point when it comes to designing built-in string interfaces for WebAssembly.

@dcodeIO
Copy link
Contributor

dcodeIO commented Nov 9, 2021

It doesn't seem that the assertion would ever fire for any codepoint values

Yup, maximum value of the computation is 0x10FFFF, so never fired. The computation is correct for surrogate pairs, otherwise it combines a lone surrogate (low or high) with a non-surrogate or NaN to a supplementary character >0xFFFF. Interestingly, it does not only affect lone surrogates, but also non-surrogates following one, indicating that the conclusion above might suffer from confirmation bias. If a resilient data point is the goal, one has to look at cases where preserving the values of isolated surrogates is important.

@wingo
Copy link
Contributor Author

wingo commented Nov 11, 2021

Greets, @dcodeIO :) Interesting to see that this bug might have propagated farther afield...

because of this bug, nothing that uses emscripten relies on preserving the values of isolated surrogates

the conclusion above might suffer from confirmation bias. If a resilient data point is the goal, one has to look at cases where preserving the values of isolated surrogates is important.

I am not sure what the problem here is and would be happy to be corrected if I misunderstand. From what I can tell:

  1. C/C++ functions exported to JS that take strings receive their strings in memory
  2. That memory is encoded using stringToUTF8Array, that this bug is about
  3. If the string contains isolated surrogates, the result is still valid UTF-8, but does not encode the isolated surrogates
    1. the isolated surrogate's encoding is indistinguishable from a valid astral-plane codepoint
    2. the following u16 if any is also consumed
    3. the original string can't be reconstructed from the encoding

I concluded that Emscripten users don't care about isolated codepoint values, because if they did this bug would have been noticed and fixed. Is there something wrong in my argument, or maybe are those users using another interface or something?

@juj
Copy link
Collaborator

juj commented Nov 11, 2021

Historically the JS<->C string marshalling has been implemented foremost from the focus of 1. performance and 2. code size, in that order. This was because a large part of asm.js and wasm history was about proving that Wasm can be a performant platform compared to native, and that it is not bloated compared to native code.

The applications that people have developed on top of Emscripten have largely been non-mission-critical: games, visualizations, utilities, tools, etc. As such, from that backstory, Emscripten built-in provided string marshalling has had two design implications:
a. well-defined and performant on well-formed data, unspecified results on malformed data,
b. since C strings are being generated, zeros inside a string are not expected to occur, (C strings terminate with a zero byte)
c. since only well-formed unicode strings are expected to occur, unicode code points outside the defined planes, and also stray surrogates are not expected to be present - results being unspecified.

That being said, since the growth of adoption of Wasm, a lot more people are using Emscripten for different purposes, so the original assumptions to optimize performance and code size at the expense of strict spec-compliant conversion is no longer a fit for everyone. We should definitely add more rigorous string conversion - however that should not sacrifice the performance and code size benefits that existing users have come to expect.

That all being said,

Emscripten implements its own transcoder from JavaScript strings to UTF-8, and uses it when passing strings to C functions.

it should be carefully noted that Emscripten itself does not use/enforce UTF8ToString() when passing strings to C functions. When implementing JS <-> Wasm communication layers, the developer is expected to manually need to marshal JS strings to Wasm before performing the function call. I.e. Emscripten does not automatically do this, and it does not enforce what the conversion algorithm should be. Users are free to write their own methods to convert strings across the boundary.

Though there are some built-in libraries, such as the filesystem or the WebGL support library, and embind (a bindings generator), that do make a fixed choice to utilize UTF8ToString() to do the marshalling, that one can't easily change.

However, if you build with -s TEXTDECODER=2 linker flag, then Emscripten changes the implementation of these functions to always route to the JavaScript TextDecoder API. I wonder if that build flag will give a better surrogate handling for what is needed here?

I think a good way forward would be to expand the JS library with new families of string marshalling functions (some might want the marshalling functions to throw, others would like it fast, and others would like it to be strict, but no-throw?). Then have specific build time settings to allow configuring e.g. embind and other libraries as to which marshalling API they should use.

@dcodeIO
Copy link
Contributor

dcodeIO commented Nov 11, 2021

@wingo Hey :) I guess one can make this argument, but it is also important to keep in mind that Emscripten already assumes UTF-8 and typically is used to port applications as a whole, which is very different from the requirements of languages and tools that are closely coupled with JS and its semantics, say eventually via ESM integration or similar where code can be a mix of JS and Wasm. As such it would be very sad if this rather specific bug would result in

an interesting data point when it comes to designing built-in string interfaces for WebAssembly

in general, ultimately leading to the more closely coupled with JS case remaining either unsupported or having to deal with incorrectness or inefficiency cliffs in, say, a stringref, which I so far generally consider a valuable addition as it, unlike Interface Types, could finally take others' needs into account as well. A good first step would be to listen to those others instead of eagerly trying to find arguments to disregard their concerns where there likely are none. I guess I'm just worried that history will repeat, and that this time it will be stringref :(

@wingo
Copy link
Contributor Author

wingo commented Nov 11, 2021

However, if you build with -s TEXTDECODER=2 linker flag, then Emscripten changes the implementation of these functions to always route to the JavaScript TextDecoder API. I wonder if that build flag will give a better surrogate handling for what is needed here?

FWIW I don't think so, as this is more a property of TextEncoder than TextDecoder, and Emscripten itself doesn't use TextEncoder outside library_sockfs.js.

I think a good way forward would be to expand the JS library with new families of string marshalling functions (some might want the marshalling functions to throw, others would like it fast, and others would like it to be strict, but no-throw?). Then have specific build time settings to allow configuring e.g. embind and other libraries as to which marshalling API they should use.

Options are good, but this case above is just a bug that should be fixed (probably U+FFFD would be the change in spirit with the existing code) and the code size concern is minimal. The broader questions of what conclusions can we draw about the ecosystem are a bit of noodling on my part, apologies for the noise!

@juj
Copy link
Collaborator

juj commented Nov 11, 2021

this is more a property of TextEncoder than TextDecoder

Oh right.. yeah, adding a code path that uses TextEncoder would be good approach as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants