Support core unicode functionality #229

sand1k · 2015-06-23T16:48:48Z

The patch does the following:

Makes UTF-8 the internal representation for strings and changes all string-processing routines accordingly.
Changes the API to properly process non-zero-terminated strings and to accept strings of arbitrary encoding (for now, only UTF-8).
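
To illustrate why an explicit size matters once strings are no longer zero-terminated, here is a minimal sketch, not taken from the patch. The helper name `sketch_utf8_length_in_code_units` is hypothetical, and the code assumes well-formed UTF-8 input. It counts the UTF-16 code units that a UTF-8 buffer represents, which is what ECMAScript reports as `length`; an embedded zero byte is counted as an ordinary character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): returns the number of UTF-16
 * code units represented by a UTF-8 buffer of explicit size.
 * Assumption: the buffer holds well-formed UTF-8. */
static size_t
sketch_utf8_length_in_code_units (const uint8_t *buf_p, size_t buf_size)
{
  size_t units = 0;
  size_t i = 0;

  while (i < buf_size)
  {
    const uint8_t lead = buf_p[i];
    size_t seq_len;

    if (lead < 0x80)
    {
      seq_len = 1; /* ASCII, including an embedded 0x00 byte */
    }
    else if ((lead & 0xE0) == 0xC0)
    {
      seq_len = 2;
    }
    else if ((lead & 0xF0) == 0xE0)
    {
      seq_len = 3;
    }
    else
    {
      assert ((lead & 0xF8) == 0xF0);
      seq_len = 4;
    }

    assert (i + seq_len <= buf_size);

    /* A 4-byte sequence encodes a code point above 0xFFFF,
     * which takes two UTF-16 code units (a surrogate pair). */
    units += (seq_len == 4) ? 2 : 1;
    i += seq_len;
  }

  return units;
}
```

The size is passed alongside the pointer, so the caller never relies on a terminating zero to find the end of the string.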

egavrin · 2015-06-23T16:56:09Z

jerry-core/ecma/base/ecma-helpers-conversion.cpp

+   *   Currently, only utf-8 external encoding is supported, so no decoding is needed.
+   */
+  return ecma_utf8_string_to_number (str_p, str_size);
+}


Missing comment

ruben-ayrapetyan · 2015-06-25T19:16:37Z

jerry-core/lit/lit-strings.cpp

+
+  const lit_code_point max_one_byte_code_point = 0x7F;
+  const lit_code_point max_two_bytes_code_point = 0x7FF;
+  const lit_code_point max_three_bytes_code_point = 0xFFFF;


Could you please introduce definitions for the constants too?

#define UTF8_TWO_BYTE_CODE_POINT_MAX (0x7FF)
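
As a rough sketch of how such definitions might be used (this is not the patch's code): only `UTF8_TWO_BYTE_CODE_POINT_MAX` comes from the suggestion above. The other macro names and the helper `sketch_utf8_encoded_size` are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only UTF8_TWO_BYTE_CODE_POINT_MAX appears in the review comment;
 * the remaining names are hypothetical, chosen to match its style. */
#define UTF8_ONE_BYTE_CODE_POINT_MAX   (0x7F)
#define UTF8_TWO_BYTE_CODE_POINT_MAX   (0x7FF)
#define UTF8_THREE_BYTE_CODE_POINT_MAX (0xFFFF)
#define UNICODE_CODE_POINT_MAX         (0x10FFFF)

/* Hypothetical helper: number of bytes needed to encode a code point in UTF-8. */
static size_t
sketch_utf8_encoded_size (uint32_t code_point)
{
  assert (code_point <= UNICODE_CODE_POINT_MAX);

  if (code_point <= UTF8_ONE_BYTE_CODE_POINT_MAX)
  {
    return 1;
  }
  if (code_point <= UTF8_TWO_BYTE_CODE_POINT_MAX)
  {
    return 2;
  }
  if (code_point <= UTF8_THREE_BYTE_CODE_POINT_MAX)
  {
    return 3;
  }
  return 4;
}
```

With macros, the same limits can be shared between the encoder, the decoder, and any validation code instead of being redeclared as local constants in each function.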

zherczeg · 2015-06-29T09:30:38Z

Updating this patch likely has a large maintenance burden. I would land it soon.

What is the decision: do you use CESU-8 or not?

egavrin · 2015-06-29T10:19:20Z

@zherczeg me too. The main issues right now are RegExp and the upcoming JSON; these are postponing this patch.

zherczeg · 2015-06-29T10:27:33Z

RegExp has landed, and JSON will hopefully land today. These barriers should be eliminated very soon.

sand1k · 2015-06-29T12:22:02Z

@zherczeg, the decision is to use ordinary UTF-8.

zherczeg · 2015-06-29T14:24:09Z

jerry-core/ecma/base/ecma-helpers-string.cpp

-ecma_string_t*
-ecma_new_ecma_string (const ecma_char_t *string_p) /**< zero-terminated string */
+ecma_string_t *
+ecma_new_ecma_string_from_code_unit (ecma_char_t code_unit) /**< code unit */


If I understand correctly, ecma_char_t is uint16_t. How can we create UTF-8 octets from characters > 0xffff?

How do we combine surrogates into code points > 0xffff? E.g. if string1 = "0xd804" and string2 = "0xdc00", then string1+string2 should be the UTF-8 representation of 0x11000 in our UTF-8 based system. Otherwise, comparing the result to another string that directly contains a 0x11000 character will fail.

sand1k · 2015-06-29

@zherczeg, see the comment in lit-globals.h:

/* Scalar values from 0xD800 to 0xDFFF are permanently reserved by Unicode standard to encode high and low
 * surrogates in UTF-16 (Code points 0x10000 - 0x10FFFF are encoded via pair of surrogates in UTF-16).
 * Despite that the official Unicode standard says that no UTF forms can encode these code points, we allow
 * them to be encoded inside strings. The reason for that is compatibility with ECMA standard.
 *
 * For example, assume a string which consists one Unicode character: 0x1D700 (Mathematical Italic Small Epsilon).
 * It has the following representation in UTF-16: 0xD835 0xDF00.
 *
 * ECMA standard allows extracting a substring from this string:
 * > var str = String.fromCharCode (0xD835, 0xDF00); // Create a string containing one character: 0x1D700
 * > str.length; // 2
 * > var str1 = str.substring (0, 1);
 * > str1.length; // 1
 * > str1.charCodeAt (0); // 55349 (this equals to 0xD835)
 *
 * Internally original string would be represented in UTF-8 as the following byte sequence: 0xF0 0x9D 0x9C 0x80.
 * After substring extraction high surrogate 0xD835 should be encoded via UTF-8: 0xED 0xA0 0xB5.
 *
 * Pair of low and high surrogates encoded separately should never occur in internal string representation,
 * it should be encoded as any code point and occupy 4 bytes. So, when constructing a string from two surrogates,
 * it should be processed gracefully;
 * > var str1 = String.fromCharCode (0xD835); // 0xED 0xA0 0xB5 - internal representation
 * > var str2 = String.fromCharCode (0xDF00); // 0xED 0xBC 0x80 - internal representation
 * > var str = str1 + str2; // 0xF0 0x9D 0x9C 0x80 - internal representation,
 * //                          !!! not 0xED 0xA0 0xB5 0xED 0xBC 0x80
 */

Support for such cases is planned for the next patch. This one is already too large.
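
For illustration only, the concatenation rule described in that comment could look roughly like the sketch below. This is not the follow-up patch: the function names are hypothetical, and the code assumes that both inputs are well-formed in the relaxed sense above (lone surrogates are allowed as 3-byte sequences).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode the 3-byte UTF-8 sequence starting at p. */
static uint32_t
sketch_decode_three_bytes (const uint8_t *p)
{
  return ((uint32_t) (p[0] & 0x0F) << 12)
         | ((uint32_t) (p[1] & 0x3F) << 6)
         | (uint32_t) (p[2] & 0x3F);
}

/* Hypothetical helper: concatenate two buffers, merging a trailing high
 * surrogate with a leading low surrogate into one 4-byte sequence.
 * out_p must have room for left_size + right_size bytes.
 * Returns the number of bytes written. */
static size_t
sketch_concat (const uint8_t *left_p, size_t left_size,
               const uint8_t *right_p, size_t right_size,
               uint8_t *out_p)
{
  assert (left_p != NULL && right_p != NULL && out_p != NULL);

  if (left_size >= 3 && right_size >= 3)
  {
    const uint8_t *tail_p = left_p + left_size - 3;

    /* A continuation byte never matches the 1110xxxx lead pattern, so this
     * test only succeeds when the last three bytes form a whole sequence. */
    if ((tail_p[0] & 0xF0) == 0xE0 && (right_p[0] & 0xF0) == 0xE0)
    {
      const uint32_t high = sketch_decode_three_bytes (tail_p);
      const uint32_t low = sketch_decode_three_bytes (right_p);

      if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF)
      {
        const uint32_t cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        uint8_t *q = out_p + (left_size - 3);

        memcpy (out_p, left_p, left_size - 3);
        q[0] = (uint8_t) (0xF0 | (cp >> 18));
        q[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        q[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        q[3] = (uint8_t) (0x80 | (cp & 0x3F));
        memcpy (q + 4, right_p + 3, right_size - 3);

        /* Two 3-byte sequences became one 4-byte sequence. */
        return left_size + right_size - 2;
      }
    }
  }

  /* No surrogate pair at the boundary: plain concatenation. */
  memcpy (out_p, left_p, left_size);
  memcpy (out_p + left_size, right_p, right_size);
  return left_size + right_size;
}
```

Merging at the boundary keeps a single canonical byte sequence per string value, so byte-wise comparison against a string that contains the supplementary character directly still succeeds.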

zherczeg · 2015-06-29T14:38:26Z

+lgtm
I see issues with splitting/combining character codes, but a fix is promised later. The patch is big, but it is going in the right direction. We could just land it.

egavrin · 2015-06-29T14:44:57Z

Great! make push


egavrin changed the title ~~Core unicode support~~ Support core unicode functionality Jun 23, 2015

egavrin reviewed Jun 23, 2015

ruben-ayrapetyan reviewed Jun 25, 2015

sand1k force-pushed the Andrey-unicode-dev branch 2 times, most recently from cce15c3 to 2939973 Compare June 28, 2015 18:30

sand1k force-pushed the Andrey-unicode-dev branch from 2939973 to f8d6306 Compare June 29, 2015 11:34

zherczeg reviewed Jun 29, 2015

sand1k added 3 commits June 29, 2015 23:27

Move char type definitions and magic string processing functions to l…

a0c5974

…iteral component. JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]

Change ecma_length_t and jerry_api_length_t from uint16_t to uint32_t.

c4b0cd2

JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]

Add core unicode functionality.

fd9ff8e

Add utf-8 processing routines. Change ecma_char_t from char/uint16_t to uint16_t. Apply all utf-8 processing routines. Change char to jerry_api_char in API functions' declarations. JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]

sand1k force-pushed the Andrey-unicode-dev branch from f8d6306 to fd9ff8e Compare June 29, 2015 20:29

sand1k merged commit fd9ff8e into master Jun 29, 2015

sand1k deleted the Andrey-unicode-dev branch June 29, 2015 21:00

somang-park unassigned ruben-ayrapetyan Nov 25, 2016

This was referenced May 17, 2020

stack-overflow in vm_loop #3750

Closed

stack-overflow in ecma_regexp_match #3753

Closed
