-
Notifications
You must be signed in to change notification settings - Fork 677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support core unicode functionality #229
Conversation
* Currently, only utf-8 external encoding is supported, so not decoding is needed. | ||
*/ | ||
return ecma_utf8_string_to_number (str_p, str_size); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing comment
|
||
const lit_code_point max_one_byte_code_point = 0x7F; | ||
const lit_code_point max_two_bytes_code_point = 0x7FF; | ||
const lit_code_point max_three_bytes_code_point = 0xFFFF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you, please, introduce definitions for the contants too?
#define UTF8_TWO_BYTE_CODE_POINT_MAX (0x7FF)
cce15c3
to
2939973
Compare
Updating this patch likely has a large maintenance burden. I would land it soon. What is the decision: do you use CESU8 or not? |
@zherczeg me too, the main issue right now is RegExps and upcoming JSON - these things are postponing this patch. |
Regexp is landed, JSON hopefully lands today. These barriers should be eliminated very soon. |
2939973
to
f8d6306
Compare
@zherczeg, the decision is to use ordinary UTF-8. |
ecma_string_t* | ||
ecma_new_ecma_string (const ecma_char_t *string_p) /**< zero-terminated string */ | ||
ecma_string_t * | ||
ecma_new_ecma_string_from_code_unit (ecma_char_t code_unit) /**< code unit */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, ecma_char_t is uint16_t. How can we create code UTF-8 octets from characters > 0xffff?
How do we combine code points > 0xffff? E.g. if string1 = "0xd804"; string2 = "0xdc00"; string1+string2 should be the utf representation of 0x11000 in our UTF-8 based system. Otherwise the comparison of the result to another string, which directly contains a 0x11000 character will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zherczeg, see the comment in lit-globals.h:
/* Scalar values from 0xD800 to 0xDFFF are permanently reserved by Unicode standard to encode high and low
* surrogates in UTF-16 (Code points 0x10000 - 0x10FFFF are encoded via pair of surrogates in UTF-16).
* Despite that the official Unicode standard says that no UTF forms can encode these code points, we allow
* them to be encoded inside strings. The reason for that is compatibility with ECMA standard.
*
* For example, assume a string which consists one Unicode character: 0x1D700 (Mathematical Italic Small Epsilon).
* It has the following representation in UTF-16: 0xD835 0xDF00.
*
* ECMA standard allows extracting a substring from this string:
* > var str = String.fromCharCode (0xD835, 0xDF00); // Create a string containing one character: 0x1D700
* > str.length; // 2
* > var str1 = str.substring (0, 1);
* > str1.length; // 1
* > str1.charCodeAt (0); // 55349 (this equals to 0xD835)
*
* Internally original string would be represented in UTF-8 as the following byte sequence: 0xF0 0x9D 0x9C 0x80.
* After substring extraction high surrogate 0xD835 should be encoded via UTF-8: 0xED 0xA0 0xB5.
*
* Pair of low and high surrogates encoded separately should never occur in internal string representation,
* it should be encoded as any code point and occupy 4 bytes. So, when constructing a string from two surrogates,
* it should be processed gracefully;
* > var str1 = String.fromCharCode (0xD835); // 0xED 0xA0 0xB5 - internal representation
* > var str2 = String.fromCharCode (0xDF00); // 0xED 0xBC 0x80 - internal representation
* > var str = str1 + str2; // 0xF0 0x9D 0x9C 0x80 - internal representation,
* // !!! not 0xED 0xA0 0xB5 0xED 0xBC 0x80
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Support for such cases is assumed to be added in next patch. This one is already too huge.
+lgtm |
Great! |
…iteral component. JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]
JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]
Add utf-8 processing routines. Change ecma_char_t from char/uint16_t to uint16_t. Apply all utf-8 processing routines. Change char to jerry_api_char in API functions' declarations. JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]
f8d6306
to
fd9ff8e
Compare
The patch does the following: