Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support core unicode functionality #229

Merged
merged 3 commits into from
Jun 29, 2015
Merged

Support core unicode functionality #229

merged 3 commits into from
Jun 29, 2015

Conversation

sand1k
Copy link
Contributor

@sand1k sand1k commented Jun 23, 2015

The patch does the following:

  • makes utf-8 internal representation for strings and changes all string processing routines accordingly
  • changes API to properly process non zero-terminated strings and expect strings of arbitrary encoding (for now, only utf-8)

@egavrin egavrin changed the title Core unicode support Support core unicode functionality Jun 23, 2015
* Currently, only utf-8 external encoding is supported, so not decoding is needed.
*/
return ecma_utf8_string_to_number (str_p, str_size);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comment


const lit_code_point max_one_byte_code_point = 0x7F;
const lit_code_point max_two_bytes_code_point = 0x7FF;
const lit_code_point max_three_bytes_code_point = 0xFFFF;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you, please, introduce definitions for the contants too?

#define UTF8_TWO_BYTE_CODE_POINT_MAX (0x7FF)

@sand1k sand1k force-pushed the Andrey-unicode-dev branch 2 times, most recently from cce15c3 to 2939973 Compare June 28, 2015 18:30
@zherczeg
Copy link
Member

Updating this patch likely has a large maintenance burden. I would land it soon.

What is the decision: do you use CESU8 or not?

@egavrin
Copy link
Contributor

egavrin commented Jun 29, 2015

@zherczeg me too, the main issue right now is RegExps and upcoming JSON - these things are postponing this patch.

@zherczeg
Copy link
Member

Regexp is landed, JSON hopefully lands today. These barriers should be eliminated very soon.

@sand1k sand1k force-pushed the Andrey-unicode-dev branch from 2939973 to f8d6306 Compare June 29, 2015 11:34
@sand1k
Copy link
Contributor Author

sand1k commented Jun 29, 2015

@zherczeg, the decision is to use ordinary UTF-8.

ecma_string_t*
ecma_new_ecma_string (const ecma_char_t *string_p) /**< zero-terminated string */
ecma_string_t *
ecma_new_ecma_string_from_code_unit (ecma_char_t code_unit) /**< code unit */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, ecma_char_t is uint16_t. How can we create code UTF-8 octets from characters > 0xffff?

How do we combine code points > 0xffff? E.g. if string1 = "0xd804"; string2 = "0xdc00"; string1+string2 should be the utf representation of 0x11000 in our UTF-8 based system. Otherwise the comparison of the result to another string, which directly contains a 0x11000 character will fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zherczeg, see the comment in lit-globals.h:

/* Scalar values from 0xD800 to 0xDFFF are permanently reserved by Unicode standard to encode high and low
 * surrogates in UTF-16 (Code points 0x10000 - 0x10FFFF are encoded via pair of surrogates in UTF-16).
 * Despite that the official Unicode standard says that no UTF forms can encode these code points, we allow
 * them to be encoded inside strings. The reason for that is compatibility with ECMA standard.
 *
 * For example, assume a string which consists one Unicode character: 0x1D700 (Mathematical Italic Small Epsilon).
 * It has the following representation in UTF-16: 0xD835 0xDF00.
 *
 * ECMA standard allows extracting a substring from this string:
 * > var str = String.fromCharCode (0xD835, 0xDF00); // Create a string containing one character: 0x1D700
 * > str.length; // 2
 * > var str1 = str.substring (0, 1);
 * > str1.length; // 1
 * > str1.charCodeAt (0); // 55349 (this equals to 0xD835)
 *
 * Internally original string would be represented in UTF-8 as the following byte sequence: 0xF0 0x9D 0x9C 0x80.
 * After substring extraction high surrogate 0xD835 should be encoded via UTF-8: 0xED 0xA0 0xB5.
 *
 * Pair of low and high surrogates encoded separately should never occur in internal string representation,
 * it should be encoded as any code point and occupy 4 bytes. So, when constructing a string from two surrogates,
 * it should be processed gracefully;
 * > var str1 = String.fromCharCode (0xD835); // 0xED 0xA0 0xB5 - internal representation
 * > var str2 = String.fromCharCode (0xDF00); // 0xED 0xBC 0x80 - internal representation
 * > var str = str1 + str2; // 0xF0 0x9D 0x9C 0x80 - internal representation,
 *                          // !!! not 0xED 0xA0 0xB5 0xED 0xBC 0x80
 */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support for such cases is assumed to be added in next patch. This one is already too huge.

@zherczeg
Copy link
Member

+lgtm
I see issues with splitting/combining character codes, but a fix is promised later. The patch is big, but it is going in the right direction. We could just land it.

@egavrin
Copy link
Contributor

egavrin commented Jun 29, 2015

Great! make push

sand1k added 3 commits June 29, 2015 23:27
…iteral component.

JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]
Add utf-8 processing routines.
Change ecma_char_t from char/uint16_t to uint16_t.
Apply all utf-8 processing routines.
Change char to jerry_api_char in API functions' declarations.

JerryScript-DCO-1.0-Signed-off-by: Andrey Shitov [email protected]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
critical Raises security concerns ecma core Related to core ECMA functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants