-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calculate byte-order correctly #13
Conversation
if is_python_3(): | ||
return [c for c in character.encode('utf-16-be')] | ||
else: | ||
return [ord(c) for c in character.encode('utf-16-be')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change required?
The previous approach had one flaw - using `len`. Depending on whether the python interpreter build is narrow or wide (or in < py3.3 where the string encoding is based off of the highest code-point), the behaviour of len changes if the input is a code-point that's greater than what can be fit in 2 bytes (UTF-16 encodes them in 2 2-byte pairs, which we need to detect, since UCS-2 has a fixed 2 bytes limit).
In the narrow build - py2.7.12 (u'\U0001f600' is the unicode code point for 😀) -
>>> len(u'\U0001f600')
2
In wide build - py 2.7.12
>>> len(u'\U0001f600')
1
In python 3.5 (No need for u string prefix)-
>>> len('\U0001f600')
1
The issue here is that we want the len to be 2 everywhere. So we need to encode it to utf-16 explicitly.
Why the be
in utf-16-be
though?
Encodings can be done using big endian or little endian form (for the given 2 bytes one will have the higher byte first and the other will have the lower byte first). If we don't explicitly state that, then will be 2 additional bytes in the output which will be used to explicitly say what the endianess the method has used. That information is not required and We can directly tell the interpreter to use big-endian when encoding.
Why do we need to handle py2 and py3 separately?
str.encode behaves differently in python 2 and python 3. In python 2 encode returns an 8-bit string version of the Unicode string, encoded in the requested encoding (Source). In python 3, encode returns an encoded version of the string as a bytes object. (Source)
We basically want the list of byte values for the given character here. In python 3, we directly get it by iterating through every entry of encode. In py2, since the value is a string version, we can get the value by ord
ing it.
But the original implementation that the old code is based off of, seems to work well. How is that?
Yep, that'll always work because JS stores strings in 16-bits. It seems to be using UTF-16 and UCS-2 at different places, but a string character is 16 bits. Source. "😀".length
always seems to return 2. So, since surrogate pairs are explicitly returned as 2 separate entries, that code will always work. Since we have to support different builds, we need to be more careful.
@@ -52,22 +52,22 @@ def test_message_profiling_gsm7(self, short_gsm7_text, long_gsm7_text, byte_stri | |||
def test_message_profiling_ucs2(self, short_ucs2_text, long_ucs2_text): | |||
profiled_message = profiling.profile_message(short_ucs2_text) | |||
assert profiled_message["num_segments"] == 1 | |||
assert profiled_message["message_length"] == 14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all changes going back to the way it originally was before it was rewritten here. After speaking with the author, The rewrite was unintentional and just meant to make the tests pass.
275b59c
to
1bea18d
Compare
DO NOT MERGE!
This draft depends on these PRs to be merged first. Will convert it into a PR after those are merged -
Background
The following weirdness was observed -
The reason for all of this was the handling of emojis(which are surrogate pairs) by python. From Python 2.2 to python 3.3, every python interpreter can be compiled in 2 flavors - narrow and wide builds. Narrow builds store each character in 2 bytes, while wide build stores them in 4 bytes. The python version that comes with homebrew is built using the narrow version (2 bytes, which I have on my mac), while the ones used in the github actions is the wide build (Here's an issue which talks about that). (Also here's a link with more background for this narrow-wide build idea, refer "What does this have to do with Python?" section)
We need to handle emojis (surrogate pairs) as characters with 2 separate entries, because that is how UCS-2 handles it, and consequently that is how SMS will be sent.
Why is this useful?
Currently we are estimating a surrogate-pair emoji as 1 character in prod. This is incorrect since when sending messages, the mobile vendor will send it as UCS-2 in which this will be treated as 2 characters. If we estimate a message to be 1 segment, while the mobile-vendor estimates it to be 2 segments, and we send the message to ~100 people, we are off by -100 segments. This PR tries to correct that.
Changes
Tests