Calculate byte-order correctly #15

gnithin · 2020-09-28T18:08:01Z

Background

The following weirdness was observed -

Some tests (those containing emojis) failed in python 2.7.(12|15|18) locally on my mac machine, while they all seem to pass in the py2 CI.
This local tests failures happened to others too - (Noah and Vidhu when they ran this locally).
Running the tests on python3 worked everywhere.
Running the test-cases on the mobile on-demand machine seemed to mimic the CI behaviour

The reason for all of this was the handling of emojis(which are surrogate pairs) by python. From Python 2.2 to python 3.3, every python interpreter can be compiled in 2 flavors - narrow and wide builds. Narrow builds store each character in 2 bytes, while wide build stores them in 4 bytes. The python version that comes with homebrew is built using the narrow version (2 bytes, which I have on my mac), while the ones used in the github actions is the wide build (Here's an issue which talks about that). (Also here's a link with more background for this narrow-wide build idea, refer "What does this have to do with Python?" section)

We need to handle emojis (surrogate pairs) as characters with 2 separate entries, because that is how UCS-2 handles it, and consequently that is how SMS will be sent.

Why is this useful?

Currently we are estimating a surrogate-pair emoji as 1 character in prod. This is incorrect since when sending messages, the mobile vendor will send it as UCS-2 in which this will be treated as 2 characters. If we estimate a message to be 1 segment, while the mobile-vendor estimates it to be 2 segments, and we send the message to ~100 people, we are off by -100 segments. This PR tries to correct that.

Changes

Explicitly encode the input character Utf-16

Tests

Revert all the existing tests
Add more tests for explicitly checking the byte-order, with both surrogate-pairs and the BMP emojis.

gnithin · 2020-09-28T18:08:32Z

sms_toolkit/utils/encoding.py

+    if is_python_3():
+        return [c for c in character.encode('utf-16-be')]
+    else:
+        return [ord(c) for c in character.encode('utf-16-be')]


Why is this change required?

The previous approach had one flaw - using `len`. Depending on whether the python interpreter build is narrow or wide (or in < py3.3 where the string encoding is based off of the highest code-point), the behaviour of len changes if the input is a code-point that's greater than what can be fit in 2 bytes (UTF-16 encodes them in 2 2-byte pairs, which we need to detect, since UCS-2 has a fixed 2 bytes limit).
In the narrow build - py2.7.12 (u'\U0001f600' is the unicode code point for 😀) -

>>> len(u'\U0001f600') 2

In wide build - py 2.7.12

>>> len(u'\U0001f600') 1

In python 3.5 (No need for u string prefix)-

>>> len('\U0001f600') 1

The issue here is that we want the len to be 2 everywhere. So we need to encode it to utf-16 explicitly.

Why the be in utf-16-be though?

Encodings can be done using big endian or little endian form (for the given 2 bytes one will have the higher byte first and the other will have the lower byte first). If we don't explicitly state that, then will be 2 additional bytes in the output which will be used to explicitly say what the endianess the method has used. That information is not required and We can directly tell the interpreter to use big-endian when encoding.

Why do we need to handle py2 and py3 separately?

str.encode behaves differently in python 2 and python 3. In python 2 encode returns an 8-bit string version of the Unicode string, encoded in the requested encoding (Source). In python 3, encode returns an encoded version of the string as a bytes object. (Source)
We basically want the list of byte values for the given character here. In python 3, we directly get it by iterating through every entry of encode. In py2, since the value is a string version, we can get the value by ording it.

But the original implementation that the old code is based off of, seems to work well. How is that?

Yep, that'll always work because JS stores strings in 16-bits. It seems to be using UTF-16 and UCS-2 at different places, but a string character is 16 bits. Source. "😀".length always seems to return 2. So, since surrogate pairs are explicitly returned as 2 separate entries, that code will always work. Since we have to support different builds, we need to be more careful.

This is super thorough, great work here.

gnithin · 2020-09-28T18:10:46Z

tests/test_message_segmenting.py

        profiled_message = profiling.profile_message(short_ucs2_text)
        assert profiled_message["num_segments"] == 1
-        assert profiled_message["message_length"] == 14
+        assert profiled_message["message_length"] == 15


Some of these changes are basically reverting what it was, before being rewritten here. After speaking with the author Vidhu (the author), The test-rewrite was just meant to make the tests pass in CI. Building the tests locally on his machine failed too, indicating the same problem.

gnithin · 2020-09-28T18:12:04Z

tests/conftest.py

@@ -56,3 +66,426 @@ def unicode_gsm7_chars():
 @pytest.fixture
 def byte_string_gsm7_chars(unicode_gsm7_chars):
    return unicode_gsm7_chars.encode('utf-8')
+
+
+@pytest.fixture


Explicitly compare bytes so that this issue can be debugged faster if it occurs again

sms_toolkit/utils/encoding.py

chrisconlon-klaviyo

This looks great, nice work!

ndurell · 2020-09-29T02:23:54Z

sms_toolkit/utils/encoding.py

+        A list of the bytes
+    """
+    if is_python_3():
+        return [c for c in character.encode('utf-16-be')]


You don't need ord here?

Nope. Encode returns bytes in py3. From the comment above -

str.encode behaves differently in python 2 and python 3.
In python 2 encode returns an 8-bit string version of the Unicode string, encoded in the requested encoding (Source).
In python 3, encode returns an encoded version of the string as a bytes object. (Source)

We basically want the list of byte values for the given character here. In python 3, we directly get it by iterating through every entry of encode. In py2, since the value is a string version, we can get the value by ording it.

ndurell

Let's rip the bandaid off.

gnithin added 7 commits September 28, 2020 01:04

Combine MMS and UCS-2 calculations

275b59c

Use a standard way to convert into utf-16

48d9fac

Fix the message segments tests

ecb0d2a

Add byte-array tests

7b65e53

Fix the test for long-ucs

c7d799c

Add profile tests for ucs2

5cf1353

Merge branch 'master' into nithin/fix/py-version

c0c8129

gnithin commented Sep 28, 2020

View reviewed changes

Remove unused imports

89aa775

gnithin marked this pull request as ready for review September 28, 2020 18:15

gnithin requested review from chrisconlon-klaviyo and ndurell September 28, 2020 18:15

chrisconlon-klaviyo reviewed Sep 28, 2020

View reviewed changes

sms_toolkit/utils/encoding.py Outdated Show resolved Hide resolved

Fix typo

fde5cad

chrisconlon-klaviyo approved these changes Sep 28, 2020

View reviewed changes

ndurell reviewed Sep 29, 2020

View reviewed changes

ndurell approved these changes Sep 29, 2020

View reviewed changes

gnithin merged commit db668e2 into master Sep 30, 2020

gnithin deleted the nithin/fix/py-version branch September 30, 2020 15:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Calculate byte-order correctly #15

Calculate byte-order correctly #15

gnithin commented Sep 28, 2020

gnithin Sep 28, 2020

chrisconlon-klaviyo Sep 28, 2020

gnithin Sep 28, 2020

gnithin Sep 28, 2020

chrisconlon-klaviyo left a comment

ndurell Sep 29, 2020

gnithin Sep 29, 2020

ndurell left a comment

Calculate byte-order correctly #15

Calculate byte-order correctly #15

Conversation

gnithin commented Sep 28, 2020

Background

Why is this useful?

Changes

Tests

gnithin Sep 28, 2020

Choose a reason for hiding this comment

chrisconlon-klaviyo Sep 28, 2020

Choose a reason for hiding this comment

gnithin Sep 28, 2020

Choose a reason for hiding this comment

gnithin Sep 28, 2020

Choose a reason for hiding this comment

chrisconlon-klaviyo left a comment

Choose a reason for hiding this comment

ndurell Sep 29, 2020

Choose a reason for hiding this comment

gnithin Sep 29, 2020

Choose a reason for hiding this comment

ndurell left a comment

Choose a reason for hiding this comment