unicodeToShortcode() removes mdash symbol #36

Closed
KarelWintersky opened this issue Nov 22, 2022 · 4 comments

@KarelWintersky
Contributor

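// Text before (Russian sample; note the em dash, U+2014, after "Mdash" and after "Ndash"):
// <p>А ведь тут совсем другой смысл заложен. Mdash — это черточка шириной с букву М. В русской типографике ее называют длинным тире. Ndash — соответственно более короткая черточка, часто даже уже, чем буква N.</p>\n
// English translation: <p>But a completely different meaning is intended here. Mdash — a dash as wide as the letter M. In Russian typography it is called a long dash. Ndash — correspondingly, a shorter dash, often even narrower than the letter N.</p>

$content = self::unicodeToShortcode($content);

// Text after (both em dashes have been removed, leaving a double space):
// "<p>А ведь тут совсем другой смысл заложен. Mdash  это черточка шириной с букву М. В русской типографике ее называют длинным тире. Ndash  соответственно более короткая черточка, часто даже уже, чем буква N.</p>\n"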

The em dash was copied and pasted from this article: https://medium.com/@sergeisoloviev/mdash-31c331397e46 (2nd paragraph)

@i-just

i-just commented Mar 16, 2023

The same thing happens to various other punctuation marks, e.g. left and right double quotation marks, left and right single quotation marks, the en dash, and others.

@KarelWintersky changed the title from "unicodeToShortcode() removed mdash symbol" to "unicodeToShortcode() removes mdash symbol" on Mar 16, 2023
brandonkelly added a commit to craftcms/cms that referenced this issue Mar 16, 2023

Works around elvanto/litemoji#36 by only calling LitEmoji::unicodeToShortcode() for 4-byte character sequences
@brandonkelly
Contributor

This was introduced in LitEmoji 4.3 via PR #35. The new regex in unicode-patterns.php is matching several unintended non-emoji characters, and those are getting discarded by the foreach loop in unicodeToShortcode() if there is no matching emoji for them.

I’ve added a workaround in Craft CMS, where we now find all sequences of 4-byte characters in the string first and pass only those into unicodeToShortcode(), leaving the rest of the string intact. Anyone experiencing this issue is welcome to copy that code while we wait for the official LitEmoji fix.
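
For readers who want a feel for this kind of workaround, here is a minimal sketch. It is not the Craft CMS code: the function name `convertFourByteRuns()` and the stub converter are hypothetical, and it assumes that "4-byte characters" means code points U+10000 and above. In real use, the converter passed in would presumably be LitEmoji's unicodeToShortcode().

```php
<?php

/**
 * Hypothetical helper (not part of LitEmoji or Craft CMS): applies $converter
 * only to runs of 4-byte UTF-8 characters (code points U+10000 and above) and
 * leaves everything else, including em dashes and curly quotes, untouched.
 */
function convertFourByteRuns(string $text, callable $converter): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]+/u',
        static fn(array $m): string => $converter($m[0]),
        $text
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8);
    // fall back to the unmodified text in that case.
    return $result ?? $text;
}

// Stand-in converter for demonstration only: it knows a single emoji.
$stub = static fn(string $emoji): string =>
    $emoji === "\u{1F600}" ? ':grinning:' : $emoji;

// The em dash (U+2014) is a 3-byte character, so it never reaches $stub.
echo convertFourByteRuns("Mdash \u{2014} ok \u{1F600}", $stub), PHP_EOL;
```

One caveat with this sketch: the zero-width joiner (U+200D) and the variation selector (U+FE0F) are 3-byte characters, so composite emoji that contain them would be split into separate runs before reaching the converter.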

@joshmcrae
Member

Hi all, we don't have a lot of time to dedicate to this project at the moment, but I've implemented a change to how we do unicode to shortcode conversion in #38. This should prevent any characters that are not known emoji (e.g. em dash, other punctuation) from being discarded, since the conversion now uses a direct str_replace.
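
As a rough illustration of why a direct str_replace cannot drop unknown characters, consider the sketch below. It is not the code from #38: the function name `replaceKnownEmoji()`, the three-entry table, and the shortcode names in it are hypothetical. Only strings that appear as keys in the table are ever touched, so an em dash simply passes through.

```php
<?php

/**
 * Hypothetical helper: replaces only the emoji listed in $map and leaves
 * every other character in $text as it is.
 */
function replaceKnownEmoji(string $text, array $map): string
{
    // str_replace() applies the search strings in array order, so longer
    // sequences must come first; otherwise a base emoji would be replaced
    // before its longer skin-tone variant gets a chance to match.
    uksort($map, static fn(string $a, string $b): int => strlen($b) <=> strlen($a));

    return str_replace(array_keys($map), array_values($map), $text);
}

// Tiny illustrative table with made-up shortcode names.
$map = [
    "\u{1F44D}"          => ':thumbsup:',
    "\u{1F44D}\u{1F3FD}" => ':thumbsup_tone3:',
    "\u{1F600}"          => ':grinning:',
];

// The em dash (U+2014) is not a key in $map, so it is left alone.
echo replaceKnownEmoji("Mdash \u{2014} \u{1F44D}\u{1F3FD} \u{1F600}", $map), PHP_EOL;
```

The cost mentioned in this thread follows from the approach: the whole text is scanned once per table entry, which adds up with a full emoji table and very large inputs.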

This new approach will yield better results, but at a performance cost. Unless you're converting hundreds of kilobytes or more of text at a time, you probably won't notice anything.

I'll report back here when an alpha release is ready.

@joshmcrae
Member

The above changes are now available in version 5.0.0-alpha.
