Localize "shortcodes.ts" #53

Closed
MayeulC opened this issue Apr 24, 2020 · 16 comments

@MayeulC

MayeulC commented Apr 24, 2020

It is somewhat awkward to type an English description of an emoji to enter it, or to learn its "meaning", when typing in another language. Moreover, people who are not proficient in English are unduly penalized when using otherwise-translated software that relies on this database (I'm thinking of riot.im at least). People are not going to teach English to their relatives (grandparents, etc.) just because they would like to type 🌛 or something similar.

I guess the easiest way would be to provide more shortcodes.ts files, but crowd-sourcing could also be handled through Weblate (ideally providing the emoji as context).

See: element-hq/element-web#11013 and possibly https://github.com/vector-im/riot-web/issues/9298

@milesj
Owner

milesj commented May 13, 2020

I've thought about it and it's a bit complicated at the moment, since there can be so many permutations, especially with i18n. #49

@MayeulC
Author

MayeulC commented May 14, 2020

Well, at least there are only so many i18n choices to pick from, and if a user sets the consumer software to Spanish, that software can decide to provide Spanish short codes.

element-hq/element-web#49 is on another level entirely, in my opinion... as it stands, multiple pieces of software currently seem to expect that users are going to learn their specific terminology. They can probably be helped along the learning curve by providing suggestions based on what they type, while being shown emojibase's "canonical" interpretation. But that, of course, is up to the developer to decide.

After looking at https://github.com/milesj/emojibase/blob/master/packages/data/fr/raw.json, I see that English shortcodes are present, so en/raw.json could be used as a fallback. Moreover, some tags appear only once and could therefore be considered shortcodes (unless I am misunderstanding something). Example: "blasé" for unamused 😒

With the current format, the first array index could be considered the canonical emojibase shortcode, I guess.
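The "tags that only appear once" idea can be sketched roughly as follows. This is only an illustration: the `RawEmoji` shape is a simplified assumption, not the actual raw.json schema, `findUniqueTags` is a hypothetical helper, and the sample tags other than "blasé" are made up.

```typescript
// Hypothetical, simplified shape of an entry in a locale's raw.json.
interface RawEmoji {
  emoji: string;
  tags: string[];
}

// Collect tags that are listed by exactly one emoji; such tags are
// unambiguous and could serve as candidate localized shortcodes.
function findUniqueTags(emojis: RawEmoji[]): Record<string, string> {
  // tag -> every emoji that lists it
  const owners: Record<string, string[]> = Object.create(null);
  for (const { emoji, tags } of emojis) {
    for (const tag of tags) {
      const key = tag.toLowerCase();
      const list = owners[key] || (owners[key] = []);
      if (list.indexOf(emoji) === -1) list.push(emoji);
    }
  }

  const unique: Record<string, string> = Object.create(null);
  for (const tag of Object.keys(owners)) {
    if (owners[tag].length === 1) unique[tag] = owners[tag][0];
  }
  return unique;
}

// Illustrative data only: "visage" is shared, so it is not a candidate.
const sampleFr: RawEmoji[] = [
  { emoji: '😒', tags: ['blasé', 'visage'] },
  { emoji: '😀', tags: ['sourire', 'visage'] },
];
const candidates = findUniqueTags(sampleFr);
console.log(Object.keys(candidates)); // "blasé" and "sourire", but not "visage"
```

A consumer could still fall back to the English shortcodes for any emoji that ends up without a unique tag.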

I'm a bit confused about where the data comes from, and who decides what. If I were to make a PR to clean up shortcodes for a language, would that be acceptable, for instance?

@milesj
Owner

milesj commented May 14, 2020

@MayeulC Shortcodes are hardcoded (https://github.com/milesj/emojibase/blob/master/packages/generator/src/resources/shortcodes.ts) and are not derived from the raw annotations/tags. That's what makes them a bit hard to maintain.

We'd need to figure out a strategy to properly support shortcodes for all locales, instead of just 1 locale.

@strixaluco

Have you considered basing localizations off of Unicode CLDR?

@milesj
Owner

milesj commented Aug 1, 2020

@strixaluco All of the annotations/labels are based on that data. I chose not to use them for shortcodes since they're... not really short, most of them are super long.

@strixaluco

Does seeing only English labels/annotations, no matter the chosen language, mean that I should report an issue to the downstream project?

@milesj
Owner

milesj commented Aug 3, 2020

@strixaluco Not really, nothing much they can do about it. It's just something I'd need to figure out and there isn't really a best option at the moment.

@strixaluco

@milesj I'm sorry for bugging you with this again, I just wanted to clear things up. Could you please tell me whether the following statements are correct?

  1. issue: localization of shortcodes.ts is currently impossible because the shortcodes are hardcoded, and a strategy for supporting multiple locales is needed.
  2. issue: annotations/labels and their translations are based on Unicode CLDR, but a separate issue prevents their translations from being shown. Only English labels/annotations are available at the moment.

@milesj
Owner

milesj commented Aug 11, 2020

@strixaluco 1 is correct, 2 is slightly off.

The annotation field for each emoji object is based on the CLDR data, as seen here: https://raw.githubusercontent.com/milesj/emojibase/master/packages/data/de/raw.json This is fully localized.

The shortcodes field is based on the hard-coded file that's in English, which is used for all locales.

Now the following questions arise:

  1. Do we localize the hard-coded shortcodes file? This is a TON of work.

  2. Do we base shortcodes off the localized annotation? We would end up with really LONG shortcodes like :grinsendes_gesicht_mit_großen_augen:.

  3. Do Slack/Discord/etc localize shortcodes when languages change? If so, how does that work/look like?
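To make option 2 concrete, here is a minimal sketch of deriving a shortcode from a localized annotation. `annotationToShortcode` is a hypothetical helper, not part of emojibase, and the punctuation it strips is an assumption about what a shortcode should not contain.

```typescript
// Turn a localized annotation into a shortcode: lowercase it and join
// the words with underscores, since shortcodes cannot contain spaces.
function annotationToShortcode(annotation: string): string {
  return annotation
    .trim()
    .toLowerCase()
    // Collapse whitespace and common punctuation into one underscore.
    .replace(/[\s\-:,.'"()!?]+/g, '_')
    // Drop underscores left over at either end.
    .replace(/^_+|_+$/g, '');
}

const code = annotationToShortcode('grinsendes Gesicht mit großen Augen');
console.log(`:${code}:`); // → ":grinsendes_gesicht_mit_großen_augen:"
```

The result shows the length problem directly: the derived shortcode is as long as the annotation it came from.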

@strixaluco

Thank you for the clarification.

Re 2.
Given that the annotation and tags fields are fully localized, as the German translation shows, I did more testing with Element Android and Element Desktop, since both projects use emojibase. I changed the UI to German, cleared the cache, and reloaded, but neither lustig nor gesicht invoked any emoji, and all annotations and keywords were always in English. This leads me to conclude that the problem is downstream, contrary to what you suggested in the response above: #53 (comment). That may have been caused by an unclear question on my side in the first place, so I'm sorry for any confusion.

Re 1.
First, to address your questions,

  1. as mentioned by @MayeulC above, localization of shortcodes can be crowd-sourced via an online translation service like Weblate, and although it's going to be a ton of work, slow but eventual localization is better than nothing.

  2. basing shortcodes on the localized annotation is also better than nothing, but I imagine that fixing keyword translations in downstream projects (see Re 2.) could be a good intermediate solution as well.

  3. Unfortunately, I don't use any shortcode-supplying software other than Element, so I can't comment on this one.

Further thoughts:

As was discussed in element-hq/element-web#49, lack of standardization is the main problem here, and in my opinion the best strategy would be to stick to Unicode as closely as possible. Upon reading UTS #51 in particular and unicode.org resources in general, it seems that the terminology there is slightly confusing, but I assume that the label scheme is as follows:

names:

  • Unicode name
  • CLDR short name/TTS name

annotations:

  • CLDR short name/TTS name
  • keyword

Per UTS #51:

As noted in Section 2.1 Names, there is one further kind of annotation, called a CLDR short name. This is also referred to as the TTS name, for use in text-to-speech processing such as providing a short, descriptive emoji name when reading text for accessibility purposes. In this case the CLDR names provide several advantages over formal Unicode character names:

  • They can be shorter and less cumbersome than the formal name, whose requirement for name uniqueness often results in names that are overly long, such as BLACK RIGHT-POINTING TRIANGLE WITH DOUBLE VERTICAL BAR for ⏯.
  • They can apply to emoji that are represented by sequences as well as those represented by single characters.
  • They can be updated to better reflect current emoji depictions and usage.

TTS names are also outside the current scope of this document.

Unfortunately, I couldn't find another document that has TTS names within its scope, but it seems they might be shorter than CLDR names, which would fit the aim of this discussion. Obviously, bringing changes to Unicode standards will require a lot of collaborative effort, discussion, and time, but it could be a future solution.

@milesj
Owner

milesj commented Aug 11, 2020

@strixaluco All good points. This is my current thought process on how to solve this.

1 - Generate shortcodes for each locale using the localized annotation. This would create a file like so: packages/data/<locale>/shortcodes.json.

2 - Create shortcode presets based off popular platforms, and move the hard-coded emojibase shortcodes into a preset.

packages/data/shortcodes/emojibase.json
packages/data/shortcodes/slack.json
packages/data/shortcodes/github.json

3 - Mark the current hard-coded shortcodes as "legacy" and create a new emojibase preset that aligns more closely with the Unicode name, instead of an emotion.

4 - Update APIs to stitch multiple shortcode presets together into a single dataset. This allows consumers to use emojibase + slack + localized shortcodes for example.

fetchFromCDN('de/data.json', 'latest', { shortcodes: ['emojibase', 'slack', 'locale'] });

flattenEmojiData(data, [emojibaseCodes, slackCodes, localeCodes]);

1, 3, and 4 are rather easy. Could probably knock those out in a day. 2 is the complicated one, as I'm not sure where to fetch those platform specific shortcodes from.

The final open question is whether the presets (in 2) should also be localized? I'm leaning yes, which is where crowdsourcing might come into play.
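The stitching described in step 4 might look roughly like the sketch below. It is not the emojibase implementation: the `ShortcodePreset` shape (hexcode mapped to one or more shortcodes), the `stitchShortcodes` helper, and the sample values are all assumptions made for illustration.

```typescript
// Hypothetical preset shape: emoji hexcode -> one shortcode or several.
type ShortcodePreset = Record<string, string | string[]>;

// Merge presets in order; earlier presets keep priority, and duplicate
// shortcodes for the same emoji are dropped.
function stitchShortcodes(presets: ShortcodePreset[]): Record<string, string[]> {
  const merged: Record<string, string[]> = Object.create(null);
  for (const preset of presets) {
    for (const hexcode of Object.keys(preset)) {
      const value = preset[hexcode];
      const codes = Array.isArray(value) ? value : [value];
      const list = merged[hexcode] || (merged[hexcode] = []);
      for (const code of codes) {
        if (list.indexOf(code) === -1) list.push(code);
      }
    }
  }
  return merged;
}

// Made-up sample presets, named after the variables in the call above.
const emojibaseCodes: ShortcodePreset = { '1F603': ['smiley'] };
const localeCodes: ShortcodePreset = {
  '1F603': 'grinsendes_gesicht_mit_großen_augen',
};
const stitched = stitchShortcodes([emojibaseCodes, localeCodes]);
console.log(stitched['1F603']);
```

Keeping every preset's codes in one ordered list lets a consumer treat the first entry as the preferred shortcode while still matching the others during autocomplete.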

@strixaluco

Super! From a "non-English user of Element" perspective, I'd imagine 1. is already a win and even more than that if annotation + keyword problem is fixed downstream.
2. and 4. seem to be great for emojibase itself, while 3. is also nice from adherence-to-standards point of view.

@milesj
Owner

milesj commented Aug 15, 2020

This will be resolved in the next major. Will publish a pre-release and do some testing.

milesj closed this as completed Aug 15, 2020

@MayeulC
Author

MayeulC commented Aug 15, 2020

Thanks a lot for looking into this 🎉

If I may add something regarding crowdsourcing, I feel it is quite important to provide native speakers with an avenue to improve their shortcodes (be it new coinage, synonyms, fixing typos) in a place that is easy enough to find :)

Thanks a lot again, I'll recommend this project around when translated shortcodes are asked for!

@KovalevArtem
Contributor

KovalevArtem commented Sep 18, 2020

image
When translating strings such as "american_samoa", should I also use "_" to separate words?
американское самоа or американское_самоа?
When a one-word string translates to an expression of several words, should each word be separated by "_"?

@milesj
Copy link
Owner

milesj commented Sep 18, 2020

@KovalevArtem Correct! For "shortcodes", they don't support spaces, so underscores are used instead.
