Is there a good mechanism for selecting between "a" & "an"? #601

SimonClark · 2024-01-15T20:35:49Z

SimonClark
Jan 15, 2024

If I have a phrase like

You found a {$thingType}!

Sometimes I want "You found a user!", sometimes I want "You found an ungulate!", and sometimes I want "You found underwear!"

Has anyone found a good mechanism for determining the word that should precede the placeholder?

eemeli · 2024-01-15T20:47:15Z

eemeli
Jan 15, 2024
Maintainer

I'm not sure if it's "good", but a while back I put together a regexp solution for selecting between "a" and "an".

It doesn't work for your underwear, though.


aphillips · 2024-01-15T20:48:08Z

aphillips
Jan 15, 2024
Maintainer

@grhoten had a lovely bit about this in his UTW breakout session. Alas, I don't think those were recorded. It's more complicated than it looks, because it's based on how the word is pronounced, which is only approximated by how it is spelled. (My session was recorded and I talk about this obliquely as "the bone dragon problem" near timecode 14:40. "Bone dragon" is a reference that longtime MF2ers will recall from our past conversations.)

I think it is important to think of this not as "determining the word that should precede the placeholder" but as "determining which pattern to use". Concatenation-like approaches that work in an uninflected language like English turn into a mess in other languages, and that's what this is. If the replacement variables can be constrained, the solution can be programmed using additional data, lexicons, etc. If the replacements are unconstrained, the solution will be a bit more complicated.


SimonClark · 2024-01-16T21:10:41Z

SimonClark
Jan 16, 2024
Author

"It doesn't work for your underwear, though."

See, this is what I was really fishing for. @eemeli gets it!

I'm getting to the point where I think the optimal solution for things like this, sentence-start capitalization, etc., is to come up with rules that address 95% of the situations and allow an override flag for the other 5%.
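A minimal sketch of the "rules plus override" idea, assuming a handful of spelling patterns as the rules. The function names, the `override` parameter, and the pattern lists are all hypothetical and far from complete; this is not the regexp solution mentioned earlier in the thread.

```python
import re

# Hypothetical rule set: spelling patterns that only approximate pronunciation.
# Vowel letter but consonant sound ("user", "one").
_CONSONANT_SOUND = re.compile(r"^(?:uni|use|usu|uti|eu|ewe|one|once)", re.IGNORECASE)
# Consonant letter but vowel sound ("hour", "honest").
_VOWEL_SOUND = re.compile(r"^(?:hour|honest|honou?r|heir)", re.IGNORECASE)
_VOWEL_LETTER = re.compile(r"^[aeiou]", re.IGNORECASE)


def indefinite_article(noun, override=None):
    """Return "a", "an", or "" for a noun; an explicit override always wins."""
    if override is not None:
        return override
    if _VOWEL_SOUND.match(noun):
        return "an"
    if _CONSONANT_SOUND.match(noun):
        return "a"
    return "an" if _VOWEL_LETTER.match(noun) else "a"


def you_found(noun, override=None):
    """Build the example message, dropping the article when it is empty."""
    article = indefinite_article(noun, override)
    phrase = f"{article} {noun}" if article else noun
    return f"You found {phrase}!"


print(you_found("user"))
print(you_found("ungulate"))
print(you_found("underwear", override=""))  # caller flags the uncountable noun
```

The rules still misfire on words like "unimportant" (it matches the `uni` pattern), which is exactly the kind of case the override is meant to cover.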


macchiati · 2024-01-16T23:37:39Z

macchiati
Jan 16, 2024
Maintainer

Funny you should say 95%

See https://unicode-org.atlassian.net/browse/CLDR-14621 and the related https://unicode-org.atlassian.net/browse/CLDR-15725 (which focuses on a narrower use case of unit prefixes).

I did some prototyping of the latter, and it is quite promising: in a large majority of cases the result was better and, as expected, in some cases it was worse, but the worse cases were only a fraction of the better ones.

unicode-org/cldr#2156

As Addison said, the cases where it is worse are typically those where changes need to reflect the pronunciation, and the orthography doesn't easily let you get to the pronunciation without a dictionary lookup. The worst cases are those where even a dictionary lookup won't help (such as English "a unionized company" vs. "an unionized particle": union-ized vs. un-ionized). However, those cases are pretty rare in many languages, and one simply wouldn't include any rule that didn't have a high good:bad ratio.

Thus I think that (logically) postprocessing the data model to perform boundary adjustments will (overall) result in improvements for users in the formatted string.
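One way to picture such a postprocessing step, as a rough sketch only: the `Part` shape below is a hypothetical stand-in for formatted parts, not an MF2 or CLDR API, and the vowel check is a crude spelling test rather than a pronunciation lookup.

```python
import re
from dataclasses import dataclass


@dataclass
class Part:
    """One piece of a formatted message (hypothetical shape, not an MF2 API)."""
    text: str
    is_placeholder: bool = False


# A standalone "a" or "an" followed only by whitespace at the end of a literal.
_TRAILING_ARTICLE = re.compile(r"\b(a|an)(\s+)$", re.IGNORECASE)


def starts_with_vowel_letter(word):
    # Crude spelling-based stand-in for a pronunciation check.
    return word[:1].lower() in set("aeiou")


def adjust_boundaries(parts):
    """Fix "a"/"an" in a literal that directly precedes a placeholder value."""
    result = []
    for index, part in enumerate(parts):
        nxt = parts[index + 1] if index + 1 < len(parts) else None
        match = _TRAILING_ARTICLE.search(part.text)
        at_boundary = (
            not part.is_placeholder
            and nxt is not None
            and nxt.is_placeholder
            and bool(nxt.text)
        )
        if at_boundary and match:
            article = "an" if starts_with_vowel_letter(nxt.text) else "a"
            # Preserve the capitalization of the original article.
            if match.group(1)[0].isupper():
                article = article.capitalize()
            text = part.text[: match.start()] + article + match.group(2)
            result.append(Part(text))
        else:
            result.append(part)
    return "".join(p.text for p in result)


print(adjust_boundaries([Part("You found a "), Part("ungulate", True), Part("!")]))
```

Because the adjustment runs on the formatted parts, the message author writes a single pattern and the translator does not have to add selectors for it.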


aphillips · 2024-01-17T15:02:10Z

aphillips
Jan 17, 2024
Maintainer

I find it amusing to be talking about English here. Other languages have similar but different problems.

For example, German has three genders (masculine, feminine, neuter) and four cases (nominative, accusative, genitive and dative). The combination of these affects the articles (ein/eine/etc.). You need a dictionary to solve this since, unlike the English example, nothing is encoded in the words themselves.
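To make the dictionary dependency concrete, here is a small sketch. The article table is standard German grammar; the three-word `GENDER` lexicon and the function name are hypothetical, and the sketch returns only the article (the noun itself may also change form, as in the genitive).

```python
# Indefinite article forms indexed by (gender, case).
EIN_FORMS = {
    ("masculine", "nominative"): "ein",
    ("masculine", "accusative"): "einen",
    ("masculine", "dative"): "einem",
    ("masculine", "genitive"): "eines",
    ("feminine", "nominative"): "eine",
    ("feminine", "accusative"): "eine",
    ("feminine", "dative"): "einer",
    ("feminine", "genitive"): "einer",
    ("neuter", "nominative"): "ein",
    ("neuter", "accusative"): "ein",
    ("neuter", "dative"): "einem",
    ("neuter", "genitive"): "eines",
}

# Hypothetical mini-lexicon: the gender cannot be derived from the spelling.
GENDER = {"Hund": "masculine", "Katze": "feminine", "Haus": "neuter"}


def german_indefinite_article(noun, case):
    """Look up the article; unknown nouns fail instead of guessing."""
    gender = GENDER.get(noun)
    if gender is None:
        raise KeyError(f"no gender recorded for {noun!r}")
    return EIN_FORMS[(gender, case)]


print(german_indefinite_article("Hund", "accusative"))
print(german_indefinite_article("Katze", "dative"))
```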

An example more like English would be Turkish vowel harmony. Turkish has a rarely used indefinite article (that is, the equivalent of a/an) and its form depends on the last vowel in the noun. The code for this would obviously be different from the English code...

Japanese or Chinese translators, meanwhile, are looking at removing placeholders for article generation or at selectors that don't do anything. Translation checking tools might complain about this, to their vast annoyance.

Rinse and repeat, times the number of languages.

If we had a mechanism or mechanisms for these cases inside MF2, the resulting messages would be complicated to set up and translate. Either source message authors would need to understand the problem and include appropriate placeholders/selectors or target language translators would have to insert them.

Mark is probably closer to the mark in suggesting:

"Thus I think that (logically) postprocessing the data model to perform boundary adjustments will (overall) result in improvements for users in the formatted string."

There are other means to practice avoidance or to set up messages to work appropriately (100% of the time) for constrained cases and without invoking NLP.


SimonClark · 2024-01-17T18:02:22Z

SimonClark
Jan 17, 2024
Author

Yeah, whatever the approach, I don't think MF2 itself should try to deliver a solution.

I think it should provide the flexibility to enable reasonable solutions, and I think it does via the function registry, and some of the message syntax.


grhoten · 2024-01-17T18:39:52Z

grhoten
Jan 17, 2024
Maintainer

I welcome this discussion. Some of my examples from my talk can be found here: https://www.youtube.com/watch?v=C2e7hYIkqoM (around 3:36-4:47).

This topic flows into a discussion that I had with @macchiati in between sessions at UTW. For all of the exceptional cases, you need a structured lexicon. If you have phonetic information, which is available from places like Wiktionary, you can derive a lot of the properties. For simplicity, I tend to convert the phonetic properties to just property bits for a given word.

For a given language, this is the important phonetic information.

Language	Grammemes
en	vowel-start, consonant-start
fr	vowel-start, consonant-start
ko	vowel-end, consonant-end, rieul-end
tr	vowel-start, consonant-start, front-unround, back-unround, front-round, back-round

I'm sure that there is more, but those are the ones that come to mind.
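As an illustration of "property bits for a given word", here is a sketch using Python's `enum.Flag`. The `Phonetic` names mirror a few of the grammemes in the table, but the class, the tiny `LEXICON`, and the helper functions are hypothetical and are not taken from the framework described here. Marking a rieul-final word as also consonant-final is an assumption.

```python
from enum import Flag, auto


class Phonetic(Flag):
    """Phonetic properties stored as combinable bits."""
    VOWEL_START = auto()
    CONSONANT_START = auto()
    VOWEL_END = auto()
    CONSONANT_END = auto()
    RIEUL_END = auto()


# Hypothetical lexicon entries keyed by (language, word).
LEXICON = {
    ("en", "hour"): Phonetic.VOWEL_START,      # silent "h"
    ("en", "user"): Phonetic.CONSONANT_START,  # pronounced with an initial "y" sound
    # Assumption: a rieul-final word also carries the consonant-end bit.
    ("ko", "물"): Phonetic.CONSONANT_END | Phonetic.RIEUL_END,
    ("ko", "나무"): Phonetic.VOWEL_END,
}


def lookup(language, word):
    """Return the property bits for a word, or None if it is not annotated."""
    return LEXICON.get((language, word.lower()))


def english_article(word):
    """Pick "a"/"an" from the lexicon; None means "use the default behavior"."""
    bits = lookup("en", word)
    if bits is None:
        return None
    return "an" if Phonetic.VOWEL_START in bits else "a"


print(english_article("hour"), english_article("user"), english_article("ungulate"))
```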

If a given word is missing from the lexicon or is unannotated, the default behavior kicks in, which is something similar to the following:

1. Find the first or last letter of the word. A framework needs to check whether the word starts or ends with a vowel, a consonant, or another language-dependent sound property.
2. Decompose it.
3. Find the base letter (this is a configurable step with a UnicodeSet).
4. Check whether it matches, case-insensitively, the set of start or end vowel properties (the two sets differ slightly even within a language). There is a default set, but French, Turkish and Dutch use variations on it.
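The steps above could be sketched roughly as follows. This is not the framework's code: plain `frozenset` objects stand in for ICU `UnicodeSet`, `unicodedata` stands in for the normalizer, and the vowel sets (including the Turkish variation) are illustrative assumptions.

```python
import unicodedata

# Stand-ins for UnicodeSet objects; the contents are illustrative assumptions.
DEFAULT_VOWELS = frozenset("aeiou")
VOWELS_BY_LANGUAGE = {
    "tr": DEFAULT_VOWELS | {"ı"},  # dotless i has no decomposition to a base vowel
}


def edge_letter(word, from_end=False):
    """Step 1: find the first (or last) letter, skipping non-letters."""
    chars = reversed(word) if from_end else word
    for ch in chars:
        if ch.isalpha():
            return ch
    return None


def base_letter(ch):
    """Steps 2 and 3: decompose (NFD) and keep the first non-combining character."""
    for c in unicodedata.normalize("NFD", ch):
        if unicodedata.combining(c) == 0:
            return c
    return ch


def has_vowel_edge(word, language="en", from_end=False):
    """Step 4: case-insensitive match against the language's vowel set."""
    ch = edge_letter(word, from_end)
    if ch is None:
        return False
    vowels = VOWELS_BY_LANGUAGE.get(language, DEFAULT_VOWELS)
    return base_letter(ch).casefold() in vowels


print(has_vowel_edge("Élan"), has_vowel_edge("dragon"), has_vowel_edge("café", from_end=True))
```

Note that this fallback answers "vowel" for "user", since it only looks at spelling; that is why the lexicon lookup comes first and this runs only for missing or unannotated words.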

There are some more nuances, but that's the gist of the algorithm.

Regular expressions are not flexible enough. The algorithm that I have makes extensive use of multiple UnicodeSet objects and a Unicode normalizer.

Now this discussion leads to a larger topic that you will eventually ask about: "How do you make a phrase definite, indefinite or construct in a given language?" Well, now you need additional properties from a lexicon. You need to know what the gender is (masculine, feminine, neuter, common or epicene). You need to know the grammatical number (singular, dual or plural). You need to know the grammatical case (see German), and there can be other language-specific grammatical properties involved. Then when you get to Swedish, you realize that definiteness is not expressed with an article but is attached as a suffix to the word (usually a morphological transformation), and that requires word inflection. Our framework already handles that stuff.

The framework that I helped to create can also handle heteronyms (same spelling, different meaning and different pronunciation), but that goes back to the presentation that I gave to this group before I disengaged from it. That framework helps with ambiguity of words or custom words. I don't think MF2 is set up well to handle solutions to these kinds of problems.

So if you got this far in my response, and you're still interested in this topic, I'll reiterate that I'm very interested in participating in a group to create and maintain structured lexicons. That would be a way to represent the grammatical category values (grammemes) of the words, and the relationship between surface forms of the same lemma (important for word inflection). Then you don't force the localizers into annotating redundant information for words that are inherently known for a speaker of a language.
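For readers wondering what such a structured lexicon entry might look like, here is one possible shape, purely as a sketch. The `Lemma` and `SurfaceForm` classes and the grammeme strings are hypothetical; the Swedish forms of "bil" (car) are used because definiteness is a suffix there.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SurfaceForm:
    """One inflected form and the grammemes that select it."""
    text: str
    grammemes: frozenset


@dataclass
class Lemma:
    """A lemma, its inherent grammemes, and its related surface forms."""
    lemma: str
    language: str
    inherent: frozenset  # properties true of every form, such as gender
    forms: list = field(default_factory=list)

    def inflect(self, *wanted):
        """Return the first surface form carrying all requested grammemes."""
        requested = frozenset(wanted)
        for form in self.forms:
            if requested <= form.grammemes:
                return form.text
        return None


BIL = Lemma(
    lemma="bil",
    language="sv",
    inherent=frozenset({"common"}),
    forms=[
        SurfaceForm("bil", frozenset({"singular", "indefinite"})),
        SurfaceForm("bilen", frozenset({"singular", "definite"})),
        SurfaceForm("bilar", frozenset({"plural", "indefinite"})),
        SurfaceForm("bilarna", frozenset({"plural", "definite"})),
    ],
)

print(BIL.inflect("singular", "definite"))
```

With the gender stored once on the lemma, a localizer would not have to restate it in every message that uses the word.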

