Is there a good mechanism for selecting between "a" & "an"? #601
Replies: 7 comments
-
I'm not sure if it's "good", but a while back I put together a regexp solution for selecting between "a" and "an". It doesn't work for your underwear, though.
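For readers who have not seen this kind of approach, here is a minimal sketch of a spelling-based regexp heuristic. It is not the solution linked above: the function name `indefinite_article` and the exception patterns are hypothetical, and a real exception list would be much longer.

```python
import re

# Hypothetical exception patterns, for illustration only.
# Vowel letter but consonant sound: "a user", "a one-off", "a European".
CONSONANT_SOUND = re.compile(r"^(?:uni|use|usu|uti|eu|one\b|once\b)", re.IGNORECASE)
# Consonant letter but vowel sound: "an hour", "an honest", "an heir".
VOWEL_SOUND = re.compile(r"^(?:hour|honest|honou?r|heir)", re.IGNORECASE)
STARTS_WITH_VOWEL_LETTER = re.compile(r"^[aeiou]", re.IGNORECASE)


def indefinite_article(word: str) -> str:
    """Guess "a" or "an" from spelling alone (pronunciation is only approximated)."""
    if CONSONANT_SOUND.match(word):
        return "a"
    if VOWEL_SOUND.match(word):
        return "an"
    return "an" if STARTS_WITH_VOWEL_LETTER.match(word) else "a"
```

A heuristic like this always returns an article, so `indefinite_article("underwear")` gives "an" even though a mass noun wants no article at all, which is the underwear problem in a nutshell.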
-
@grhoten had a lovely bit about this in his UTW breakout session. Alas, I don't think those were recorded. It's more complicated than it looks, because it's based on how the word is pronounced, which is only approximated by how it is spelled. (My session was recorded, and I talk about this obliquely as "the bone dragon problem" near timecode 14:40. "Bone dragon" is a reference that longtime MF2ers will recall from our past conversations.)

I think it is important not to think of it as "determining the word that should precede the placeholder" but instead as "determining which pattern to use". Concatenation-like applications that work in an uninflected language like English turn into a mess in other languages, and that's what this is. If the replacement variables can be constrained, the solution can be programmed using additional data, lexicons, etc. If the replacements are unconstrained, the solution will be a bit more complicated.
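To make the "which pattern to use" framing concrete, here is a minimal sketch for the constrained case. Everything in it is an assumption for illustration: the `PATTERN_KEY` lexicon, the `PATTERNS` table, and `format_found` are hypothetical names, and this is plain Python rather than MF2 syntax.

```python
# Hypothetical lexicon for a constrained set of replacement values.
# Each value is tagged with the pattern key it needs, not with an article.
PATTERN_KEY = {
    "user": "a",
    "ungulate": "an",
    "underwear": "none",
}

# Each key selects a whole message pattern, so a translator can reword
# every variant freely instead of patching one word before a placeholder.
PATTERNS = {
    "a": "You found a {item}!",
    "an": "You found an {item}!",
    "none": "You found {item}!",
}


def format_found(item: str) -> str:
    """Pick the full pattern for a known item; unknown items are rejected."""
    key = PATTERN_KEY.get(item)
    if key is None:
        raise KeyError(f"no pattern key recorded for {item!r}")
    return PATTERNS[key].format(item=item)
```

Because the selection happens at the pattern level, a target language can use different keys, or a single pattern, without the source message dictating an article slot.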
-
See, this is what I was really fishing for. @eemeli gets it! I'm getting to the point where I think the optimal solution for things like this, sentence-start capitalization, etc., is to come up with rules that address 95% of the situations and allow an override flag for the other 5%.
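A rough sketch of that "rule plus override flag" shape, assuming a deliberately naive first-letter rule. The names `article_for` and `with_article` and the `override` parameter are hypothetical, not an existing MF2 option.

```python
from typing import Optional

VOWEL_LETTERS = {"a", "e", "i", "o", "u"}


def article_for(word: str, override: Optional[str] = None) -> str:
    """Return "a", "an", or "" (no article) for word.

    The override, when given, always wins over the rule.
    """
    if override is not None:
        return override
    # Naive default rule: decide from the first letter only.
    return "an" if word[:1].lower() in VOWEL_LETTERS else "a"


def with_article(word: str, override: Optional[str] = None) -> str:
    """Join the chosen article and the word, omitting an empty article."""
    article = article_for(word, override)
    return f"{article} {word}" if article else word
```

The caller handles the exceptions explicitly, for example `with_article("user", override="a")` or `with_article("underwear", override="")`, while everything else falls through to the rule.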
-
Funny you should say 95%. See https://unicode-org.atlassian.net/browse/CLDR-14621 and the related https://unicode-org.atlassian.net/browse/CLDR-15725 (which focuses on a narrower use case of unit prefixes). I did some prototyping of the latter, and it is quite promising: in a large majority of cases the result was better, and (as expected) in some cases it was worse, but those were only a fraction of the 'better' cases.

As Addison said, cases where it is worse are typically where changes need to reflect the pronunciation, and the orthography doesn't easily let you get to the pronunciation without a dictionary lookup. And the worst cases are where a dictionary lookup won't help (such as English "a unionized company" vs. "an unionized particle": union-ized vs. un-ionized). However, the latter cases are pretty rare in many languages, and one simply wouldn't include any rule that didn't have a high good:bad ratio. Thus I think that (logically) postprocessing the data model to perform boundary adjustments will, overall, result in improvements for users in the formatted string.
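As a rough illustration of what a boundary adjustment pass could look like, here is a sketch that walks formatted parts and fixes an article sitting directly before a substituted value. The `(kind, text)` part representation, `guess_article`, and `adjust_boundaries` are all assumptions for this example; they are not taken from the CLDR tickets or from any MF2 implementation.

```python
import re
from typing import List, Tuple

Part = Tuple[str, str]  # (kind, text); kind is "literal" or "value"

# An article at the very end of a literal, followed only by whitespace.
TRAILING_ARTICLE = re.compile(r"\b(a|an)(\s+)$", re.IGNORECASE)


def guess_article(word: str) -> str:
    # Placeholder spelling rule; a real rule would reflect pronunciation.
    return "an" if word[:1].lower() in {"a", "e", "i", "o", "u"} else "a"


def adjust_boundaries(parts: List[Part]) -> str:
    """Rewrite "a"/"an" at each literal/value boundary, then join the parts."""
    out = []
    for i, (kind, text) in enumerate(parts):
        nxt = parts[i + 1] if i + 1 < len(parts) else None
        if kind == "literal" and nxt is not None and nxt[0] == "value":
            m = TRAILING_ARTICLE.search(text)
            if m:
                article = guess_article(nxt[1])
                if m.group(1)[0].isupper():
                    article = article.capitalize()  # keep sentence-start case
                text = text[: m.start()] + article + m.group(2)
        out.append(text)
    return "".join(out)
```

For example, `adjust_boundaries([("literal", "You found a "), ("value", "ungulate"), ("literal", "!")])` yields "You found an ungulate!". With this naive rule, "user" would come out worse, which is exactly the pronunciation-dependent class of failures described above.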
-
I find it amusing to be talking about English here. Other languages have similar but different problems. For example, German has three genders (masculine, feminine, neuter) and four cases (nominative, accusative, genitive and dative). The combination of these affects articles (ein/eine/etc.). You need a dictionary to solve this, since, unlike the English example, nothing is encoded in the words themselves. An example more like English would be Turkish vowel harmony. Turkish has a rarely used indefinite article (that is, the equivalent of a/an) and its form depends on the last vowel in the noun. The code for this would obviously be different from the English code...

Japanese or Chinese translators, meanwhile, are looking at removing placeholders for article generation or at selectors that don't do anything. Translation checking tools might complain about this, to their vast annoyance. Rinse and repeat for every other language.

If we had a mechanism or mechanisms for these cases inside MF2, the resulting messages would be complicated to set up and translate. Either source message authors would need to understand the problem and include appropriate placeholders/selectors, or target language translators would have to insert them. Mark is probably closer to the mark in suggesting:
There are other means to practice avoidance or to set up messages to work appropriately (100% of the time) for constrained cases and without invoking NLP.
-
Yeah, I don't think MF2 itself should try to deliver a solution. I think it should provide the flexibility to enable reasonable solutions, and I think it does, via the function registry and some of the message syntax.
-
I welcome this discussion. Some of my examples from my talk can be found here: https://www.youtube.com/watch?v=C2e7hYIkqoM (around 3:36-4:47). This topic flows into a discussion that I had with @macchiati in between sessions at UTW. For all of the exceptional cases, you need a structured lexicon. If you have phonetic information, which is available from places like Wiktionary, you can derive a lot of the properties. For simplicity, I tend to convert the phonetic properties to just property bits for a given word. For a given language, this is the important phonetic information.
I'm sure that there is more, but those are the ones that come to mind. If a given word is missing from the lexicon or is unannotated, the default behavior kicks in, which is something similar to the following:
There are some more nuances, but that's the gist of the algorithm. Regular expressions are not flexible enough. The algorithm that I have makes extensive use of multiple UnicodeSet objects and a Unicode normalizer.

Now this discussion leads to a larger topic that you will eventually ask about: "How do you make a phrase definite, indefinite or construct in a given language?" Well, now you need additional properties from a lexicon. You need to know what the gender is (masculine, feminine, neuter, common or epicene). You need to know the grammatical number (singular, dual or plural). You need to know the grammatical case (see German), and there can be other language-specific grammatical properties involved. Then when you get to Swedish, you realize that definiteness is not an article but is attached as a suffix to the word (usually a morphological transformation), and that requires word inflection. Our framework already handles that stuff.

The framework that I helped to create can also handle heteronyms (same spelling, different meaning and different pronunciation), but that goes back to the presentation that I gave to this group before I disengaged from it. That framework helps with ambiguous words and custom words. I don't think MF2 is set up well to handle solutions to these kinds of problems.

So if you got this far in my response, and you're still interested in this topic, I'll reiterate that I'm very interested in participating in a group to create and maintain structured lexicons. That would be a way to represent the grammatical category values (grammemes) of the words, and the relationship between surface forms of the same lemma (important for word inflection). Then you don't force the localizers into annotating redundant information for words that is inherently known to a speaker of a language.
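A very small sketch of the "property bits in a lexicon, default behavior otherwise" idea, for readers who want something concrete. This is not the framework described above: the `Phon` flags, the `LEXICON` entries, and the spelling-based fallback are all assumptions, and the real default algorithm is described as relying on UnicodeSet objects and normalization rather than a first-letter check.

```python
from enum import IntFlag
from typing import Dict


class Phon(IntFlag):
    """Hypothetical phonetic property bits for an English word."""
    NONE = 0
    VOWEL_START = 1      # pronunciation begins with a vowel sound
    CONSONANT_START = 2  # pronunciation begins with a consonant sound


# Hypothetical lexicon entries, e.g. derived from pronunciation data.
LEXICON: Dict[str, Phon] = {
    "hour": Phon.VOWEL_START,
    "user": Phon.CONSONANT_START,
    "ungulate": Phon.VOWEL_START,
}


def article_from_lexicon(word: str) -> str:
    """Use annotated property bits when present, else an assumed default."""
    props = LEXICON.get(word.lower(), Phon.NONE)
    if props & Phon.VOWEL_START:
        return "an"
    if props & Phon.CONSONANT_START:
        return "a"
    # Missing or unannotated word: assumed spelling-based default.
    return "an" if word[:1].lower() in {"a", "e", "i", "o", "u"} else "a"
```

Gender, number, and case could be modeled as further flags on the same entry, which is where a shared structured lexicon would save localizers from annotating the same words again and again.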
-
If I have a phrase like
Sometimes I want "You found a user!", sometimes I want "You found an ungulate!", and sometimes I want "You found underwear!"
Has anyone found a good mechanism for determining the word that should precede the placeholder?