"Local" text transformations (contextual changes) #160

Closed
mihnita opened this issue Mar 23, 2021 · 13 comments
Labels
resolve-candidate This issue appears to have been answered or resolved, and may be closed soon.

Comments

@mihnita
Collaborator

mihnita commented Mar 23, 2021

After placeholders are replaced with the final text the result should in some cases be "adjusted to match"

Common English example: "I received a {item}"
The "a" there should become "an" for words that start with a vowel.

Might even want to be fancier, imagine handling "I received a <a href="link">{item}</a>", where the "a" should account for a vowel pretty far away, inside an html tag.

There are similar needs in other languages, for example in French la and le become l’ before a word that starts with a vowel. See https://en.wikipedia.org/wiki/Elision_(French)

"Contractions" similar to the French one are also common in Italian, Spanish, Portuguese.
And this kind of "local transformation" might also help with Belarusian, Korean, and others.
(for Spanish see https://www.thoughtco.com/pronunciation-based-changes-y-and-o-3078176)

With a more powerful system these would be handled by specifying "I want the indefinite article of the noun", and that would solve "a grape" vs. "an apple".
But such systems are not always available.
This can be a "poor man's" alternative.

In order to be able to do this "the smart way" one would need to know AFTER FORMAT that a certain part of the message comes from a placeholder.
That way we don't "repair" text that was put there (intentionally) by the translator, but only text that was "created" by placeholders.
This is why I named this "local" text transformation / contextual changes, as opposed to issue #38, which seems to apply indiscriminately on the whole (final) message.
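
As a rough illustration of the idea above, here is a minimal sketch of a "local" a/an fix that only touches an article sitting directly before placeholder output. The `Part` shape and the function names are assumptions made for this example, not anything defined by MessageFormat 2.0, and the vowel check looks at letters only, so it gets cases like "an hour" or "a user" wrong.

```typescript
// Hypothetical part shape: MessageFormat 2.0 does not define this.
type Part =
  | { type: "literal"; value: string }   // text written by the translator
  | { type: "dynamic"; value: string };  // text produced by a placeholder

// Naive letter-based check; real English depends on pronunciation.
const startsWithVowelLetter = (s: string): boolean => /^[aeiou]/i.test(s);

function fixIndefiniteArticle(parts: Part[]): Part[] {
  return parts.map((part, i): Part => {
    const next = parts[i + 1];
    // Only consider a literal that is immediately followed by placeholder output.
    if (part.type !== "literal" || next === undefined || next.type !== "dynamic") {
      return part;
    }
    if (!startsWithVowelLetter(next.value)) return part;
    // Rewrite a standalone trailing "a " (or "A ") to "an " (or "An ").
    return { type: "literal", value: part.value.replace(/\b([Aa]) $/, "$1n ") };
  });
}

const render = (parts: Part[]): string => parts.map((p) => p.value).join("");

const fixed = render(
  fixIndefiniteArticle([
    { type: "literal", value: "I received a " },
    { type: "dynamic", value: "apple" },
  ]),
);
```

Because the rewrite is keyed on the literal/dynamic boundary, an "a" that the translator wrote elsewhere inside a literal is left alone.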

The exact transformations are probably not part of the standard itself (but might be "registered"?)

I think that this can be implemented on top of the already proposed format-to-parts feature (issue #41).
So I don't think that there is anything special to do with the data model if we already support that feature.

But better to list this explicitly.

@mihnita
Collaborator Author

mihnita commented Mar 24, 2021

@eemeli
Collaborator

eemeli commented May 2, 2021

I think the right thing to do in MessageFormat 2.0 would be to ensure that it's possible to implement transformations on formatted parts, while leaving out the actual implementation of such transformations.

One category of such transformations would be ones that are triggered explicitly at the boundary between two parts, e.g. for a {thing} to get formatted as "an example". In English this is relatively simple as the a/an difference is only based on morphology, but for other languages it's more complicated. For instance, take le {chose} in French; in many cases it's practically required to be able to give the chose value not just as a string, but also with its gender attached, so that the output could be "la covid" but "le corona virus".

In order to make that work, the formatted parts would need to look something like this:

[
  { type: 'literal', value: 'le ' },
  { type: 'dynamic', value: 'covid', meta: { count: 'one', gender: 'feminine' } }
]

Given that input and the French locale, it'd be relatively straightforward to detect the le as a generic definite article, and to replace it with la on account of the one count and feminine gender. For l' some heuristic or other would still be needed to determine if covid starts with a vowel sound in French.
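
A minimal sketch of that replacement step, assuming parts shaped like the example above. The `Meta` and `Part` types and the `agreeDefiniteArticle` name are hypothetical, the vowel check is a crude letter-based stand-in for the heuristic mentioned (it ignores cases such as a mute "h"), and plural counts are deliberately left untouched.

```typescript
// Hypothetical shapes modelled on the formatted parts shown above.
interface Meta {
  count?: string;
  gender?: "masculine" | "feminine";
}
type Part =
  | { type: "literal"; value: string }
  | { type: "dynamic"; value: string; meta?: Meta };

// Crude stand-in for "starts with a vowel sound in French".
const startsWithVowelSound = (word: string): boolean =>
  /^[aeiouàâéèêëîïôùû]/.test(word.toLowerCase());

function agreeDefiniteArticle(parts: Part[]): Part[] {
  return parts.map((part, i): Part => {
    const next = parts[i + 1];
    if (part.type !== "literal" || next === undefined || next.type !== "dynamic") {
      return part;
    }
    // Only act on a standalone generic article "le " at the end of the literal.
    if (!/(^|\s)le $/.test(part.value)) return part;
    // Plural handling is out of scope for this sketch.
    const count = next.meta?.count;
    if (count !== undefined && count !== "one") return part;

    let article = "le ";
    if (startsWithVowelSound(next.value)) article = "l'";
    else if (next.meta?.gender === "feminine") article = "la ";
    return { type: "literal", value: part.value.replace(/le $/, article) };
  });
}

const render = (parts: Part[]): string => parts.map((p) => p.value).join("");

const covid = render(
  agreeDefiniteArticle([
    { type: "literal", value: "le " },
    { type: "dynamic", value: "covid", meta: { count: "one", gender: "feminine" } },
  ]),
);
```

The gender comes entirely from the part's metadata; only the elision decision still needs to inspect the text.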

Now, working backwards from the formatted-parts output, how do we get the metadata attached to the part in the first place? One option would be a preceding dictionary-based transformation, which would indeed know that it's "la covid" and assign the metadata according to its own logic.

But should we do more? In particular, what do we do when "covid" is coming from a message reference? An obvious precedent to look at is Fluent and its message attributes, which cover exactly this use case. In the MF2 data model, such attributes would logically end up in the meta value of the relevant message, from which they could be included almost directly in the formatted part.

In other words, while we have support for all this in the data model, it underlines a need to account for this in the syntax as well.

@eemeli eemeli added the syntax Issues related with MF Syntax label May 2, 2021
@eemeli
Collaborator

eemeli commented May 2, 2021

Another thought: This connects with the previously-mentioned data registry for common selector types. Specifically, I'm reminded of this example #3 (comment) of how Mozilla uses Fluent selectors:

-brand-name = { $case ->
    [nominative] Firefox
    [genitive] Firefoksa
    ...
}

If we know that a case is expected to take values such as "nominative", "genitive", and so on, we could use a heuristic to set a value for the case metadata of the formatted part based on a usage pattern as above.

@aphillips
Member

I suspect that this is either (a) the province of a function or selector or (b) out of scope.

That is, one might implement what @eemeli mentions above to create a :case selector (by associating gender, count, and article agreement with the data) or using some other means of selecting between patterns. If such a function is not part of the default registry, then that would make it out of scope.

@mihnita Any thoughts on where to go next here?

@aphillips aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Jul 16, 2023
@mihnita
Collaborator Author

mihnita commented Jul 20, 2023

I think that various language features will be supported by selectors / formatters / post-processing.
Post processing might be done either on the "formatted to parts" result, or maybe earlier (internally in MF2, with the format-to-parts already having the transformations applied).

My expectation (hope?) is that we will see more and more ML at work, and the role of MF2 will be to provide hooks and hints.

Even apparently straightforward rules are in fact pretty tricky.
For example in French one would think that le/la followed by a word starting with a consonant becomes l’ (l apostrophe).

s / followed by a word starting with a consonant / followed by a word starting with a vowel /
(thanks Asmus)

But what about "la <img> sauvage" ("the wild <img>")?
If we look at the text, "sauvage" starts with a consonant, and we should keep "la" as is.
But what if the <img> (or an emoji) is intended as "text"? Imagine it's a bee ("abeille").
The text "la 🐝 sauvage" is really "l'abeille sauvage" (the wild bee).

@asmusf

asmusf commented Jul 20, 2023

For example in French one would think that le/la followed by a word starting with a consonant becomes l’ (l apostrophe). But what about "la <img> sauvage" ("the wild <img>")?

Don't you mean "vowel" here?

I agree with the wider point about inline, pronounceable images.

However, this would assume that you provide the alternative text so it directly substitutes for a noun or noun phrase and in a way that universally matches what a native user would choose...

@eemeli
Collaborator

eemeli commented Jul 21, 2023

I would think that specific language rules are outside our scope. My point regarding a selection on case is that with some selectors, a post-processor applying such heuristics or logic could find the selected key or keys to have significant value.

As a toy example, let's say that we have a language model or other system that's capable of "fixing" or "improving" slightly broken messages, but that running it is expensive. Now, let's suppose that we have a message like

match {$foo :person-gender}
when female {She did a thing}
when * {They did a thing}

and we call this with a $foo for which :person-gender would prefer male, but that's not available for this message in this locale, so we end up with the fallback un-gendered message.

Now, if we could include that selection preference and result in our output somehow, some pretty simple logic could see that we've ended up with a fallback message, and use the expensive system to ask for the "male" version of "They did a thing".

Alternatively, with the same message but without the semi-magical LLM, we could be formatting a whole set of messages together. Then, we could note that for some of them, the preferred gender message is not available, and it's using a fallback. If we wanted all the messages to correspond with each other, we could then re-format the gendered ones to be neutral, rather than presenting the user with a mix.
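
Both scenarios depend on the formatted output recording what each selector preferred and what it actually got. A minimal sketch of that bookkeeping follows; the `SelectionResult` and `FormattedMessage` shapes and both function names are assumptions for illustration, not part of any MessageFormat 2.0 API.

```typescript
// Hypothetical record of one selector's outcome.
interface SelectionResult {
  selector: string;   // e.g. ":person-gender"
  preferred: string;  // the key the selector would have liked, e.g. "male"
  selected: string;   // the key of the variant actually used, "*" for the catch-all
}

interface FormattedMessage {
  id: string;
  text: string;
  selections: SelectionResult[];
}

// True when some selector wanted a specific key but ended up on the catch-all.
const usedFallback = (m: FormattedMessage): boolean =>
  m.selections.some((s) => s.selected === "*" && s.preferred !== "*");

// If any message in a batch fell back, list the ones that did get a specific
// variant, so a caller could re-format them neutrally for consistency.
function idsToReformatNeutrally(batch: FormattedMessage[]): string[] {
  if (!batch.some(usedFallback)) return [];
  return batch
    .filter((m) => !usedFallback(m) && m.selections.some((s) => s.selected !== "*"))
    .map((m) => m.id);
}

const batch: FormattedMessage[] = [
  {
    id: "did-thing",
    text: "They did a thing",
    selections: [{ selector: ":person-gender", preferred: "male", selected: "*" }],
  },
  {
    id: "liked-thing",
    text: "She liked a thing",
    selections: [{ selector: ":person-gender", preferred: "female", selected: "female" }],
  },
];

const toReformat = idsToReformatNeutrally(batch);
```

The same `usedFallback` check is what would decide whether to hand a message to the expensive "fixing" system in the first scenario.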

@stasm
Collaborator

stasm commented Jul 22, 2023

There are a few requests bundled here. Let me try to untangle them:

(I'm using "should" to define the requirements, not voice my opinion.)

The formatted parts should allow their origin to be identified.

I.e. whether they were literal text or a product of a placeholder.

We don't have a concrete design of the interface of formatted parts (#41), and I think we also prefer to stay agnostic wrt. the exact data shapes, but I think it's safe to assume that all formatToParts implementations will want to do this.

Recommendation: add this to the requirements for formatToParts implementations.

The formatted parts should be decorated with grammatical data.

Or more broadly: data relevant to the formatting of the placeholder, to allow further context-aware transformations.

I don't remember if we explicitly discussed this. My position is that it would be great to allow formatToParts implementations to extend parts with such data. Exact data shapes are most likely implementation-specific. An example of what this could look like, based on my implementation: message2//grammatical_agreement.ts.

Recommendation: add this to the requirements for formatToParts implementations.

It should be possible to run text transformers on formatted messages.

The logic of text transformers is outside the scope of MF2, but we'd like to make sure it's possible to run text transformers on formatted parts.

This is satisfied to some extent just by formatToParts existing. Transformation layers can be built on top of formatToParts, just like any other extra logic.

We can also consider ways of plugging transformers into the formatting runtime, similar to how we provide extension points for custom formatting and matching functions.

I think this could be the same mechanism for both "local" transformations and pattern-level ones, which is #38.

Recommendation: Discuss a new kind of extension point for pattern-level transformers, possibly defined in the registry.

The selected variant should be decorated with its keys.

This is similar to the second requirement above, but for the whole selected variant. For example, formatToParts could not only return an iterator over formatted parts, but also other data:

interface MessageFormat {
    formatToParts(): FormattedVariant;
}

interface FormattedVariant {
    parts: Iterator<FormattedPart>;
    keys: Array<string>;
}

While a lot of the same information can probably be extracted from the decorated formatted parts, I think decorating the variant itself can be particularly useful for variants with no placeholders. E.g. when 1 {One thing}.

Recommendation: add this to the requirements for formatToParts implementations.

@macchiati
Member

macchiati commented Jul 22, 2023

I think it helps focus our thinking if we have several examples of any principle. For example:

  1. We need to know boundaries and origin information, recursively. These are for placeholders, but also placeholders within placeholders.

Examples:

  • Embolden the month in a message that contains a date:
    You last visited on March 3, 2022.

  • Line up messages visually in a column, on the right side of the integer part of numbers (whether integers or fractions)

666.3 credit
  4 debit

...

  • Perform orthographic fixes on insertion boundaries
    You bought a apple.
    (Internally You bought a {apple})

    You bought an apple.

Note that for placeholders we are dependent on the formatting functions to supply the information.
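
The first example (emboldening the month) can already be sketched for a standalone date with the existing `Intl.DateTimeFormat.prototype.formatToParts` API, which reports a type for each piece of the formatted date. The `emboldenMonth` helper and the `<b>` wrapping below are assumptions made for illustration; the sketch does not show how such nested parts would surface inside a MessageFormat 2.0 message.

```typescript
// Wrap the month of a formatted date in <b>…</b>, using the part types that
// Intl.DateTimeFormat reports for each piece of the output.
function emboldenMonth(date: Date, locale: string): string {
  const formatter = new Intl.DateTimeFormat(locale, {
    year: "numeric",
    month: "long",
    day: "numeric",
    timeZone: "UTC", // fixed zone so the result does not depend on the host
  });
  return formatter
    .formatToParts(date)
    .map((part) => (part.type === "month" ? `<b>${part.value}</b>` : part.value))
    .join("");
}

const visited = new Date(Date.UTC(2022, 2, 3)); // March 3, 2022 (months are 0-based)
const message = `You last visited on ${emboldenMonth(visited, "en-US")}.`;
```

This works only because the date formatter itself supplies the boundaries, which is the dependency noted above.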

Visually, something like:
[Image: Screenshot 2023-07-22 at 10 33 33]

@aphillips
Member

Thanks @macchiati for the lovely illustration (which mirrors my thinking regarding format-to-parts behavior).

I'm removing resolve-candidate for now, but observe that this issue is somewhat ill-defined in what it is asking for. @mihnita can you clarify or consider breaking up the issue into specific requests for syntax, formatting, data-model, or registry changes?

@aphillips aphillips removed the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Jul 29, 2023
@mihnita
Collaborator Author

mihnita commented Jul 31, 2023

can you clarify or consider breaking up the issue into specific requests for syntax, formatting, data-model, or registry changes?

I think it is a bit early to split into syntax / formatting / etc.
That is already close to a solution.
And I don't think we have an agreement (yet)

I think the items Stas added are good.

What I am thinking might help is to add another flag to the placeholders (propagated all the way into the "format to parts" output) saying "this is not rendered as visible / audible text" or "this is decoration only" or something like that.

Meaning that post-processing steps can ignore that whole "set of parts" because it is not visible.

For example in "a {+b}{+i}{fruit}{-i}{-b}" the bold and italic are not "rendered as text", so they don't affect the linguistic processing of "a" (to become "an" or not). From text processing perspective they behave as if they are not there.
As opposed to an image.
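
A minimal sketch of how a post-processing step might use such a flag to look past decoration-only parts. The `decorative` field, the `markup` part type, and the `nextVisibleText` helper are all hypothetical names chosen for this example.

```typescript
// Hypothetical part shapes; "decorative" is the proposed flag.
type Part =
  | { type: "literal"; value: string }
  | { type: "dynamic"; value: string }
  | { type: "markup"; name: string; kind: "open" | "close" | "standalone"; decorative: boolean };

// Find the text that linguistically follows position `from`, skipping parts
// flagged as decoration only. A non-decorative markup part (e.g. an image)
// stops the search, because its rendered content cannot be inspected here.
function nextVisibleText(parts: Part[], from: number): string | undefined {
  for (const part of parts.slice(from)) {
    if (part.type === "markup") {
      if (part.decorative) continue;
      return undefined;
    }
    if (part.value.length > 0) return part.value;
  }
  return undefined;
}

// Parts for "a {+b}{+i}{fruit}{-i}{-b}" with fruit = "apple".
const parts: Part[] = [
  { type: "literal", value: "a " },
  { type: "markup", name: "b", kind: "open", decorative: true },
  { type: "markup", name: "i", kind: "open", decorative: true },
  { type: "dynamic", value: "apple" },
  { type: "markup", name: "i", kind: "close", decorative: true },
  { type: "markup", name: "b", kind: "close", decorative: true },
];

const following = nextVisibleText(parts, 1);
```

With the bold and italic parts skipped, an a/an step sees "apple" directly after "a "; with an image in between it sees nothing and can leave the article alone.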

IF there is agreement on "yes, such a flag would be good", then we get to debate if it belongs in the placeholder or registry, syntax, etc.
I don't think it belongs in registry, because we don't want / need the "grammatical processing engine" to have to know about the registry, at runtime.

@eemeli
Collaborator

eemeli commented Aug 1, 2023

I think the topic of this issue is pretty clear in the original post (formatted parts should identify which placeholder, if any, they come from), but the discussion since then has wandered quite a bit.

It would be useful to record any/all such wanderings as separate issues, so that they don't get lost if/when this issue is closed by addressing the original suggestion.

@aphillips
Member

I can't tell what this issue is about (what is being specifically proposed). I agree with @eemeli's observation about formatted parts identifying their placeholder. Does that mean we've addressed this? If not, can we make a specific issue or issues for what is needed? Thanks.

@aphillips aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Dec 4, 2023