"Local" text transformations (contextual changes) #160
Comments
I think the right thing to do in MessageFormat 2.0 would be to ensure that it's possible to implement transformations on formatted parts, while leaving out the actual implementation of such transformations. One category of such transformations would be ones that are triggered explicitly at the boundary between two parts, e.g. for In order to make that work, the formatted parts would need to look something like this:
[
  { type: 'literal', value: 'le ' },
  { type: 'dynamic', value: 'covid', meta: { count: 'one', gender: 'feminine' } }
]
Given that input and the French locale, it'd be relatively straightforward to detect the
Now, working backwards from the formatted-parts output, how do we get the metadata attached to the part in the first place? One option would be a preceding dictionary-based transformation, which would indeed know that it's "la covid" and assign the metadata according to its own logic. But should we do more? In particular, what do we do when "covid" is coming from a message reference? An obvious precedent to look at is Fluent and its message attributes, which cover exactly this use case. In the MF2 data model, such attributes would logically end up in the In other words, while we have support for all this in the data model, this underlines a need to account for it in the syntax as well. |
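To make the idea concrete, here is a minimal sketch of one such boundary transformation, assuming the part shape shown above. The `FormattedPart` interface and the `agreeFrenchArticle` function are hypothetical names for illustration, not a proposed MF2 API, and the rule covers only a singular `le`/`la` directly before a dynamic part.

```typescript
// Hypothetical part shape, modelled on the example above; not a settled MF2 interface.
interface PartMeta {
  count?: string;
  gender?: string;
}

interface FormattedPart {
  type: 'literal' | 'dynamic';
  value: string;
  meta?: PartMeta;
}

// Illustrative French-only rule: make a trailing "le"/"la" in a literal part
// agree with the gender metadata of the dynamic part that follows it.
function agreeFrenchArticle(parts: FormattedPart[]): FormattedPart[] {
  return parts.map((part, i) => {
    const next = parts[i + 1];
    if (part.type !== 'literal' || !next || next.type !== 'dynamic') return part;

    const meta = next.meta;
    if (!meta || meta.count !== 'one') return part;

    const wanted =
      meta.gender === 'feminine' ? 'la' : meta.gender === 'masculine' ? 'le' : undefined;
    if (!wanted) return part;

    // Only touch an article that sits at the very end of the literal.
    const value = part.value.replace(
      /(^|\s)(le|la) $/i,
      (_match: string, lead: string, article: string) => {
        const fixed = article.charAt(0) === 'L' ? 'L' + wanted.slice(1) : wanted;
        return `${lead}${fixed} `;
      }
    );
    return { ...part, value };
  });
}

const fixedParts = agreeFrenchArticle([
  { type: 'literal', value: 'le ' },
  { type: 'dynamic', value: 'covid', meta: { count: 'one', gender: 'feminine' } },
]);
console.log(fixedParts.map((p) => p.value).join('')); // "la covid"
```

The transformation never inspects the word "covid" itself; it relies entirely on the metadata, which is why the question of where that metadata comes from matters.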
Another thought: This connects with the previously-mentioned data registry for common selector types. Specifically, I'm reminded of this example #3 (comment) of how Mozilla uses Fluent selectors:
If we know that a |
I suspect that this is either (a) the province of a function or selector or (b) out of scope. That is, one might implement what @eemeli mentions above to create a @mihnita Any thoughts on where to go next here? |
For example in French one would think that le/la followed by a word starting with a consonant becomes l’ (l apostrophe). But what about "la sauvage" ("the wild ")? Don't you mean "vowel" here? I agree with the wider point about inline, pronounceable images. However, this would assume that you provide the alternative text so it directly substitutes for a noun or noun phrase and in a way that universally matches what a native user would choose... |
I would think that specific language rules are outside our scope. My point regarding a selection on case is that with some selectors, a post-processor applying such heuristics or logic could find the selected key or keys to have significant value. As a toy example, let's say that we have a language model or other system that's capable of "fixing" or "improving" slightly broken messages, but that running it is expensive. Now, let's suppose that we have a message like
and we call this with a Now, if we could include that selection preference and result in our output somehow, some pretty simple logic could see that we've ended up with a fallback message, and use the expensive system to ask for the "male" version of "They did a thing". Alternatively, with the same message but without the semi-magical LLM, we could be formatting a whole set of messages together. Then, we could note that for some of them, the preferred gender message is not available, and it's using a fallback. If we wanted all the messages to correspond with each other, we could then re-format the gendered ones to be neutral, rather than presenting the user with a mix. |
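A rough sketch of the "pretty simple logic" described above, assuming the formatter reports, for each selector, the requested value and the key of the variant that was actually chosen. Every name here (`SelectionRecord`, `FormattedMessage`, `usedFallback`, `messagesToNeutralize`) is hypothetical, as is the use of `*` as the catch-all key.

```typescript
// Hypothetical result shape; nothing here is defined by MF2.
interface SelectionRecord {
  selector: string;   // name of the selector argument, e.g. "gender"
  requested: string;  // value supplied by the caller
  matchedKey: string; // key of the variant that was actually selected
}

interface FormattedMessage {
  text: string;
  selections: SelectionRecord[];
}

const CATCH_ALL = '*'; // assumed catch-all key

// True when at least one selector asked for a specific value but got the catch-all variant.
function usedFallback(msg: FormattedMessage): boolean {
  return msg.selections.some(
    (s) => s.matchedKey === CATCH_ALL && s.requested !== CATCH_ALL
  );
}

// Batch scenario: if some messages fell back, return the ones that did NOT,
// i.e. the gendered messages a caller might re-format neutrally for consistency.
function messagesToNeutralize(batch: FormattedMessage[]): FormattedMessage[] {
  if (!batch.some(usedFallback)) return [];
  return batch.filter((m) => m.selections.length > 0 && !usedFallback(m));
}
```

With this information in the output, the expensive "fixing" system only needs to run on messages for which `usedFallback` returns true.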
There are a few requests bundled here. Let me try to untangle them:
The formatted parts should allow their origin to be identified. I.e. whether they were literal text or a product of a placeholder. We don't have a concrete design of the interface of formatted parts (#41), and I think we also prefer to stay agnostic wrt. the exact data shapes, but I think it's safe to assume that all
The formatted parts should be decorated with grammatical data. Or more broadly: data relevant to the formatting of the placeholder, to allow further context-aware transformations. I don't remember if we explicitly discussed this. My position is that it would be great to allow
It should be possible to run text transformers on formatted messages. The logic of text transformers is outside the scope of MF2, but we'd like to make sure it's possible to run text transformers on formatted parts. This is satisfied to some extent just by We can also consider ways of plugging transformers into the formatting runtime, similar to how we provide extension points for custom formatting and matching functions. I think this could be the same mechanism for both "local" transformations and pattern-level ones, which is #38.
The selected variant should be decorated with its keys. This is similar to the second requirement above, but for the whole selected variant. For example,
While a lot of the same information can probably be extracted from the decorated formatted parts, I think decorating the variant itself can be particularly useful for variants with no placeholders. E.g.
|
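As an illustration of the third request (plugging transformers into the formatting runtime), here is a minimal sketch of what such an extension point could look like. The `PartTransformer` type and `runTransformers` function are assumptions made for this example; MF2 defines no such interface, and the toy transformer exists only to show the plumbing.

```typescript
// Hypothetical part shape and transformer signature; not part of MF2.
interface FormattedPart {
  type: 'literal' | 'dynamic';
  value: string;
}

type PartTransformer = (parts: FormattedPart[], locale: string) => FormattedPart[];

// Apply each registered transformer in order to the output of the previous one.
function runTransformers(
  parts: FormattedPart[],
  locale: string,
  transformers: PartTransformer[]
): FormattedPart[] {
  return transformers.reduce<FormattedPart[]>(
    (acc, transform) => transform(acc, locale),
    parts
  );
}

// Toy transformer: upper-cases placeholder output and leaves literal text alone.
const shoutPlaceholders: PartTransformer = (parts, _locale) =>
  parts.map((p) => (p.type === 'dynamic' ? { ...p, value: p.value.toUpperCase() } : p));
```

Because every transformer receives parts rather than a flat string, the same mechanism could serve both "local" transformations and pattern-level ones.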
I think it helps focus our thinking if we have several examples of any principle. For example:
Examples:
...
Note that for placeholders we are dependent on the formatting functions to supply the information. |
Thanks @macchiati for the lovely illustration (which mirrors my thinking regarding format-to-parts behavior). I'm removing |
I think it is a bit early to split into syntax / formatting / etc. I think the items Stas added are good. What I am thinking might help is to add another flag to the placeholders (propagated all the way into the "format to parts" output) saying "this is not rendered as visible / audible text" or "this is decoration only" or something like that. Meaning that post-processing steps can ignore that whole "set of parts" because it is not visible. For example in IF there is agreement on "yes, such a flag would be good", then we get to debate if it belongs in the placeholder or registry, syntax, etc. |
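A small sketch of how a post-processing step might use such a flag, assuming it surfaces on each formatted part as a boolean. The `decorative` field name, the `markup` part type, and the `nextVisiblePart` helper are all hypothetical; the sample message reuses the link example from the original post.

```typescript
// Hypothetical part shape with a "decoration only" flag; not a settled MF2 interface.
interface FormattedPart {
  type: 'literal' | 'dynamic' | 'markup';
  value: string;
  decorative?: boolean; // true when the part produces no visible or audible text
}

// Find the first part after `index` that actually renders as text,
// skipping decoration-only parts such as opening and closing tags.
function nextVisiblePart(parts: FormattedPart[], index: number): FormattedPart | undefined {
  for (const part of parts.slice(index + 1)) {
    if (!part.decorative && part.value.length > 0) return part;
  }
  return undefined;
}

const linkParts: FormattedPart[] = [
  { type: 'literal', value: 'I received a ' },
  { type: 'markup', value: '<a href="link">', decorative: true },
  { type: 'dynamic', value: 'apple' },
  { type: 'markup', value: '</a>', decorative: true },
];

console.log(nextVisiblePart(linkParts, 0)?.value); // "apple"
```

A transformation that adjusts the article can then look at the next visible part rather than the next part, so the markup in between no longer gets in the way.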
I think the topic of this issue is pretty clear in the original post (formatted parts should identify which placeholder, if any, they come from), but the discussion since then has wandered quite a bit. It would be useful to record any/all such wanderings as separate issues, so that they don't get lost if/when this issue is closed by addressing the original suggestion. |
I can't tell what this issue is about (what is being specifically proposed). I agree with @eemeli's observation about formatted parts identifying their placeholder. Does that mean we've addressed this? If not, can we make a specific issue or issues for what it needed? Thanks. |
After placeholders are replaced with the final text the result should in some cases be "adjusted to match"
Common English example:
"I received a {item}"
The "a" there should become "an" for words that start with a vowel.
Might even want to be fancier, imagine handling
"I received a <a href="link">{item}</a>"
, where the "a" should account for a vowel pretty far away, inside an HTML tag.
There are similar needs in other languages, for example in French la and le become l’ before a word that starts with a vowel. See https://en.wikipedia.org/wiki/Elision_(French)
"Contractions" similar to the French one are also common in Italian, Spanish, Portuguese.
And this kind of "local transformation" might also help with Belarusian, Korean, and others.
(for Spanish see https://www.thoughtco.com/pronunciation-based-changes-y-and-o-3078176)
With a more powerful system these would be handled by specifying "I want the indefinite article of the noun", and that would solve "a grape" vs "an apple".
But such systems are not always available.
This can be a "poor man's" alternative.
In order to be able to do this "the smart way" one would need to know AFTER FORMAT that a certain part of the message comes from a placeholder.
That way we don't "repair" text that was put there (intentionally) by the translator, but only text that was "created" by placeholders.
This is why I named this "local" text transformation / contextual changes, as opposed to issue #38, which seems to apply indiscriminately on the whole (final) message.
The exact transformations are probably not part of the standard itself (but might be "registered"?).
I think that this can be implemented on top of the already proposed format-to-parts feature (issue #41).
So I don't think that there is anything special to do with the data model if we already support that feature.
But better to list this explicitly.
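To illustrate the English case, here is a minimal sketch of a "local" transformation that runs on format-to-parts output. The `FormattedPart` shape and `fixIndefiniteArticle` are hypothetical names, and the vowel test is deliberately crude: it looks at spelling only, so it would get words such as "hour" or "university" wrong.

```typescript
// Hypothetical part shape: `type` records whether the text came from the
// translator's literal text or from a placeholder. Not a settled MF2 interface.
interface FormattedPart {
  type: 'literal' | 'dynamic';
  value: string;
}

// Adjust a trailing "a"/"an" in a literal part, but only when the text that
// follows it was produced by a placeholder. Text written by the translator,
// including any article inside a single literal, is left untouched.
function fixIndefiniteArticle(parts: FormattedPart[]): FormattedPart[] {
  return parts.map((part, i) => {
    const next = parts[i + 1];
    if (part.type !== 'literal' || !next || next.type !== 'dynamic') return part;

    const startsWithVowel = /^[aeiou]/i.test(next.value); // spelling-based heuristic
    const value = part.value.replace(/\b(a|an) $/i, (_match: string, article: string) => {
      const base = startsWithVowel ? 'an' : 'a';
      const fixed = article.charAt(0) === 'A' ? 'A' + base.slice(1) : base;
      return fixed + ' ';
    });
    return { ...part, value };
  });
}

const received = fixIndefiniteArticle([
  { type: 'literal', value: 'I received a ' },
  { type: 'dynamic', value: 'apple' },
]);
console.log(received.map((p) => p.value).join('')); // "I received an apple"
```

The key point is the `type` check: without knowing after formatting which parts came from placeholders, the function could not tell a translator's deliberate wording from generated text.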