-
Notifications
You must be signed in to change notification settings - Fork 9.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unescaped HTML in translation text #6945
Comments
This is already a problem in some of our strings. The translators usually kept the tags unchanged, but sometimes treated things like attribute values as you would the |
Apparently the "proper" way to do this is to have the html inserted as an ICU text replacement. This makes it guaranteed that the translators won't translate them. It's also nice to give an example to the translators of what will be in the replacement...in this case, since it's just a literal string, the example is exactly what will be put in the replacement. We didn't see another easy way to be able to mark within a string "don't translate this section", at least with the format we've chosen (the XML form has a little more flexibility, but come on). The replacement approach has pretty rotten UX in our current i18n setup, however, since the substrings are just literals, not used as markup, and they aren't dynamic. The string and the html substring end up separated and it becomes difficult to read. We could:
|
This was my thought as well.
If there are only a few though, this seems far simpler. We could add something in our checker that makes sure anything with |
There are a decent number and they are multiplying. We also have a few other substrings we don't want translated, though I forget what we decided about localizing things like "First Meaningful Paint". I'm also not sure how effective adding a "Do not translate" line to the description is...do the translators always catch it and understand exactly what we don't want translated? On the other other hand, it'll be hard to always remember to put in double backticked hashtags or whatever to mark do-not-translate blocks, and it's difficult to remember to check for things we missed in the translations that come back for all the languages we don't speak :) |
+1 the idea of demarcation. I think leaving it in the descriptions will lead to sometimes translated, sometimes not, and it will be especially hard to keep track if we are having more and more html-in-text strings. We could also statically analyze if |
RFC @brendankenny @patrickhulce @hoten @paulirish I started to look into this more. And I think we need to really pursue this. We should stop sending text like URLs to be translated, it 1.) causes bugs #9000 2.) causes massive no-op round trips (axe 3.1) #7720 3.) is against the tc/ best practice 😉 Under the hood in GoogleLand we use the Chrome message.json format to convert into UIStrings = {
{
// Message with nothing special.
message: 'Example Message.',
description: 'Description of the message.',
},
{
// Message with a URL and markdown that shouldn't be sent for translation.
message: 'Example Message. $LINK_START$Learn more$LINK_END$.',
description: 'Description of the message.',
placeholders: {
link_start: {
content: '[',
},
link_end: {
content: '](https://developers.google.com/web/tools/lighthouse/audit/example)'
}
},
},
{
// Message with some HTML tags that we use as an example in our markdown.
message: 'Example Message, $HTML$ is some HTML.',
description: 'Description of the message.',
placeholders: {
html: {
content: '`<link rel=preload>`',
},
},
},
{
// Message with time in ms replacement.
message: 'Example Message, rendered in: $TIME_IN_MILLISECONDS$ ms.',
description: 'Description of the message.',
placeholders: {
time_in_milliseconds: {
content: '{timeInMs, number, milliseconds}',
},
},
},
{
// More canonical replacement message with time in ms replacement.
// But probably not what would work best for us? Or maybe?
message: 'Example Message, rendered in: $TIME_IN_MILLISECONDS$ ms.',
description: 'Description of the message.',
placeholders: {
time_in_milliseconds: {
content: '$1',
example: '53'
},
},
},
... This would allow us much more flexibility to change our placeholders, markdown syntax, and strings independent of tc/ and would allow us to properly utilize the placeholder syntax and stop sending non-translated words to tc/. Specifics:
|
Thanks for delineating all the cases and nailing this down @exterkamp! Overall I'm not sure why we need to add the overhead of explicit objects to our strings, but there's no doubt a lot you've thought about already that we haven't so hopefully questions will help :)
|
I think to me, the advantage of the objects is that we are closer to a chrome standard format for i18n and that it allows simple placeholder syntax. It isn't strictly required, we can hack together our own replacement conversion and have the strings be processed into this format right before going into en-US, but I think that would be messier (see "2."). I think that our strings have gotten sufficiently complex that we should start to think about encapsulating these as objects and converging on this Specific answers to your questions:
|
All good points! I think my hesitation comes down to it being a Chrome standard I'm less familiar with compared to ICU, so not a very good reason. I only have a total of 1 prior experience with localization anyway so that's not much of a corpus to go on either 😆 SGTM! Just a bummer it'll be a bit extra weight to add strings. |
I haven't looked at #9114 yet, but it does feel like we need to get to something like this in the end. I don't think it's worth it to keep special-casing stuff (and that's only going to get worse as we notice more things that need to be not translated, or handled in a special way, or taken out before going to tc and then restored, etc), but otoh we need the solution to be as robust (or more) as the current approach if we're going to make the move. The concerns I had left after working out the above examples with @exterkamp were
the good news is we already have a large existing corpus for testing (all our i18n strings), so any new solution won't be able to dodge any of these problems :) |
oh, more things:
|
|
I think I see what you're proposing now with the change in 74e11c1#diff-0b037964dfdf09e1e312fb06ffb7f8bcR28 description: {
message: '{link_start}Learn More!!!{link_end}. This audit took {milliseconds} ms.',
placeholders: {
link_start: '[->',
link_end: '](https://developers.google.com/web/tools/lighthouse/audits/minify-css)',
milliseconds: '{timeInMs, number, milliseconds}',
}
} If real ICU replacements are always inside google placeholders, it seems like it takes care of many issues:
We will have to figure out how the placeholder |
Just a note: Google translation will never* accidentally translate an ICU value i.e. *they claim |
Closing, since #9114 fixed this with |
These translated strings contain unescaped HTML which makes the translation linter unhappy.
[user-scalable="no"]
is not used in the<meta name="viewport">
element and the[maximum-scale]
attribute is not less than 5.<meta http-equiv="refresh">
[user-scalable="no"]
is used in the<meta name="viewport">
element or the[maximum-scale]
attribute is less than 5.<input type="image">
elements have[alt]
text<meta http-equiv="refresh">
<input type="image">
elements do not have[alt]
textThe text was updated successfully, but these errors were encountered: