Syntax Simplicity #48
Just a note: linguists don't handle the syntax directly, they use dedicated tools. We should be able to map our data model (independent of syntax) to the most common data model supported by localization tools. |
I disagree with this, and I think it would be a good topic to see how others are translating MessageFormat today. From my experience, we decided to train linguists directly with the raw syntax and it worked really well - the main reasons for this:
Keep in mind that MessageFormat has been around for a while and is still in a state that I would consider inadequate in terms of tooling (XLIFF is similar). The localization industry moves slowly, so if we expect to have tools for the new syntax, we know that adoption will most likely also be slow. I would prefer a simple, linguist-friendly syntax :) |
XLIFF has a lot more adoption than MessageFormat, which is not as widely used even though it has been around for a long time. The adoption problem is not about syntax. The bigger difficulty is the data model. This is the same as designing COBOL to sound very English-like in the hope that programming can be done by business people. Programming is hard because it is hard, not because the program does not read like English. Linguistically correct handling of translation is hard because human languages are messy.
The second part of the (non-)adoption is about politics. This is why I am (sometimes a bit too?) aggressive in rejecting complicated features that will prevent adoption.
Saying "translators should be able to handle it directly" means we give up on adoption from the get-go. We design it so it will not be adopted. |
I agree it's not all about the syntax - I think it's a combination of the following:
This is the part I'm not sure I can picture easily. We didn't talk too much about continuous localization yet, but I think that relates to file formats: if we keep the current state that MF supports (which from my experience is both simple in terms of syntax and file-format agnostic, since the syntax is stored in keys), this should align well with the simplicity side of the solution. If you think that to adopt the new syntax we will need all users to implement some sort of import/export filter, then I would be curious to understand where this fits in a continuous localization pipeline, and also why it is needed, because it does increase complexity.
I think we have to be careful with the word "politics" - it can easily be used the wrong way :-)
That's my point - if vendors are not interested, then what? Businesses use vendors... so for a new solution it is a dependency. I think it is important to either:
Yes, this is a reality of the localization landscape, which I think is mostly due to the vast complexity of languages and the lack of a solution that covers all needs.
I'm really not sure I understand what you mean here, are you saying we cannot keep the syntax as simple as it is today with MF? Because if we can, my experience is that this will help adoption, not the opposite. If we can't, then maybe you are seeing something I am missing? |
Did it achieve that? Isn't XLIFF more of a catalog format than a syntax? Maybe I don't really understand why XLIFF is that important besides trying to be an interchange format... At locize we support the most basic parts of XLIFF for import/export, but most of the time the files we get from other vendors are not at all compatible with our tooling... Our approach is rather different from the old ones: keeping things simple and close to the runtime format to enable real continuous localization (publishing usable runtime translation files to a CDN). |
I think the poor tooling is a separate issue. While I do agree having linguists learn the syntax can be beneficial, I wouldn't want it to be a primary goal of the format. There should be better tooling in general. |
Continuous localization has nothing to do with adoption.
Partially. If you choose the right subset of the XLIFF features then yes, it is supported.
I see it as more than a catalog. With proper XLIFF support one would be able to translate it. In a way, I think XLIFF 1.2 was too ambitious. |
Here is a pretty good document on XLIFF adoption: it touches on adoption, feature creep, and more. |
This is about the way current localization tools represent data. They are very much geared toward a 1:1 mapping.
Anything outside this view of the world breaks functionality. The source (English) is:
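For illustration, a message with the shape being described here might look like this in today's ICU MessageFormat syntax (a hypothetical example, not the one originally posted):

```
You have {count, plural,
    one {# new message}
    other {# new messages}}
```

English only needs the two variants above, but Russian needs four (one, few, many, other), so a tool built around a strict 1:1 source-to-target mapping has nowhere to put the extra variants.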
Russian needs to send back 4 messages, and things break. OK, so "flatten" all messages into 1 and allow translators to edit things directly (your solution, and Fluent's, for more complex messages).
These are data model incompatibilities. |
This breaks a lot of the useful features the vendors built into their tooling. It is like saying "make things simple enough so that developers don't need an IDE, they can use a simple text editor" and "a linter / compiler will tell them if there are problems". So now they have lost not only syntax highlighting, but also refactoring, suggestions ("IntelliSense" or whatever you want to call it), and more. This just pushes the CAT tools back to the "dumb text editor" stage. That also means lower linguistic quality (forcing translators to focus on technical aspects). Imagine someone tells you: "starting tomorrow you can't touch Eclipse / IntelliJ / Visual Studio, you develop software in Notepad". |
I agree that we should leverage power of CAT tools, but if we're designing for the Open Web, and something that we aim to suggest for a standard, then the reverse of your claim is also true: Imagine if someone told you that in order to write in a new programming language for the Web you need to use Eclipse, because there's no way for you to work with it in anything else. The principle of least power is described by W3C - https://www.w3.org/2001/tag/doc/leastPower-2006-01-23.html - and I believe we should aim for our system to work with notepad. Then, if you plug some tooling, it should work better. And if you use a CAT tool it should work amazing. And we should make it easy to develop CAT tool integrations, command line tooling, etc. But I'd like to avoid a future in which our outcome is basically unreadable/unwritable without sophisticated CAT tools. |
I think I did not touch all points:
This is where there is friction. Programming languages are flexible, and value flexibility. TMSes are a lot less flexible.
This I don't understand. It is like asking "why an image editor needs a filter for GIF files".
The current MF syntax is not well supported. It got some adoption, but not really enough for something that has been around for about 15 years. Or rather, it got developer adoption, but not TMS support. And yes, I am saying that we can't keep the syntax as simple as it is today. |
I 100% agree, if this means "engineers can use Notepad". The main thing is where we draw the line. The reality is that most designers are unable to properly write HTML + CSS without tools. |
Professional translators, probably not.
I suspect you're right.
Here's the important point I'm trying to make - we're not designing it for ourselves. The Web, in its ideal form, is intended to lower the entry barrier and allow anyone with a notepad to write HTML, then add some CSS and maybe some JS, host it under their IP, and anyone else in the world, with any browser, should be able to open it. I know we can argue how far we're diverging from that ideal, with a small number of big players owning a substantial amount of traffic these days and playing the golden-cage game with users, but open standards are intended, as far as I'm aware, to target the Notepad user. So, just like HTML, JS and CSS cannot be made more convenient for Google, Apple, Facebook and Twitter at the cost of raising the entry barrier, I believe we should aim for the Notepad user to be able to add L10n to the HTML/JS/CSS model. It's important to me to stress that I'm not advocating that we set our success criteria for successful, continuous localization at scale around the "user with a notepad" scenario. |
Agree with @zbraniecki in
I think we must design for the simplest use case and for the user that has the fewest resources, while keeping in mind all the tooling and scaling needed in other cases. IMHO this mindset must be one of the drivers. It has been since the beginning, when we only wanted to bring MF to Browserland. Now that we are trying to go a little deeper and wider, this down-scale or up-scale design tension is natural, but we must try to balance it |
I don't think anyone advocated "more convenient for 'the big X' players" or "something that CAN'T be edited in Notepad". If we make it possible for programmers to create this kind of message in Notepad, is that good enough? The thing that "bothers" me is requirements like "especially by non-technical users like linguists" or "editable by translators without any tools", without defining what "translators" are. It is like arguing ".svg is open, because it is plain text, and is editable by designers". "Notepad" is a red herring.
"Professional translators, probably not."
Maybe you touched here on something that probably we all know, and probably should make more clear. We most likely "color" the meaning based on our own experience. But there is no such thing as "a translator". There is a huge difference between a techie who decides to translate an open source project (because it is a nice tool and they want to give back) and a translator who pays the bills from translation work. We should be able to support both cases. But if we design something that is not "localization tool friendly" then we get no adoption. |
I think we all agree that simplicity is preferred over complexity. I would like to call out a few things for the sake of alignment:
Now I completely understand that by adding new features there will be trade-offs on simplicity. I do believe that if we hit a level of complexity that makes it non-Notepad-friendly, we will clearly have to have solutions around:
I don't have these answers, but all I know is that the simpler the solution, the fewer questions we will have to answer. |
I think this is related to the design principles discussion which I'd like to start in #50. In particular, the question about how "computational" vs. how "manual" we want the standard to be. On one end of the spectrum we have a data model which encodes a wide range of linguistic features, allowing grammatically correct interpolations with proper spelling. Every noun, adjective and verb is defined somewhere else with all possible inflections, plurals, capitalizations and articles.
This model comes with inherent complexity. It potentially allows a lot of new and interesting features, but its compatibility with existing data models is unknown. On the other end ("manual") of the spectrum, the data model is mostly a simple store of messages written out as full sentences. When you need a new variant, you create a new message. Some flexibility is introduced by means of many-to-many relationships, like MF's select and plural arguments.
This model could have a fairly simple data model and syntax. Interestingly, it "supports" a lot of linguistic complexity by means of plain text, just not in a way that allows new messages to be produced computationally. (Some messages will still require dynamic features, like variable interpolation, so things will never be as simple as just plain strings.) The computational model is great for constructing sentences from smaller pieces of highly dynamic data, when it's impossible to compile a list of all possible combinations; a good example is voice assistants like Siri. It's also good for enforcing consistency between translations. The manual model works well for UI where most messages are static. It's simple to translate, and it's simple to translate correctly. OTOH, it leads to many more messages, and consistency needs to be enforced through external tooling like translation memory. But ultimately, it's also more likely to be compatible with the lowest common denominator of data models currently used in the LSP industry. |
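As a rough sketch of the "manual" end (a hypothetical example), every variant is still a full sentence written out by a person, and the syntax only selects between them, using the select argument that exists in MF today:

```
{gender, select,
    female {{name} added you to her circle}
    male {{name} added you to his circle}
    other {{name} added you to their circle}}
```

No grammar is computed; the linguistic knowledge lives entirely in the plain-text variants.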
Of course you can :-)
Then you've been lucky to have relatively simple messages.
Absolutely!
I am not really sure how it matters. |
Anyway... I am absolutely not arguing that we should design something complicated. And I agree that we want something that developers can edit directly with a simple text editor. But we must be able to export it to standard TMS tools. |
We have seen the same situations, but we typically ask developers to simplify their messages. There is a thin line between good use of MessageFormat and usage that can make it impossible to localize. This is typically resolved by having the ability to have a dialogue between linguists and engineers, and also by having training material available for engineers.
Then we will have to figure out why MessageFormat is still not sufficiently supported today and how to remediate this situation.
I think it does matter, because if we want to make a solution available at scale, we cannot expect all companies to build custom solutions to support it. Now, we could expect the TMS to handle any sort of "conversion" if we need to, but my question about adoption remains.
And as I mentioned during our last call, the more we discuss, the more I wonder which new scenarios should be supported by the new syntax. Most linguistic problems I have seen so far seem very tricky to support from a syntax perspective. For example, the indefinite article (a/an) in English could probably be an easy one to add to the syntax; the rule is relatively simple. Now if you try the same in French (le/la/les), you will need to know the gender and the plural form of the target word or group of words. Does this mean that the syntax would propose a data model for this, or would it also provide a "dictionary"? And if it's only the data model, do we know how many people will want to use such features, what common problems this will solve, and how much it would cost for a company to solve this at scale? Imagine hundreds of thousands of geographic entities that need to have this data for one language. How many companies can afford this? I think we need a backlog :) |
At times there is no way to simplify the message, the structure of the language is complex.
This is why I keep saying that we need a standard mapping to XLIFF. I think that the rest of the message (a/an; la/le/les/l'; etc) belongs in a different issue? I agree that these are hard problems, but they are not about syntax. |
If you have an example, maybe it would help picture this a bit better? The way I picture it, the syntax should be used at the sentence level, since most TMS do segmentation at that level as well. The most extreme (legitimate) case I can imagine would be a sentence with a variable that requires gender (typically a user), and 2 other variables with plurals. But how many times does this scenario occur? And, to be honest, I don't think any current TMS I have seen could help with this. The solutions I could see around these types of extreme scenarios would be:
But then again, if adoption is our priority, I know which one I would prefer, especially if this scenario accounts for 0.001% (guesstimate here) of cases.
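For concreteness, a hypothetical message of the "extreme" shape described above (one gendered variable plus two plural variables), written with today's ICU select/plural nesting:

```
{userGender, select,
    female {She shared {photoCount, plural, one {# photo} other {# photos}}
            in {albumCount, plural, one {# album} other {# albums}}}
    male   {He shared {photoCount, plural, one {# photo} other {# photos}}
            in {albumCount, plural, one {# album} other {# albums}}}
    other  {They shared {photoCount, plural, one {# photo} other {# photos}}
            in {albumCount, plural, one {# album} other {# albums}}}}
```

In a target language with four plural categories, spelling out every combination means up to 3 × 4 × 4 variants, which is roughly where sentence-level TMS segmentation stops being helpful.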
Maybe I'm not familiar enough with XLIFF to see how this would work - are you proposing that the base storage format would be directly XLIFF? Otherwise, this is where the continuous localization topic (conversion scripts?) comes in. And if this is what you are proposing, then we need to make sure that the XLIFF features you have in mind are also supported broadly by most TMS; otherwise, we are back to square one.
I was picturing that inflection could be solved using the syntax, which is why I brought this topic back here. Here is an example of what I had in mind:
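A hypothetical sketch along those lines (the `article` selector below is invented for illustration only; it is not existing MessageFormat syntax):

```
en: Book {hotel, article, vowel {an} other {a}} {hotel} room.
fr: Réservez une chambre {hotel, article, feminine {à la} other {au}} {hotel}.
```

The catch, as discussed above, is that the `vowel`/`feminine` keys are data about the word itself, so either the caller supplies them alongside the variable or the system needs some kind of dictionary.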
Maybe you have something else in mind? |
On 3/1/2020 11:19 AM, Nicolas Bouvrette wrote:
> The most extreme (legitimate) case I can imagine would be a sentence with a variable that requires gender (typically a user)...
Why do you say that natural gender is more of a problem than grammatical gender?
|
My presumption is that it's a more common problem, but I might be wrong. For example, it's very common for applications to have users, but maybe less common to have the user specify their gender (other than in very specific applications). I'd like to hear back from the group if they have examples where they require grammatical gender - I have a few in the space I work in, but we are not using ICU to solve these problems. Depending on the size of the dataset, solving these problems can be quite expensive, which is also why I presume they are less commonly solved as well. |
On 3/1/2020 7:03 PM, Nicolas Bouvrette wrote:
>> Why do you say that natural gender is more of a problem than grammatical gender?
> My presumption is that it's a more common problem, but I might be wrong. For example, it's very common for applications to have users, but maybe less common to have the user specify their gender (other than in very specific applications).
If you have a message where a parameter is a noun with grammatical gender, but the message also contains an adjective or article and you want the latter to track plurals, then they also need to track gender in many languages.
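A concrete sketch of that situation, using only the select argument that already exists in ICU MessageFormat: in French, the participle must agree with the grammatical gender of the noun parameter, so a correct message ends up looking like

```
{itemGender, select,
    feminine {{item} a été supprimée}
    other    {{item} a été supprimé}}
```

where `itemGender` ("la pièce jointe" is feminine, "le fichier" is masculine) has to be supplied alongside the data, which is exactly the kind of defensive design discussed just below.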
You may be able to avoid this in some cases by making the parameter cover the entire noun phrase, or by writing messages that attempt to circumvent this problem. But I thought that the current effort was partially intended to avoid such defensive designs.
On a more general level, I wonder whether it wouldn't be useful to have a reasonably exhaustive set of "standard examples" that we expect the syntax to cover. It's really not possible for anyone to understand all the requirements in the abstract, because not all of us have the same working experience of types of messages and types of languages. With a set of canonical examples (together with pseudo-translations into English that reveal the relevant constraints), it's much easier for anyone to convince themselves that a proposed syntax is only as simple as possible, and not more.
|
I think we are now at an important point in this project where we really should decide on its scope. When @romulocintra reached out to me, for me the point was bringing the best-fitting i18n format to the browser as something like
I understand this looks like a good time to add more features to the syntax, but as I currently experience this discussion, by adding too much we will kill the format for small businesses. Not every business will have the starting money to buy into a TMS or to build up an in-house solution. In my opinion, we should keep the scope of this project as small as possible:
I mean, we've got @zbraniecki from fluent, @longlho from react-intl and @eemeli from messageformat (and me from i18next, and I'm rather sure we could get @kazupon on board from vue-i18n). Just a guess, but with those JS libs we cover over 90% of the web/JS projects out there. I'm no linguist and have only an idea of how complex some languages can be, but I can at least say those are not too often a problem for the users of my lib. |
Fully agree that solving this is very complex, but this is why I keep asking "What is the size of the data?". As you mentioned, for small datasets there are ways around this:
- Include articles with the data
- Have full sentences that cover all the different datasets
- Change the sentence to make it simple (the old ":" trick before a list of items)
Of course, all these strategies do not scale well - but which companies out there have the big-data issues, and do we need to provide a full solution for them or simply the foundations to help them get there?
+100. I think documenting current issues with potential real use cases and solutions (can be pseudo-syntax) would help determine the priorities. I am tempted to start a new GitHub issue on this, but I wonder if this is the right tool for such an effort.
Fully agree on this as well - you can have the best solution, but if it's too complex, it will surely be used by a minority. I think everyone here wants to provide a solution at scale for common i18n problems. Now, do we know what those are? |
Hi all,
I would like to second Mihai's sentiment that the data model needs to be mappable to XLIFF.
You can argue that XLIFF doesn't have universal support but, based on my commercial localization (large scale) experience, it is at the core of all solutions that scale (solutions built by companies such as Microsoft, Oracle, IBM, etc.). The industry is indeed extremely fragmented and immature (ever growing, with an entry threshold close to zero), and most actors in the industry are incapable of using proper processes because they are in reactive mode or worse. Nevertheless, it doesn't mean that a standardisation effort should mimic the reactive approach of the chaotic majority that doesn't scale. Adopting proper XLIFF-compliant tooling is not too difficult actually, and as a buyer, the simplest thing to do is to produce a standard package and say in the RFP that the format is XLIFF 2.1 (XLIFF 2.0 backwards compliant); the bidders will comply because the market is extremely competitive.
Most of the services and tooling market leaders want the buyers to believe that standards are not supported and encourage them to submit all sorts of crazy non-internationalized source formats for direct localization, because it allows them to build insane labor-intensive solutions that will lock in the buyer with them indefinitely. But generally speaking, if a buyer says "jump", they will ask "how high and how many times?". So it should be the responsibility of the buyers' procurement (informed by technical champions) to say "I want you to translate these XLIFF 2 packages" instead of "Train for me people that will be able to directly edit this or that sort of syntax" or "Extract text for localization from PDFs because we don't know where the source content is."
There is strong OSS support for XLIFF (low-level libraries in Java, .NET), so the functionality doesn't need to be built from scratch, and it's especially easy to adopt the core (advanced functionality can be added later due to the modularity of the data model). All major localization providers are able to handle XLIFF 2 if required; they just don't advertise this capability because their business-level decision makers believe in lock-in rather than in standards-based interoperability. All SDL products (as of 2017) do support an XLIFF 2 roundtrip; other tools that support an XLIFF 2 roundtrip include memsource, xtm, Lionbridge's logoport disguised under many marketing whitelabels, OKAPI Ocelot (OSS), etc. Most of the leading tools don't support XLIFF extraction and merging though, and I believe it should be the buyer's concern to extract to and merge back from XLIFF, because they know their source format best. Here is an informative spec produced by GALA that helps people build proper extractors/mergers: https://galaglobal.github.io/TAPICC/T1/WG3/rs01/XLIFF-EM-BP-V1.0-rs01.xhtml. It also has code examples and counter-examples, so be sure to look at them.
Section 2.4, https://galaglobal.github.io/TAPICC/T1/WG3/rs01/XLIFF-EM-BP-V1.0-rs01.xhtml#Hints, will give you an idea of what sort of operations are allowed/supportable on inline codes during a localization roundtrip.
The basic idea of XLIFF is that of masking inline code/annotations/whatever artifacts devs fancied to put inside of their natural language content. The masking is done in a technology-agnostic way. You can extract any sort of syntax into XLIFF; even more, the same masking data model is not tied to XML only. The XLIFF OMOS TC at OASIS generalizes the XLIFF model, https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff-omos, and is working on JLIFF, https://github.com/oasis-tcs/xliff-omos-jliff, the JSON serialization of the same data model.
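A rough sketch (details simplified, not taken from the spec above) of what that masking can look like in XLIFF 2.0: the MessageFormat placeholder is carried as original data behind an inline `<ph>` code, so the tool never needs to understand the syntax itself:

```
<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0"
       srcLang="en" trgLang="fr">
  <file id="f1">
    <unit id="greeting">
      <originalData>
        <data id="d1">{userName}</data>
      </originalData>
      <segment>
        <source>Welcome back, <ph id="p1" dataRef="d1"/>!</source>
        <target>Bon retour, <ph id="p1" dataRef="d1"/> !</target>
      </segment>
    </unit>
  </file>
</xliff>
```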
Proper internationalization should of course strive to minimize the amount of code within content, but of course it's not always possible. XLIFF also solves the structural and project management issues, but I'd say this is out of scope for a message format discussion.
I think the key is to preserve the data model assumptions that make internationalization and localization possible. Whatever the agreed message format ends up being, it should be tested for XLIFF (or JLIFF) roundtrip capability.
If you create something that a linguist is supposed to edit directly, this might seem SME-friendly, but it doesn't scale. Ideally you want your format to be easily supportable by tools. But the format will not be supportable by tools if it violates the basic set of data model assumptions that Mihai outlined early in this thread.
Cheers, dF |
This appears to have been addressed by the adoption of EBNF and later ABNF syntaxes. It is also a bit non-specific: it is a design principle that I think this group aspires to uphold. |
As mentioned in today's telecon (2023-09-18), closing old requirements issues. Note: this specific issue was a topic of interest at the face-to-face and in the feedback we received in Seville and there is work on simplifying the syntax as a result. |
Is your feature request related to a problem? Please describe.
Linguistic challenges are complex, and providing a simple way to solve them can also be complex. Message Format seems to have kept a certain level of simplicity, which makes adoption easier, especially by non-technical users like linguists.
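For reference, the kind of raw syntax linguists can be trained to edit directly today is an ordinary ICU MessageFormat message, for example (illustrative):

```
{numFiles, plural,
    one {You have # file ready for review.}
    other {You have # files ready for review.}}
```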
Describe the solution you'd like
I would like the new syntax to remain simple (at least as simple as Message Format today, or even simpler if that is possible).
Describe why your solution should shape the standard
By having a simple syntax, we help both authors and linguists manipulate the raw syntax without having to spend too much time learning it.
There is also a limit in terms of complexity that linguists will be willing to learn, especially if we are aiming for global adoption. Linguists are language experts, not engineers.
If we presume that raw syntax cannot be translated directly by linguists without the need for tools, this means that we will have to rely on other ways to get the translation done.
If the raw syntax is too complex and we have to support some sort of "linguist-friendly format", I am not too sure how this will work for some inflection problems (e.g. adding language-specific syntax by language specialists).
Additional context or examples
Based on personal experience I have seen linguists directly edit several existing syntaxes such as: