Datagen should be fallback-aware #2683
Comments
The two issues are not exact duplicates of each other, so we should keep them both open, but there will probably be a single PR or sequence of PRs that fixes this and the other related issues.
I'll start with "none" and "full" fallback modes.
Discuss: whether datagen was run in "none" or "full" fallback mode impacts the runtime data provider requirements. In "none" mode, we need full fallback at runtime. In "full" mode, we only support the exact language identifiers that were present at datagen time, and we still need basic fallback that handles extension keywords. Failure to do this will cause data lookups to fail when they should otherwise succeed. How do we enforce these invariants?
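To make the failure mode above concrete, here is a minimal sketch of the two runtime situations. Everything in it is hypothetical and not ICU4X API: the string-keyed map stands in for a data provider, `naive_parent` truncates subtags instead of using real CLDR parent data, and `strip_extensions` is a deliberately crude stand-in for "basic fallback that handles extension keywords".

```rust
use std::collections::BTreeMap;

/// Hypothetical basic fallback step: drop a `-u-` Unicode extension if present.
/// (Real extension-keyword handling is per data key; this is a simplification.)
fn strip_extensions(locale: &str) -> &str {
    match locale.find("-u-") {
        Some(idx) => &locale[..idx],
        None => locale,
    }
}

/// Hypothetical naive parent: drop the last `-`-separated subtag.
fn naive_parent(locale: &str) -> Option<&str> {
    locale.rfind('-').map(|idx| &locale[..idx])
}

/// Runtime lookup with no fallback at all: only exact matches succeed.
fn lookup_exact<'a>(data: &'a BTreeMap<String, String>, locale: &str) -> Option<&'a str> {
    data.get(locale).map(String::as_str)
}

/// Runtime lookup that strips extensions and then walks the parent chain.
fn lookup_with_fallback<'a>(
    data: &'a BTreeMap<String, String>,
    locale: &str,
) -> Option<&'a str> {
    let mut current = strip_extensions(locale);
    loop {
        if let Some(payload) = data.get(current) {
            return Some(payload.as_str());
        }
        // Give up once there is no parent left to try.
        current = naive_parent(current)?;
    }
}
```

With data that only contains `en` and `en-IN`, `lookup_exact(&data, "en-US")` fails even though `lookup_with_fallback(&data, "en-US")` would find the `en` payload, which is the "lookups fail when they should otherwise succeed" case. When every requested locale is present in the data, `lookup_exact(&data, strip_extensions(locale))` is enough.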
Discussion:
|
This is blocking baked data, as we should be doing fallback, and tests break if we don't.
Discuss the exact names and behaviors of the datagen fallback modes. See #3487 (comment)
Discuss with: |
Writing down some thoughts on this design.

There are two main categories of fallback-aware datagen: runtime fallback and gentime fallback, which we call Runtime and Expand modes. They have very different semantics.

Runtime Fallback Mode

The regular Runtime mode will ship "batteries included" in databake; it doesn't support Postcard or FS. The Manual Runtime mode works with all providers but requires a LocaleFallbackProvider to be hooked up manually at runtime.

The behavior when using a set-based LocaleInclude is fairly clear. For all languages in the set, we also include regional variants and extension keywords. We deduplicate based on blobs reachable with locale fallback enabled.

If using an explicit list of locales instead of a predefined set, there are a few modes:
Expanded Fallback Mode

This mode is not supported with a predefined locale set; you must use an explicit list of locales. The main open question here seems to be about how to handle extension keywords. I would suggest the following model:
The The extension keyword match things could be in the
For collation, we are still only allowed to select from the set of collations that are made available from the |
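As a rough illustration of the gentime ("Expand") idea described above, the sketch below resolves fallback once at datagen time for an explicit locale list and stores the result under the exact requested locale, so the runtime only needs exact-match lookups. The names `expand_for_locales` and `naive_parent` are hypothetical, the parent step just truncates subtags, and extension keywords are not handled at all here.

```rust
use std::collections::BTreeMap;

/// Hypothetical naive parent: drop the last `-`-separated subtag.
fn naive_parent(locale: &str) -> Option<&str> {
    locale.rfind('-').map(|idx| &locale[..idx])
}

/// For each explicitly requested locale, walk the fallback chain in the
/// source data at gentime and copy the first payload found into the output
/// under the requested locale itself.
fn expand_for_locales(
    source: &BTreeMap<String, String>,
    requested: &[&str],
) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    for &locale in requested {
        let mut current = locale;
        loop {
            if let Some(payload) = source.get(current) {
                out.insert(locale.to_string(), payload.clone());
                break;
            }
            match naive_parent(current) {
                Some(parent) => current = parent,
                // No data anywhere in the chain: the locale is simply absent.
                None => break,
            }
        }
    }
    out
}
```

The trade-off in this sketch is that the output has one lookup entry per requested locale, even when several of them share the same payload.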
Concrete proposal for the CLI that I think covers everything:
|
I really like this proposal. May be worth bikeshedding the name of the first flag ( |
A few examples:
|
Revised proposal for the MVP, to be extended later as client needs arise:
With the following interaction between fallback and locales:
* or Runtime Manual

Note that I renamed "Pass-Through" to "Hybrid". I think I would like to rename "Expand" to "Computed", "Pre-Computed", or similar, but we can bikeshed that.
Discussion:
|
Currently, datagen will generate data for each key that is requested of it. In `--all-locales` [1] mode, this means that for, e.g., `decimal/symbols@1`, there will be identical entries for a whole group of locales even though they all fall back to `sr`. BlobDataProvider and BakedDataProvider deduplicate, so this doesn't lead to data duplication, but it does mean there are nine entries in the lookup array when there only needs to be one (when fallback is enabled).

The situation is worse for `en-*`, where for some data keys there are probably up to 105 duplicate entries! This will only get worse as CLDR adds locales.

From @zbraniecki's measurements, this is actually causing a nontrivial cost in our constructors when run with full-data.
It should be possible to tell `datagen` which fallback algorithm is used, something like `--fallback-algorithm={none, naive, full}` (see #2686 about naïve fallback). This would deduplicate the keys based on the fallback algorithm; e.g., all the `sr` number formatting keys would collapse to one `sr` entry. Something similar would occur for `en`, and we'd probably end up with `en-IN` and some other locales with unique entries along with the majority collapsed to `en`.
`--all-locales` (or `--all-cldr-locales`) should likely force the user to select a fallback algorithm, as there are footguns pointed at both your feet here if we just pick a default.

To motivate why people wouldn't always want to run with `--fallback-algorithm=full`: the full fallback algorithm is itself expensive in data, CPU time, and code size. Embedded platforms are likely not going to be working with a wide set of locales simultaneously; rather, they will often just need to hotload data for one locale (and if settings change, they can request data for one different locale), so fallback isn't really important.

cc @zbraniecki @sffc
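The deduplication proposed above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: payloads are plain strings, the lookup table is a `BTreeMap`, and `naive_parent` (a hypothetical helper) truncates subtags rather than running the full fallback algorithm. None of these names are actual datagen API.

```rust
use std::collections::BTreeMap;

/// Hypothetical naive parent: drop the last `-`-separated subtag.
fn naive_parent(locale: &str) -> Option<&str> {
    locale.rfind('-').map(|idx| &locale[..idx])
}

/// Drop every entry whose payload equals what fallback would reach anyway,
/// i.e. the payload of its nearest ancestor that is kept in the output.
fn dedup_by_fallback(entries: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    let mut kept: BTreeMap<String, String> = BTreeMap::new();
    // BTreeMap iterates in sorted order, so a parent such as "sr" is always
    // visited before its children such as "sr-Cyrl".
    for (locale, payload) in entries {
        let redundant = {
            let mut current = locale.as_str();
            let mut found = false;
            while let Some(parent) = naive_parent(current) {
                if let Some(inherited) = kept.get(parent) {
                    found = inherited == payload;
                    break;
                }
                current = parent;
            }
            found
        };
        if !redundant {
            kept.insert(locale.clone(), payload.clone());
        }
    }
    kept
}
```

In this sketch, an input with `en`, `en-US`, and `en-IN` where only `en-IN` differs from `en` ends up with two entries, `en` and `en-IN`. The output is only correct if the runtime applies the same fallback chain that was assumed here, which is why the choice of algorithm would need to be explicit.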
Footnotes
1. Which should probably be called `--all-cldr-locales` for clarity.