Datagen should be fallback-aware #2683

Manishearth · 2022-09-29T18:35:45Z

Currently, datagen will generate data for each key that is requested of it. In --all-locales¹ mode, this means that for, e.g. decimal/symbols@1, there will identical entries for all of these locales:

even though they all fall back to sr. BlobDataProvider and BakedDataProvider deduplicate, so this doesn't lead to data duplication, but it does mean there are nine entries in the lookup array when there only needs to be one (when fallback is enabled).

The situation is worse for en-* where for some data keys there are probably up to 105 duplicate entries! This will only get worse as CLDR adds locales.

From @zbraniecki's measurements this is actually causing a nontrivial cost in our constructors when run with full-data.

It should be possible to tell datagen which fallback algorithm is used, something like --fallback-algorithm={none, naive, full} (see #2686 about naïve fallback). This would deduplicate the keys based on fallback algorithm; e.g. all the sr number formatting keys would collapse to one sr entry. Something similar would occur for en and we'd probably end up with en-IN and some other locales with unique entries along with the majority collapsed to en.

Furthermore, --all-locales (or --all-cldr-locales) should likely force the user to select a fallback algorithm, as there are footguns pointed at both your feet here if we just pick a default:

If you generate data without fallbacking you will get a massive lookup vector
If you generate data with fallbacking but haven't enabled fallbacking in your code, stuff won't work at runtime.

To motivate why people wouldn't always want to run with --fallback-algorithm=full, the full fallback algorithm is itself expensive in data, CPU time, and codesize. Embedded platforms are likely not going to be working with a wide set of locales simultaneously, rather they will often just need to hotload data for one locale (and if settings change, they can request data for one different locale), so fallbacking isn't really important.

cc @zbraniecki @sffc

Which should probably be called --all-cldr-locales for clarity ↩

The text was updated successfully, but these errors were encountered:

sffc · 2022-09-29T22:28:29Z

#834

Manishearth · 2022-09-29T22:42:26Z

@sffc do you think we should close one of these in favor of the other, or is this more a subset issue of #834? I do think this has a sketch of a more concrete plan to tackle a subpart of this.

sffc · 2022-09-30T02:05:50Z

@sffc do you think we should close one of these in favor of the other, or is this more a subset issue of #834? I do think this has a sketch of a more concrete plan to tackle a subpart of this.

The two issues are not exact duplicates of each other, so we should keep them both open, but there will probably be a single PR or sequence of PRs that will fix this and the other related issues.

sffc · 2022-12-16T17:54:51Z

I'll start with "none" and "full" fallback modes.

sffc · 2023-04-18T19:02:14Z

Discuss: whether datagen was run in "none" or "full" fallback mode impacts the runtime data provider requirements. If "none" mode, we need full fallback at runtime. In "full" mode, we only support the exact language identifiers that were present at datagen time, and we still need basic fallback that handles extension keywords. Failure to do this will cause data lookups to fail when they should otherwise succeed. How do we enforce these invariants?

sffc · 2023-04-20T18:06:44Z

Discussion:

The three runtime providers we support (blob, fs, and unstable) should all add the correct type of fallbacking on their own. It should be driven by metadata in the blob/fs or in the generated code for databake. Could include no_fallback constructors on those three types.
We could add a flag in DataRequestMetadata to signal whether fallback is taking place here or elsewhere in the tree.

robertbastian · 2023-06-14T16:36:20Z

This is blocking baked data, as we should be doing fallback, and tests break if we don't.

sffc · 2023-06-21T19:43:07Z

Discuss the exact names and behaviors of the datagen fallback modes. See #3487 (comment)

sffc · 2023-06-22T17:36:57Z

Discuss with:

sffc · 2023-07-07T16:57:58Z

Writing down some thoughts on this design.

There are two main categories of fallback-aware datagen: runtime fallback and gentime fallback, which we call Runtime and Expand modes. They have very different semantics.

Runtime Fallback Mode

The regular Runtime mode will ship "batteries included" in databake; it doesn't support Postcard or FS. The Manual Runtime mode works with all providers but requires a LocaleFallbackProvider to be hooked up manually at runtime.

The behavior when using a set-based LocaleInclude is fairly clear. For all languages in the set, we also include regional variants and extension keywords. We deduplicate based on blobs reachable with locale fallback enabled.

If using an explicit list of locales instead of a predefined set, there are a few modes:

Include regional variants and extension keywords that are descendants of a listed entry in the explicit list. For example, if en is in the list, it implies that all locales with en in their inheritance chain will also be included. Open question: does this include cross-language fallback like hi-Latn?
Do not include regional variants unless they are explicitly listed, but do include extension keywords like numbering system preferences
Do not include regional variants or extension keywords unless they are explicitly listed (see below)

Expanded Fallback Mode

This mode is not supported with a predefined locale set; you must use an explicit list of locales.

The main open question here seems to be about how to handle extension keywords. I would suggest the following model:

Do not include extension keywords unless explicitly listed
Add a match-like syntax such as the following:
- ar-EG-u-nu-latn means to include that specific numbering system entry
- ar-EG-u-nu-* means to include all numbering system entries for that langid
- *-u-nu-latn means to include that numbering system for all langids that are otherwise being included
- *-u-nu-* means to include all numbering system entries for all langids that are otherwise being included
- ar-EG-u-* means to include all extensions for ar-EG
- *-u-* means to include all extensions for all langids that are otherwise being included

The *-u-... could instead be mul-u-... to be more like BCP-47.

The extension keyword match things could be in the LocaleInclude::Explicit or in a separate option. I might lean toward a separate option at this point because:

LocaleInclude::Explicit uses LanguageIdentifier which has some benefits such as serializability, and we utilize these types heavily
We can be more flexible with the syntax; don't need to fit the mold of Locale or LanguageIdentifier

For collation, we are still only allowed to select from the set of collations that are made available from the collations option in datagen. For example, in order to include gb2312, it is necessary to enable it in both source.options.collations and the *-u-co-gb2312 locale filter.

CC @robertbastian

sffc · 2023-07-07T18:27:11Z

Concrete proposal for the CLI that I think covers everything:

--fallback takes the following options
- runtime (default for baked)
- runtime-manual
- expand
- pass-through (default for FS and blob)
  - (I think @robertbastian doesn't like this name)
--locales needs to resolve to an explicit list in expand mode
--extensions takes the following options
- all (default)
- none
- list of pattern matches as described in my previous comment
--regions takes the following options
- all (default for all fallback modes except expand)
- none (default for expand)
- list of regions that will be applied to all languages

Manishearth · 2023-07-07T22:15:16Z

I really like this proposal. May be worth bikeshedding the name of the first flag (--for-fallback-strategy?) but i think it's fine.

sffc · 2023-07-08T08:50:34Z

A few examples:

--locales en en-GB de-CH --fallback runtime
- All regional variants inheriting from en, en-GB, or de-CH are included
- All extensions are included
- Entries are fallback-deduplicated
--locales en en-GB de-CH --fallback pass-through
- Same as above except entries are not fallback-deduplicated
- Observation: an explicit locale list with fallback pass-through should probably also expand the locales in the explicit list if they aren't already expanded
--locales en en-GB de-CH --regions none --fallback expand
- Only the three listed locales are included; no other language IDs
- Extensions for those specific langids are included
--locales modern --regions US GB DE --fallback runtime
- All modern languages (and language-scripts) are included
- Only the US, GB, and DE regional variants are included
- Extensions are included
--locales en de --regions US DE --fallback expand
- The full permutation en-US, en-DE, de-US, and de-DE are included and expanded
- Extensions relevant for those locales are included

sffc · 2023-07-08T14:48:25Z

Revised proposal for the MVP, to be extended later as client needs arise:

--fallback takes the following options
- runtime (default for baked)
- runtime-manual
- expand
- hybrid (default for FS/blob)
--extensions takes the following options
- all (default)
- none
- additional options can be added later
--regions is not yet exposed through the CLI or API but takes the following behavior:
- all (default for all fallback modes except expand)
- none (default for expand)

With the following interaction between fallback and locales:

Fallback	LocaleInclude	Behavior
Runtime*	Pre-defined Set	Include everything in the set with deduplication
Runtime*	Explicit	Include ancestors and descendants of the explicit locales with deduplication
Expand	Pre-defined Set	NOT ALLOWED (doesn't make sense)
Expand	Explicit	Include exactly the locales specified
Hybrid	Pre-defined Set	Include everything in the set
Hybrid	Explicit	Include ancestors and descendants of the explicit locales and also ensure that the explicit locales have entries

* or Runtime Manual

Note that I renamed "Pass-Through" to "Hybrid". I think I would like to rename "Expand" to "Computed" or "Pre-Computed" or similar but we can bikeshed that.

sffc · 2023-07-10T11:59:48Z

Discussion:

Rename "Expand" to "Preresolved" (--fallback preresolved)
ExtensionsInclude:
- ::Recommended (default)
- ::Custom(vec!["!*-u-co-*", "und-u-co-search*", "!*-u-nu-*"])
  - Include/Exclude list similar to ICU buildtool
  - Note: All can be achieved with "" and None with "!"
- It would be very nice to land this option in 1.3 but it's not a hard block

LGTM: @robertbastian @sffc @Manishearth @mosuem

Manishearth added C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Sep 29, 2022

zbraniecki mentioned this issue Sep 29, 2022

Add NaiveFallback #2686

Open

Manishearth mentioned this issue Sep 30, 2022

Add smarter locale filtering in DataExporter #834

Closed

sffc mentioned this issue Oct 1, 2022

Speed up resource lookup and fallback at runtime #2699

Open

sffc mentioned this issue Nov 7, 2022

Diagnostics icu4x based list formatting. rust-lang/rust#104047

Merged

sffc added this to the ICU4X 1.2 milestone Dec 16, 2022

sffc self-assigned this Dec 16, 2022

sffc added S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 16, 2022

This was referenced Feb 14, 2023

Implement datagen-time locale fallback #3110

Closed

Moving source data out of icu_testdata #3044

Merged

sffc modified the milestones: 1.2 Blocking ⟨P1⟩, 1.3 Blocking ⟨P1⟩ Mar 23, 2023

sffc added discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Apr 18, 2023

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Apr 20, 2023

robertbastian mentioned this issue Jun 5, 2023

Fallback-aware datagen #3487

Merged

robertbastian mentioned this issue Jun 14, 2023

A path to killing checked in postcard #3530

Closed

sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 21, 2023

robertbastian mentioned this issue Jun 22, 2023

Batteries-included baked provider #3549

Merged

sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023

This was referenced Jul 5, 2023

Reshape the LocaleFallbacker API to have the WithConfig intermediate #3636

Merged

Add ExportableDataPayload::eq_dyn #3639

Merged

Enable datagen fallbacking with full set of options #3640

Merged

sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 10, 2023

robertbastian mentioned this issue Aug 2, 2023

Add LocaleFallbackProvider to tutorial examples #3334

Merged

sffc mentioned this issue Aug 3, 2023

Datagen: Consume CLDR-JSON resources keyed with default script #3772

Merged

sffc mentioned this issue Aug 24, 2023

Add ability to filter extensions and regions in datagen #3932

Open

sffc closed this as completed Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Datagen should be fallback-aware #2683

Datagen should be fallback-aware #2683

Manishearth commented Sep 29, 2022 •

edited

Loading

sffc commented Sep 29, 2022

Manishearth commented Sep 29, 2022

sffc commented Sep 30, 2022

sffc commented Dec 16, 2022

sffc commented Apr 18, 2023

sffc commented Apr 20, 2023

robertbastian commented Jun 14, 2023

sffc commented Jun 21, 2023

sffc commented Jun 22, 2023

sffc commented Jul 7, 2023

sffc commented Jul 7, 2023 •

edited

Loading

Manishearth commented Jul 7, 2023

sffc commented Jul 8, 2023

sffc commented Jul 8, 2023

sffc commented Jul 10, 2023

Datagen should be fallback-aware #2683

Datagen should be fallback-aware #2683

Comments

Manishearth commented Sep 29, 2022 • edited Loading

Footnotes

sffc commented Sep 29, 2022

Manishearth commented Sep 29, 2022

sffc commented Sep 30, 2022

sffc commented Dec 16, 2022

sffc commented Apr 18, 2023

sffc commented Apr 20, 2023

robertbastian commented Jun 14, 2023

sffc commented Jun 21, 2023

sffc commented Jun 22, 2023

sffc commented Jul 7, 2023

Runtime Fallback Mode

Expanded Fallback Mode

sffc commented Jul 7, 2023 • edited Loading

Manishearth commented Jul 7, 2023

sffc commented Jul 8, 2023

sffc commented Jul 8, 2023

sffc commented Jul 10, 2023

Manishearth commented Sep 29, 2022 •

edited

Loading

sffc commented Jul 7, 2023 •

edited

Loading