Split DateSymbols data #3865

Closed
Manishearth opened this issue Aug 15, 2023 · 21 comments
Labels: C-datetime (Component: datetime, calendars, time zones), S-large (Size: A few weeks; larger feature, major refactoring)

Comments

@Manishearth
Member

DateSymbols is giant and has a lot of things inside it, only a fraction of which actually gets used once a formatter has been constructed.

We should split this type along day/month/year lines, as well as along pattern length lines. (And provide a compatibility path for pre-2.0 V1 data, as usual.)

@Manishearth Manishearth added the C-datetime Component: datetime, calendars, time zones label Aug 15, 2023
@Manishearth
Member Author

Manishearth commented Aug 15, 2023

@sffc and I discussed this a bunch, in the context of fixing #3766 and #3761, which involves adding more data to datetime anyway; we don't want to introduce V2 data for that without doing it right.

The rough proposal we had was that we have the following main symbols keys:

  • Years
    • Either era symbols or cyclic year symbols. It does not make much sense for a calendar to have both, but if it does we can add a third variant
  • Months
    • month symbols
  • Weekdays
  • Days (maybe):
    • we can store day names as well if we end up having patterns for day names like those in Chinese or Hindu
    • worth sketching out, should not be a part of the MVP, can be added retroactively in the future
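A minimal sketch of how the proposed split might look as types. All names here (`YearSymbols`, `MonthSymbols`, `WeekdaySymbols`, `LoadedSymbols`) are hypothetical and chosen for illustration; they are not the actual ICU4X data structs.

```rust
// Hypothetical shapes for the split symbol data; these are NOT real ICU4X types.

/// Year symbols: a calendar is expected to use eras or cyclic years, not both.
/// A third variant could be added later if a calendar needs both.
#[derive(Debug, PartialEq)]
pub enum YearSymbols {
    Eras(Vec<String>),
    Cyclic(Vec<String>),
}

#[derive(Debug, PartialEq)]
pub struct MonthSymbols(pub Vec<String>);

#[derive(Debug, PartialEq)]
pub struct WeekdaySymbols(pub Vec<String>);

/// What a constructed formatter would hold: only the pieces its pattern needs.
#[derive(Debug, Default)]
pub struct LoadedSymbols {
    pub years: Option<YearSymbols>,
    pub months: Option<MonthSymbols>,
    pub weekdays: Option<WeekdaySymbols>,
}

impl LoadedSymbols {
    /// Looks up a year name by zero-based index, whichever variant is loaded.
    pub fn year_name(&self, index: usize) -> Option<&str> {
        match self.years.as_ref()? {
            YearSymbols::Eras(names) | YearSymbols::Cyclic(names) => {
                names.get(index).map(String::as_str)
            }
        }
    }
}
```

With this layout, a formatter whose pattern has no month or weekday field simply leaves those slots as `None` and never loads that data.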

The symbols keys use an auxiliary key (#3632) to store the eight-way length distinction (abbreviated, narrow, short, wide) × (format, standalone). The current fallbacking between them will be performed either at datagen or via carefully done auxiliary key fallback (essentially, ensure that und is always empty for aux keys). See #3867.

Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found. We do not store leap year patterns for calendars that generate leap year names in a pattern based way (Chinese, Dangi).
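The "attempt to load, but do not error if missing" behavior for the numeric key could look roughly like the sketch below. `FakeProvider`, `LoadError`, `MonthData`, and the literal aux key `"numeric"` are stand-ins invented for this example, not ICU4X APIs.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum LoadError {
    /// The requested aux key has no data.
    Missing,
}

/// Hypothetical stand-in for a data provider: maps aux key -> month symbols.
pub struct FakeProvider {
    pub months: HashMap<&'static str, Vec<&'static str>>,
}

impl FakeProvider {
    pub fn load(&self, aux: &str) -> Result<&Vec<&'static str>, LoadError> {
        self.months.get(aux).ok_or(LoadError::Missing)
    }
}

pub struct MonthData<'a> {
    pub names: &'a Vec<&'static str>,
    /// Present only for calendars that ship special numeric month data.
    pub numeric: Option<&'a Vec<&'static str>>,
}

pub fn load_months<'a>(
    provider: &'a FakeProvider,
    length_aux: &str,
) -> Result<MonthData<'a>, LoadError> {
    // The requested length is required: a missing entry is an error.
    let names = provider.load(length_aux)?;
    // The numeric entry is optional: a missing entry becomes `None`.
    let numeric = provider.load("numeric").ok();
    Ok(MonthData { names, numeric })
}
```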

For the rare pattern that needs multiple lengths to format something, we can store additional loaded data in an Option on the DateTimeFormat.

Finally, lengths would be as we have today, but they may also include a numbering system hint/override (e.g. hanidec/hanidays). This may potentially be per-field (see footnote 1), which may mean we potentially load multiple number formatters. Currently the overrides are hanidec, d=hanidays, hebr, M=romanlow, and y=jpanyear. Since we don't have RBNF yet, I would recommend we just hardcode an enum for now and hardcode these numbering systems; it's not too hard to implement these in code and I think it's okay to do for such a small set.
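As a rough illustration of the "hardcode an enum" idea, the sketch below parses a single override string of the form `name` or `field=name` and implements one of the systems (hanidec, which is a positional decimal system using the digits 〇一二三四五六七八九). The type and function names are hypothetical, and only a single override per string is handled.

```rust
/// Fields that an override can target (pattern letters y, M, d).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Field {
    Year,
    Month,
    Day,
}

/// The small, hardcoded set of overrides listed in the discussion.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum NumberingOverride {
    Hanidec,
    Hanidays,
    Hebr,
    Romanlow,
    Jpanyear,
}

/// Parses one override such as "hanidec" (applies to all fields, returned as
/// `None`) or "d=hanidays" (applies to one field).
pub fn parse_override(s: &str) -> Option<(Option<Field>, NumberingOverride)> {
    let (field, name) = match s.split_once('=') {
        Some((f, n)) => {
            let field = match f {
                "y" => Field::Year,
                "M" => Field::Month,
                "d" => Field::Day,
                _ => return None,
            };
            (Some(field), n)
        }
        None => (None, s),
    };
    let system = match name {
        "hanidec" => NumberingOverride::Hanidec,
        "hanidays" => NumberingOverride::Hanidays,
        "hebr" => NumberingOverride::Hebr,
        "romanlow" => NumberingOverride::Romanlow,
        "jpanyear" => NumberingOverride::Jpanyear,
        _ => return None,
    };
    Some((field, system))
}

/// Formats a number with hanidec digits, digit by digit.
pub fn format_hanidec(n: u32) -> String {
    const DIGITS: [char; 10] = ['〇', '一', '二', '三', '四', '五', '六', '七', '八', '九'];
    n.to_string()
        .bytes()
        .map(|b| DIGITS[(b - b'0') as usize])
        .collect()
}
```

The other systems (hanidays, hebr, romanlow, jpanyear) are algorithmic rather than digit-for-digit and would each need their own small function.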

cc @eggrobin who has thought about this a bit in the context of skeleta.

Footnotes

  1. E.g. in Chinese date formatting it is common to use hanidec or Latin for the year, hanidays for the day, and hans (spelled out Han) for the months. ICU4C currently handles this by using d=hanidays in the dateFormats.[length].numbers key and using month symbols to mimic hans.

@Manishearth
Member Author

cc @zbraniecki @robertbastian

@Manishearth
Member Author

Also our plan for #3766 and #3761 for 1.3 is to just let it slip and document the Chinese calendar as being a preview calendar when it comes to formatting. We can clean up the placeholders and use mostly-correct placeholders instead.

@Manishearth
Member Author

Discussed a bit

  • @Manishearth - Leap month display is a bit of a mess; we can't use a string table because of numbering systems. A similar thing for cyclic years is that you need to pick from the 60 different year names, so we need to add more data (the 60 names). For leap months, either we need to include a pattern, or we need to include special data for numeric months. Right now we have a single DateTimeSymbols object, plus DateLengths. The longer term solution, documented in Split DateSymbols data #3865, is that we split the date symbols data into smaller pieces: Years, which is either cyclic year symbols or era symbols (there is not currently a calendar that uses both, but we can support it later if needed); Months; Weekdays; and a Days key for day name formatting if we want to support that (ICU4C doesn't support it as far as we know). On top of this, we'll use aux keys to store the 8-way length distinction: Narrow, Abbreviated, Short, Wide, between Format and Standalone. We can load them via aux key fallback. Numeric becomes an additional data key that doesn't need to exist. We don't need to store leap year patterns then. There's a potential situation where you need multiple lengths to format a single field, like if a pattern contains both M and MMM; we can handle this with options. So this is the final design. However, this is not a design we can do for 1.3. I don't want to block 1.3 on this. Which means we have some short-term solutions:
    1. Create new keys for leap months and cyclic years. The keys will be short-lived. We'll need to change the code again next time.
    2. Hard-code the data for Chinese and Korean as a best-effort, and say that this functionality is preview.
    3. Say that these calendars are unstable and we hide them.
  • @Manishearth - The main reason not to do 1 is that we have a better plan. Is there really value for doing the haphazard thing in 1.3? I think probably no.
  • @echeran - So that means that you'll get Chinese characters in the Chinese calendar even if that's not your locale?
  • @Manishearth - Yeah, which I think is okay. It's understandable, not ugly, and mostly correct for the main users.
  • @sffc - I think we should prioritize the good solution in 1.4 because we have users who need Chinese calendar. I want to get 1.3 out the door ASAP because it has things including compiled data. If we don't have time for the good solution in 1.4, then we could implement option 1 in 1.4.
  • @robertbastian - We need to write good docs that basically say that these calendars are experimental.

Conclusion: implement option 2 for 1.3.

LGTM: @Manishearth @sffc @echeran (no strong opinion: @robertbastian, @skius)

@Manishearth
Member Author

@sffc and I discussed whether we should use aux keys or regular keys for lengths. We didn't dive too deep into the hour cycle part since that is something that can be more easily tweaked later (whereas the lengths are pervasive).

The main benefit of using separate (regular) keys is that they enable more build time slicing: if you know in advance what lengths you'll need, you can slice things appropriately. However, since most ways of interacting with this will be via skeletons or overall lengths, this becomes a bit less easy to do with the layers of indirection. We could potentially design a highly typed API in which datagen generates traits linking skeletons to keys, but this feels like overkill. It seems like the main win is only if the user can specify exactly what lengths they want.

Separate keys also have the advantage of being slightly smaller in databake (though not blob), because instead of storing a massive locale lookup array it can store a much smaller lookup array that is deduplicated across keys (especially if we choose to resolve length fallback during datagen).

On the other hand, aux keys are cleaner (we don't end up with hundreds of symbols keys) and easier to deal with. In the long run we can experiment with various horizontal fallback options (see discussion in #3867). There may also be options for optimization in the future by passing around binary search hints.

One major benefit of aux keys is that users can slice them out if they would like (we can do a very simple fallback algorithm in our code to handle this: if you don't find long, go check out medium, etc.).
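The "very simple fallback algorithm" mentioned above might be sketched as follows. The `Length` enum, the long → medium → short chain, and the `HashMap` standing in for the provider are all assumptions for illustration; the thread does not fix the actual fallback order.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Length {
    Long,
    Medium,
    Short,
}

impl Length {
    /// Assumed fallback chain: long -> medium -> short -> (nothing).
    pub fn next_fallback(self) -> Option<Length> {
        match self {
            Length::Long => Some(Length::Medium),
            Length::Medium => Some(Length::Short),
            Length::Short => None,
        }
    }
}

/// Returns the first available entry along the fallback chain, together with
/// the length that was actually found.
pub fn load_with_fallback<'a>(
    data: &'a HashMap<Length, &'static str>,
    wanted: Length,
) -> Option<(Length, &'a str)> {
    let mut current = Some(wanted);
    while let Some(len) = current {
        if let Some(value) = data.get(&len) {
            return Some((len, *value));
        }
        current = len.next_fallback();
    }
    None
}
```

If a user slices out the long data at build time, a request for `Length::Long` then quietly resolves to the medium entry instead of failing.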

We decided to go with aux keys for now. We may measure things later and see if there are other benefits.

@Manishearth
Member Author

Manishearth commented Oct 19, 2023

Listing out aux keys for each thing:

(a/n/s/w = abbr/narrow/short/wide, f/s = format/standalone)

  • Months: a/n/w × f/s
    • Special key "numeric". We probably want a separate key for this, can be done later.
  • Weekdays: a/n/s/w × f/s
  • Quarters: a/n/w × f/s
  • (cyclic) Days: a/n/w × f/s
    • I don't actually see anyone customizing standalone days.
  • Eras: names / abbr / narrow. @sffc is there any reason we should not just call "names" "wide" instead?

Given that standalone is the rarer one, I would recommend key names like -x-a and -x-as (i.e. "format" is implicit). This keeps them short and lets us easily add standalone keys in the future for things like days, where we don't have any usage right now.

@sffc
Member

sffc commented Oct 19, 2023

Thought: we could use a digit corresponding to the number of symbols in https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table, like:

  • -x-3 = abbreviated
  • -x-4 = wide
  • -x-5 = narrow

And standalone could be

  • -x-3s = abbreviated
  • -x-4s = wide
  • -x-5s = narrow

or maybe

  • -x-f3 = format abbreviated
  • -x-f4 = format wide
  • -x-f5 = format narrow
  • -x-s3 = standalone abbreviated
  • -x-s4 = standalone wide
  • -x-s5 = standalone narrow

@Manishearth
Member Author

Makes sense. My instinct is to let format be the "default" because in some cases there is no data for standalone and we can save space by hardcoding that assumption in ICU4X and datagen (but tweaking it in a backcompat way if it changes).
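Combining the digit-based naming with "format is the default", the aux key strings could be built as in this sketch. The `Length` and `Context` enums and the `aux_subtag` function are hypothetical names; only the `-x-3` / `-x-3s` style strings come from the discussion.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Length {
    Abbreviated,
    Wide,
    Narrow,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Context {
    Format,
    Standalone,
}

/// Builds the aux key suffix: the digit is the number of pattern symbols for
/// that length, and a trailing `s` marks standalone. Format is implicit.
pub fn aux_subtag(length: Length, context: Context) -> String {
    let digit = match length {
        Length::Abbreviated => '3',
        Length::Wide => '4',
        Length::Narrow => '5',
    };
    let mut subtag = String::from("-x-");
    subtag.push(digit);
    if context == Context::Standalone {
        subtag.push('s');
    }
    subtag
}
```

Because format keys carry no suffix, standalone data can be added later for a field without renaming any existing keys.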

@Manishearth
Member Author

The current design for DTF integration is that we load one of each type of field needed (one month symbol, etc).

If a pattern needs multiple fields, we can later add in a Map<Field, Box<dyn Any>> situation for storing extra fields.
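A minimal sketch of the `Map<Field, Box<dyn Any>>` idea using `std::any::Any`. The `Field` enum, `MonthNames`, and `ExtraSymbols` are hypothetical names, and a real implementation would likely key on more than the field alone.

```rust
use std::any::Any;
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Field {
    Month,
    Weekday,
}

/// Example payload type for extra month symbols.
pub struct MonthNames(pub Vec<&'static str>);

/// Type-erased storage for symbols beyond the one-per-field fast path.
#[derive(Default)]
pub struct ExtraSymbols {
    map: HashMap<Field, Box<dyn Any>>,
}

impl ExtraSymbols {
    pub fn insert<T: Any>(&mut self, field: Field, value: T) {
        self.map.insert(field, Box::new(value));
    }

    /// Returns the stored value if one exists for `field` and has type `T`.
    pub fn get<T: Any>(&self, field: Field) -> Option<&T> {
        self.map.get(&field)?.downcast_ref::<T>()
    }
}
```

The common case stays in strongly typed fields on the formatter; only the rare pattern pays for the map lookup and downcast.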

@sffc
Member

sffc commented Oct 22, 2023

Some initial numbers of postcard with different fallback modes. Number in parentheses is the point in the postcard file at which the sorted locale lookup VarZeroVec ends and the data table begins.

| Key | Postcard, Runtime | Postcard, Hybrid | Postcard, Data Only |
|---|---|---|---|
| datetime/gregory/datesymbols@1 | 186558 (0x869) | 190722 (0x15b5) | 184405 |
| datetime/symbols/gregory/years@1 | 30101 (0x2058) | 49129 (0x60ac) | 21821 |
| datetime/symbols/gregory/months@1 | 105222 (0x4de1) | 141505 (0xc936) | 85285 |
| datetime/symbols/weekdays@1 | 76988 (0x5b47) | 129035 (0x10c42) | 53621 |
The sum of the data only size of the three split keys is 160727, which is smaller than the 184405 in the single combined key. However, since the split keys require more locale lookup tables, the overall size is a bit larger. We are investigating ways to reduce the size of the locale lookup tables (e.g. #2699).
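The comparison can be reproduced from the table with a few lines of arithmetic. The constants below are copied from the table; the names `COMBINED`, `SPLIT`, and `split_totals` are just for this sketch.

```rust
/// Combined key sizes in bytes: (postcard runtime, postcard hybrid, data only).
pub const COMBINED: (u32, u32, u32) = (186558, 190722, 184405);

/// Split key sizes in bytes: (key, postcard runtime, postcard hybrid, data only).
pub const SPLIT: [(&str, u32, u32, u32); 3] = [
    ("years", 30101, 49129, 21821),
    ("months", 105222, 141505, 85285),
    ("weekdays", 76988, 129035, 53621),
];

/// Sums each size column across the three split keys.
pub fn split_totals() -> (u32, u32, u32) {
    SPLIT.iter().fold((0, 0, 0), |acc, row| {
        (acc.0 + row.1, acc.1 + row.2, acc.2 + row.3)
    })
}
```

Here `split_totals()` yields 212311 bytes for runtime fallback and 160727 bytes of data only, against 186558 and 184405 for the combined key, so the roughly 51 kB gap between the two split totals is lookup table overhead.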

Example command line to generate one cell in the table: cargo run --release --bin icu4x-datagen -- --format blob --locales full --keys "datetime/symbols/gregory/months@1" -f runtime-manual

@sffc
Member

sffc commented Oct 22, 2023

Very initial estimates for the impact of ZeroTrie on the postcard locale lookup table size, based on the strings in the compiled_data files (not the same set of locales as in the previous post):

| Key | VZV, Runtime | ZT, Runtime |
|---|---|---|
| datetime/gregory/datesymbols@1 | 889 | 831 |
| datetime/symbols/gregory/years@1 | 3935 | 2730 |
| datetime/symbols/gregory/months@1 | 10223 | 5104 |
| datetime/symbols/weekdays@1 | 10595 | 5096 |

So the bigger the VZV the bigger the win, with about a 50% win for the larger ones. If we project these ratios back to the full data set above, we stand to save something on the order of 25 kB in the sum of the split keys data size, which would bring the total split key size (runtime fallback mode, including lookup tables) down to just about the same as the combined key size.

@sffc
Member

sffc commented Oct 22, 2023

I missed something in #3865 (comment). The lookup table is not only a VZV of locale strings; it is also a FZV of a mapping from the VZV index to the data blob index. With ZeroTrie we do not need that extra index-to-index table. If you include the extra table, the total lookup table size is about 15-20% higher than estimated. This means we should be able to cut an additional 5 kB by moving to ZeroTrie.

@sffc
Member

sffc commented Oct 22, 2023

I implemented a ZeroTrie version of BlobSchema in #4207. Results for Gregorian, runtime fallback, and all locales:

| Data Key | Postcard Size |
|---|---|
| datesymbols | 185248 |
| months | 90017 |
| weekdays | 58578 |
| years | 24893 |

The new keys are 173488 bytes total, now including locale lookup metadata, smaller than the combined key. 😃

@Manishearth
Member Author

Yeah that's a good point

@Manishearth
Member Author

Split out numeric symbols stuff in #4242

@Manishearth
Member Author

This is done in neo.

@github-project-automation github-project-automation bot moved this from Being worked on to Done in icu4x 2.0 Sep 13, 2024