Split DateSymbols data #3865

Manishearth · 2023-08-15T00:20:58Z

DateSymbols is giant and has a lot of things inside it, only a fraction of which actually gets used once a formatter has been constructed.

We should split this type along day/month/year lines, as well as along pattern length lines. (And provide a compatibility path for pre-2.0 V1 data, as usual.)

Manishearth · 2023-08-15T05:00:50Z

@sffc and I discussed this a bunch in the context of fixing #3766 and #3761. Those fixes involve adding more data to datetime anyway, and we don't want to introduce V2 data for that without doing it right.

The rough proposal we had was that we have the following main symbols keys:

- Years
  - Either era symbols or cyclic year symbols. It does not make much sense for a calendar to have both, but if one does we can add a third variant.
- Months
  - Month symbols
- Weekdays
- Days (maybe)
  - We can store day names as well if we end up having patterns for day names like those in Chinese or Hindu.
  - Worth sketching out, but should not be a part of the MVP; can be added retroactively in the future.
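To make the shape of this split concrete, here is a minimal sketch of what the per-field payloads could look like. All type and method names (`YearSymbols`, `MonthSymbols`, `WeekdaySymbols`, `LoadedSymbols`, `era_name`, `cyclic_name`) are made up for illustration and are not the actual ICU4X types; real data structs would use zero-copy containers rather than `Vec<String>`.

```rust
/// Hypothetical payload for the "Years" key: a calendar ships either era
/// names or cyclic year names, so a two-variant enum covers both cases.
#[derive(Debug, Clone, PartialEq)]
pub enum YearSymbols {
    /// (era code, display name) pairs.
    Eras(Vec<(String, String)>),
    /// Names for each year of the cycle, in cycle order.
    Cyclic(Vec<String>),
}

/// Hypothetical payload for the "Months" key.
#[derive(Debug, Clone, PartialEq)]
pub struct MonthSymbols(pub Vec<String>);

/// Hypothetical payload for the "Weekdays" key.
#[derive(Debug, Clone, PartialEq)]
pub struct WeekdaySymbols(pub [String; 7]);

/// What a formatter might hold after construction: only the pieces that its
/// pattern actually needs are loaded, the rest stay `None`.
#[derive(Debug, Default)]
pub struct LoadedSymbols {
    pub years: Option<YearSymbols>,
    pub months: Option<MonthSymbols>,
    pub weekdays: Option<WeekdaySymbols>,
}

impl YearSymbols {
    /// Looks up an era name; returns `None` for cyclic-year data.
    pub fn era_name(&self, code: &str) -> Option<&str> {
        match self {
            YearSymbols::Eras(eras) => eras
                .iter()
                .find(|(c, _)| c.as_str() == code)
                .map(|(_, name)| name.as_str()),
            YearSymbols::Cyclic(_) => None,
        }
    }

    /// Looks up a cyclic year name by 1-based position in the cycle;
    /// returns `None` for era data or an out-of-range position.
    pub fn cyclic_name(&self, year_in_cycle: usize) -> Option<&str> {
        match self {
            YearSymbols::Cyclic(names) => year_in_cycle
                .checked_sub(1)
                .and_then(|i| names.get(i))
                .map(String::as_str),
            YearSymbols::Eras(_) => None,
        }
    }
}
```

A third `YearSymbols` variant holding both tables could be added later without touching the month or weekday payloads.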

The symbols keys use an auxiliary key (#3632) to store the eight-way length distinction (abbreviated, narrow, short, wide) × (format, standalone). The current fallbacking between them will be performed either at datagen or via carefully done auxiliary key fallback (essentially, ensure that und is always empty for aux keys). See #3867.

Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found. We do not store leap year patterns for calendars that generate leap year names in a pattern based way (Chinese, Dangi).
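The "attempt to load but do not error if not found" behavior could look roughly like the following. This is only a sketch: `LoadError` and `optional` are hypothetical names, not the real provider error type or API.

```rust
/// Hypothetical error type for a data load.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadError {
    /// The key/locale/aux-key combination has no data.
    MissingData,
    /// Any other failure (e.g. the data could not be deserialized).
    Corrupt,
}

/// Turns a "not found" failure into `Ok(None)` while still propagating
/// every other error. This is what makes the numeric key optional.
pub fn optional<T>(result: Result<T, LoadError>) -> Result<Option<T>, LoadError> {
    match result {
        Ok(value) => Ok(Some(value)),
        Err(LoadError::MissingData) => Ok(None),
        Err(other) => Err(other),
    }
}
```

A constructor would wrap only the numeric load in `optional`; loads for symbols that the pattern requires would keep returning the error.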

For the rare pattern that needs multiple lengths to format something, we can store additional loaded data in an Option on the DateTimeFormat.

Finally, lengths would be as we have today, but they may also include a numbering system hint/override (e.g. hanidec/hanidays). This may potentially be per-field¹, which may mean we potentially load multiple number formatters. Currently the overrides are hanidec, d=hanidays, hebr, M=romanlow, and y=jpanyear. Since we don't have RBNF yet, I would recommend we just hardcode an enum for these numbering systems for now; it's not too hard to implement these in code and I think it's okay to do for such a small set.
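As a rough illustration of the hardcoded-enum idea, here is a sketch covering two of the listed overrides. The enum and function names are hypothetical, and only hanidec (positional Han decimal digits) and romanlow (lowercase Roman numerals) are implemented; hanidays, hebr, and jpanyear are left out because their rules are more involved.

```rust
/// Hypothetical hardcoded set of numbering-system overrides.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumberingOverride {
    Hanidec,
    Romanlow,
}

impl NumberingOverride {
    /// Formats `n` in the selected numbering system.
    pub fn format(self, n: u32) -> String {
        match self {
            NumberingOverride::Hanidec => hanidec(n),
            NumberingOverride::Romanlow => romanlow(n),
        }
    }
}

/// Positional decimal formatting using Han digits.
fn hanidec(n: u32) -> String {
    const DIGITS: [char; 10] = ['〇', '一', '二', '三', '四', '五', '六', '七', '八', '九'];
    n.to_string()
        .bytes()
        .map(|b| DIGITS[(b - b'0') as usize])
        .collect()
}

/// Lowercase Roman numerals using subtractive notation.
/// Note: 0 yields an empty string, since Roman numerals have no zero.
fn romanlow(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    for &(value, numeral) in TABLE.iter() {
        while n >= value {
            out.push_str(numeral);
            n -= value;
        }
    }
    out
}
```

With per-field overrides, a formatter would hold one such value per field (for example, one for `M` and one for `y`) instead of a single number formatter.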

cc @eggrobin who has thought about this a bit in the context of skeleta.

¹ E.g. in Chinese date formatting it is common to use hanidec or Latin for the year, hanidays for the day, and hans (spelled out Han) for the months. ICU4C currently handles this by using d=hanidays in the dateFormats.[length].numbers key and using month symbols to mimic hans.

Manishearth · 2023-08-15T05:01:19Z

cc @zbraniecki @robertbastian

Manishearth · 2023-08-15T05:02:17Z

Also, our plan for #3766 and #3761 for 1.3 is to just let it slip and document the Chinese calendar as being a preview calendar when it comes to formatting. We can clean up the placeholders and use mostly-correct placeholders instead.

Manishearth · 2023-08-24T18:36:07Z

Discussed a bit

@Manishearth - Leap month display is a bit of a mess; we can't use a string table because of numbering systems. A similar thing for cyclic years is that you need to pick from the 60 different year names, so we need to add more data (the 60 names). For leap months, either we need to include a pattern, or we need to include special data for numeric months. Right now we have a single DateTimeSymbols object, plus DateLengths. The longer term solution, documented in Split DateSymbols data #3865, is that we split the date symbols data into smaller pieces: Years, which is either cyclic year symbols or era symbols (there is not currently a calendar that uses both, but we can support it later if needed); Months; Weekdays; and a Days key for day name formatting if we want to support that (ICU4C doesn't support it as far as we know). On top of this, we'll use aux keys to store the 8-way length distinction: Narrow, Abbreviated, Short, Wide, between Format and Standalone. We can load them via aux key fallback. Numeric becomes an additional data key that doesn't need to exist. We don't need to store leap year patterns then. There's a potential situation where you need multiple lengths to format a single field, like if a pattern contains both M and MMM; we can handle this with options. So this is the final design. However, this is not a design we can do for 1.3. I don't want to block 1.3 on this. Which means we have some short-term solutions:
1. Create new keys for leap months and cyclic years. The keys will be short-lived. We'll need to change the code again next time.
2. Hard-code the data for Chinese and Korean as a best-effort, and say that this functionality is preview.
3. Say that these calendars are unstable and we hide them.
@Manishearth - The main reason not to do 1 is that we have a better plan. Is there really value for doing the haphazard thing in 1.3? I think probably no.
@echeran - So that means that you'll get Chinese characters in the Chinese calendar even if that's not your locale?
@Manishearth - Yeah, which I think is okay. It's understandable, not ugly, and mostly correct for the main users.
@sffc - I think we should prioritize the good solution in 1.4 because we have users who need Chinese calendar. I want to get 1.3 out the door ASAP because it has things including compiled data. If we don't have time for the good solution in 1.4, then we could implement option 1 in 1.4.
@robertbastian - We need to write good docs that basically say that these calendars are experimental.

Conclusion: implement option 2 for 1.3.

LGTM: @Manishearth @sffc @echeran (no strong opinion: @robertbastian, @skius)

Manishearth · 2023-10-19T20:20:04Z

Discussion between @sffc and me on whether we should use aux keys or regular keys for lengths. We didn't dive too deep into the hour cycle part since that is something that can be more easily tweaked later (whereas the lengths are pervasive).

The main benefit of using separate (regular) keys is that they enable more build-time slicing: if you know in advance what lengths you'll need, you can slice things appropriately. However, since most ways of interacting with this will be via skeletons or overall lengths, the layers of indirection make this harder to do. We could potentially design a highly typed API where datagen generates traits linking skeletons to keys, but this feels like overkill. It seems like the main win is only if the user can specify exactly what lengths they want.

Separate keys also have the advantage of being slightly smaller in databake (though not blob), because instead of storing a massive locale lookup array it can store a much smaller lookup array that is deduplicated across keys (especially if we choose to resolve length fallback during datagen).

On the other hand, aux keys are cleaner (we don't end up with hundreds of symbols keys) and easier to deal with. In the long run we can experiment with various horizontal fallback options (see discussion in #3867). There may also be options for optimization in the future by passing around binary search hints.

One major benefit of aux keys is that users can slice them out if they would like (we can do a very simple fallback algorithm in our code to handle this: if you don't find long, go check out medium, etc.).

We decided to go with aux keys for now. We may measure things later and see if there are other benefits.
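The "if you don't find long, go check out medium" fallback could be sketched as below. The `Length` enum, the fallback order, and `load_with_fallback` are assumptions made for illustration; the real order would need to be decided along with the horizontal fallback discussion in #3867.

```rust
use std::collections::HashMap;

/// Hypothetical lengths; the fallback order below is illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Length {
    Full,
    Long,
    Medium,
    Short,
}

impl Length {
    /// The next length to try when data for `self` has been sliced out.
    fn next_fallback(self) -> Option<Length> {
        match self {
            Length::Full => Some(Length::Long),
            Length::Long => Some(Length::Medium),
            Length::Medium => Some(Length::Short),
            Length::Short => None,
        }
    }
}

/// Walks the fallback chain until some length has data, and reports which
/// length was actually used alongside the data.
pub fn load_with_fallback<'a>(
    available: &'a HashMap<Length, &'a str>,
    requested: Length,
) -> Option<(Length, &'a str)> {
    let mut current = Some(requested);
    while let Some(length) = current {
        if let Some(data) = available.get(&length) {
            return Some((length, *data));
        }
        current = length.next_fallback();
    }
    None
}
```

Here `available` stands in for whatever aux keys survived slicing for a given locale; a request for a missing length returns the nearest length further down the chain, or `None` if nothing is left.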

Manishearth · 2023-10-19T20:46:06Z

Listing out aux keys for each thing:

(a/n/s/w = abbr/narrow/short/wide, f/s = format/standalone)

Months: a/n/w × f/s
- Special key "numeric". We probably want a separate key for this, can be done later.
Weekdays: a/n/s/w × f/s
Quarters: a/n/w × f/s
(cyclic) Days: a/n/w × f/s
- I don't actually see anyone customizing standalone days.
Eras: names / abbr / narrow. @sffc is there any reason we should not just call "names" "wide" instead?
- They're called "wide" in the symbol table

Given that standalone is the rarer one, I would recommend having key names be stuff like -x-a and -x-as (i.e. "format" is implicit). That keeps it short, and lets us easily add standalone keys in the future for stuff like days where we don't have any usage right now.

sffc · 2023-10-19T21:23:25Z

Thought: we could use a digit corresponding to the number of symbols in https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table, like:

-x-3 = abbreviated
-x-4 = wide
-x-5 = narrow

And standalone could be

-x-3s = abbreviated
-x-4s = wide
-x-5s = narrow

or maybe

-x-f3 = format abbreviated
-x-f4 = format wide
-x-f5 = format narrow
-x-s3 = standalone abbreviated
-x-s4 = standalone wide
-x-s5 = standalone narrow

Manishearth · 2023-10-19T21:30:53Z

Makes sense. My instinct is to let format be the "default" because in some cases there is no data for standalone and we can save space by hardcoding that assumption in ICU4X and datagen (but tweaking it in a backcompat way if it changes)
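Combining the two suggestions (a digit for the symbol count, with "format" as the implicit default and an `s` suffix for standalone), the naming scheme could be sketched like this. The enum and function names are hypothetical; "short" is omitted because the thread does not assign it a digit.

```rust
/// Lengths that have a digit assigned in the proposal above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Length {
    Abbreviated,
    Wide,
    Narrow,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    Format,
    Standalone,
}

/// Builds the aux key suffix: the digit is the number of pattern symbols
/// for that length, and only standalone gets an explicit marker.
pub fn aux_key(length: Length, context: Context) -> String {
    let digit = match length {
        Length::Abbreviated => '3',
        Length::Wide => '4',
        Length::Narrow => '5',
    };
    match context {
        Context::Format => format!("-x-{}", digit),
        Context::Standalone => format!("-x-{}s", digit),
    }
}
```

Because format is the default, a data set with no standalone data never needs the `s` variants to exist at all.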

Manishearth · 2023-10-21T02:11:25Z

The current design for DTF integration is that we load one of each type of field needed (one month symbol, etc).

If a pattern needs multiple fields, we can later add in a Map<Field, Box<dyn Any>> situation for storing extra fields.
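A minimal sketch of that `Map<Field, Box<dyn Any>>` idea, using only the standard library. `Field`, `MonthNames`, and `ExtraSymbols` are hypothetical names for illustration, not existing ICU4X types.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical field identifier used as the map key.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Field {
    Month,
    Weekday,
}

/// Hypothetical extra payload for a second month length.
pub struct MonthNames(pub Vec<String>);

/// Type-erased storage for symbols beyond the one-per-field fast path.
#[derive(Default)]
pub struct ExtraSymbols {
    map: HashMap<Field, Box<dyn Any>>,
}

impl ExtraSymbols {
    /// Stores an extra payload for `field`, replacing any previous one.
    pub fn insert<T: Any>(&mut self, field: Field, data: T) {
        self.map.insert(field, Box::new(data));
    }

    /// Returns the payload for `field` if it exists and has type `T`.
    pub fn get<T: Any>(&self, field: Field) -> Option<&T> {
        self.map.get(&field)?.downcast_ref::<T>()
    }
}
```

The common case (one length per field) stays in typed struct fields; only patterns that mix lengths, such as one containing both M and MMM, would pay for the map lookup and downcast.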

sffc · 2023-10-22T02:41:01Z

Some initial numbers of postcard with different fallback modes. Number in parentheses is the point in the postcard file at which the sorted locale lookup VarZeroVec ends and the data table begins.

| Key | Postcard, Runtime | Postcard, Hybrid | Postcard, Data Only |
|---|---|---|---|
| datetime/gregory/datesymbols@1 | 186558 (0x869) | 190722 (0x15b5) | 184405 |
| datetime/symbols/gregory/years@1 | 30101 (0x2058) | 49129 (0x60ac) | 21821 |
| datetime/symbols/gregory/months@1 | 105222 (0x4de1) | 141505 (0xc936) | 85285 |
| datetime/symbols/weekdays@1 | 76988 (0x5b47) | 129035 (0x10c42) | 53621 |

The sum of the data only size of the three split keys is 160727, which is smaller than the 184405 in the single combined key. However, since the split keys require more locale lookup tables, the overall size is a bit larger. We are investigating ways to reduce the size of the locale lookup tables (e.g. #2699).

Example command line to generate one cell in the table: cargo run --release --bin icu4x-datagen -- --format blob --locales full --keys "datetime/symbols/gregory/months@1" -f runtime-manual
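The arithmetic behind the comparison can be checked with a few lines of code. The numbers are copied from the runtime and data-only columns of the table above; the struct and function names are made up for this sketch. One observation the numbers support: for every row, the runtime size equals the data-only size plus the lookup-table end offset shown in parentheses.

```rust
/// One row of the size table (runtime fallback mode), in bytes.
pub struct KeySize {
    pub name: &'static str,
    pub runtime: u32,
    /// Offset at which the locale lookup table ends and the data begins.
    pub lookup_end: u32,
    pub data_only: u32,
}

pub const COMBINED: KeySize = KeySize {
    name: "datetime/gregory/datesymbols@1",
    runtime: 186558,
    lookup_end: 0x869,
    data_only: 184405,
};

pub const SPLIT: [KeySize; 3] = [
    KeySize { name: "datetime/symbols/gregory/years@1", runtime: 30101, lookup_end: 0x2058, data_only: 21821 },
    KeySize { name: "datetime/symbols/gregory/months@1", runtime: 105222, lookup_end: 0x4de1, data_only: 85285 },
    KeySize { name: "datetime/symbols/weekdays@1", runtime: 76988, lookup_end: 0x5b47, data_only: 53621 },
];

/// Total size including the locale lookup tables.
pub fn total_runtime(keys: &[KeySize]) -> u32 {
    keys.iter().map(|k| k.runtime).sum()
}

/// Total size of the data tables alone.
pub fn total_data_only(keys: &[KeySize]) -> u32 {
    keys.iter().map(|k| k.data_only).sum()
}
```

The split keys total 160727 bytes of data versus 184405 for the combined key, but 212311 bytes versus 186558 once the lookup tables are included, so the lookup tables account for the whole regression.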

sffc · 2023-10-22T03:03:52Z

Very initial estimates for the impact of ZeroTrie on the postcard locale lookup table size, based on the strings in the compiled_data files (not the same set of locales as in the previous post):

| Key | VZV, Runtime | ZT, Runtime |
|---|---|---|
| datetime/gregory/datesymbols@1 | 889 | 831 |
| datetime/symbols/gregory/years@1 | 3935 | 2730 |
| datetime/symbols/gregory/months@1 | 10223 | 5104 |
| datetime/symbols/weekdays@1 | 10595 | 5096 |

So the bigger the VZV the bigger the win, with about a 50% win for the larger ones. If we project these ratios back to the full data set above, we stand to save something on the order of 25 kB in the sum of the split keys data size, which would bring the total split key size (runtime fallback mode, including lookup tables) down to just about the same as the combined key size.

sffc · 2023-10-22T03:20:06Z

I missed something in #3865 (comment). The lookup table is not only a VZV of locale strings; it is also a FZV of a mapping from the VZV index to the data blob index. With ZeroTrie we do not need that extra index-to-index table. If you include the extra table, the total lookup table size is about 15-20% higher than estimated. This means we should be able to cut an additional 5 kB by moving to ZeroTrie.

sffc · 2023-10-22T05:59:26Z

I implemented a ZeroTrie version of BlobSchema in #4207. Results for Gregorian, runtime fallback, and all locales:

| Data Key | Postcard Size |
|---|---|
| datesymbols | 185248 |
| months | 90017 |
| weekdays | 58578 |
| years | 24893 |

The new keys are 173488 bytes total, now including locale lookup metadata, smaller than the combined key. 😃

Manishearth · 2023-10-24T04:28:48Z

Yeah that's a good point

Manishearth · 2023-10-31T22:45:45Z

Split out numeric symbols stuff in #4242

Manishearth · 2024-09-13T01:44:17Z

This is done in neo.

Manishearth added the C-datetime Component: datetime, calendars, time zones label Aug 15, 2023

Manishearth assigned sffc and Manishearth Aug 15, 2023

This was referenced Aug 24, 2023

Add data for cyclic year names #3761

Closed

Support leap months in datetime #3766

Closed

sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Oct 5, 2023

sffc added the S-large Size: A few weeks (larger feature, major refactoring) label Oct 5, 2023

Manishearth mentioned this issue Oct 17, 2023

Add experimental new symbols structs #4174

Merged

This was referenced Oct 19, 2023

Bikeshed: What should neo datetime placeholder markers be called #4186

Closed

Perform datagen for the new symbols keys #4192

Merged

split neo datetime patterns keys #4202

Merged

Manishearth mentioned this issue Oct 22, 2023

Datagen for neo patterns #4205

Merged

Manishearth mentioned this issue Oct 23, 2023

Add datagen (neo symbols only) for cyclic year names #4210

Merged

This was referenced Oct 24, 2023

Add initial TypedDateTimePatternInterpolator with neo symbols #4204

Merged

Add BlobSchema V2 with ZeroTrie #4207

Merged

This was referenced Oct 24, 2023

Move neo months over to a linear-only model #4217

Merged

Datagen for (neo) leap month patterns #4212

Closed

Datagen for leap month patterns #4222

Merged

sffc mentioned this issue Oct 25, 2023

Add impl From<DecimalError> for DateTimeError and use it #4224

Merged

This was referenced Oct 26, 2023

How should datagen deal with CLDR invariants? #4226

Closed

Include numeric overrides in neo datetime symbols #4228

Merged

This was referenced Oct 30, 2023

Derive Eq and Hash in datetime::options #4230

Merged

Reword "computationally heavy" language in docs #4237

Closed

Manishearth mentioned this issue Oct 31, 2023

Support numeric overrides in DateTimeFormat #4242

Open

sffc modified the milestones: 1.4 Blocking ⟨P1⟩, 1.5 Blocking ⟨P1⟩ Nov 14, 2023

This was referenced Nov 20, 2023

Handle multiple lengths of the same field in DateTimePatternInterpolator #4337

Open

Decide what to do with stability of CldrCalendar trait in 1.5 #4341

Closed

sffc added a commit that referenced this issue Nov 21, 2023

Add initial TypedDateTimePatternInterpolator with neo symbols (#4204)

1c0b95e

Part of #3865, #3347

sffc mentioned this issue Dec 6, 2023

Neo datetime: dynamic field loading based on pattern #4410

Merged

sffc added a commit that referenced this issue Dec 6, 2023

Neo datetime: dynamic field loading based on pattern (#4410)

9d74b08

Part of #3347, #3865

This was referenced Jan 31, 2024

Add NeoDateFormatter for AnyCalendar #4567

Merged

Finish matrix of NeoDateTime AnyCalendar support #4568

Merged

sffc added a commit that referenced this issue Feb 13, 2024

Finish matrix of NeoDateTime AnyCalendar support (#4568)

2eeda6b

#3865 Depends on #4567

Manishearth added this to icu4x 2.0 Feb 23, 2024

Manishearth moved this to Being worked on in icu4x 2.0 Feb 23, 2024

sffc modified the milestones: 1.5 Blocking ⟨P1⟩, 1.x Priority ⟨P2⟩ May 23, 2024

Manishearth closed this as completed Sep 13, 2024

github-project-automation bot moved this from Being worked on to Done in icu4x 2.0 Sep 13, 2024
