Split DateSymbols data #3865
Comments
@sffc and I discussed this a bunch in the context of fixing #3766 and #3761. Both involve adding more data to datetime anyway, and we don't want to do a V2 for that without doing it right. The rough proposal we had was that we have the following main symbols keys:
The symbols keys use an auxiliary key (#3632) to store the eight-way length distinction (abbreviated, narrow, short, wide) × (format, standalone).
The current fallbacking between them will be performed either at datagen or via carefully done auxiliary key fallback (essentially, ensure that Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found.
We do not store leap year patterns for calendars that generate leap year names in a pattern based way (Chinese, Dangi).
For the rare pattern that needs multiple lengths to format something, we can store additional loaded data in an Option on the DateTimeFormat.
Finally, cc @eggrobin who has thought about this a bit in the context of skeleta.
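To make the "attempt to load it during construction but do not error if it is not found" idea concrete, here is a minimal sketch in plain Rust. It does not use the ICU4X provider API: `LoadError`, `MonthFormatter`, the closure-based `load` parameter, and the `"numeric"` key string are all hypothetical names chosen for illustration.

```rust
/// Hypothetical error type for this sketch (not an ICU4X type).
#[derive(Debug, PartialEq)]
pub enum LoadError {
    /// The requested auxiliary key has no data.
    Missing,
    /// Any other failure, which should still abort construction.
    Other(String),
}

/// Hypothetical formatter that holds one required set of names and,
/// optionally, the extra numeric data.
pub struct MonthFormatter {
    pub names: Vec<String>,
    pub numeric: Option<Vec<String>>,
}

impl MonthFormatter {
    /// `load` stands in for a data provider: it maps an auxiliary key
    /// string to its payload.
    pub fn try_new<F>(load: F, length_key: &str) -> Result<Self, LoadError>
    where
        F: Fn(&str) -> Result<Vec<String>, LoadError>,
    {
        // The requested length is required, so any error propagates.
        let names = load(length_key)?;
        // The numeric data is optional: a missing key becomes `None`,
        // while any other error is still reported to the caller.
        let numeric = match load("numeric") {
            Ok(data) => Some(data),
            Err(LoadError::Missing) => None,
            Err(other) => return Err(other),
        };
        Ok(Self { names, numeric })
    }
}
```

With a provider that only knows a `"wide"` key, `MonthFormatter::try_new(provider, "wide")` succeeds with `numeric == None`, whereas requesting a length that is absent fails with `LoadError::Missing`.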
Discussed a bit
Conclusion: implement option 2 for 1.3. LGTM: @Manishearth @sffc @echeran (no strong opinion: @robertbastian, @skius)
Discussion between @sffc and me on whether we should use aux keys or regular keys for lengths. We didn't dive too deep into the hour cycle part since that is something that can be more easily tweaked later (whereas the lengths are pervasive).
The main benefit of using separate (regular) keys is that they enable more build-time slicing: if you know in advance what lengths you'll need, you can slice things appropriately. However, since most ways of interacting with this will be via skeletons or overall lengths, the layers of indirection make this less easy to do. We could potentially design a highly typed API where datagen generates traits linking skeletons to keys, but this feels like overkill. It seems like the main win is only if the user can specify exactly what lengths they want.
Separate keys also have the advantage of being slightly smaller in databake (though not blob), because instead of storing a massive locale lookup array, databake can store a much smaller lookup array that is deduplicated across keys (especially if we choose to resolve length fallback during datagen).
On the other hand, aux keys are cleaner (we don't end up with hundreds of symbols keys) and easier to deal with. In the long run we can experiment with various horizontal fallback options (see discussion in #3867). There may also be options for optimization in the future by passing around binary search hints. One major benefit is that users can slice out aux keys if they would like (we can do a very simple fallback algorithm in our code to handle this: if you don't find long, go check out medium, etc.).
We decided to go with aux keys for now. We may measure things later and see if there are other benefits.
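The "very simple fallback algorithm" for sliced-out aux keys could look roughly like the sketch below. This is plain Rust over a `HashMap`, not the ICU4X provider machinery; the `Length` enum, the `ORDER` chain, and `load_with_fallback` are hypothetical, and the chain order is only illustrative (it is not claimed to match CLDR inheritance).

```rust
use std::collections::HashMap;

/// Hypothetical length tags for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Length {
    Narrow,
    Short,
    Abbreviated,
    Wide,
}

/// Illustrative fallback order only: start at the requested length
/// and walk toward `Wide`.
const ORDER: [Length; 4] = [
    Length::Narrow,
    Length::Short,
    Length::Abbreviated,
    Length::Wide,
];

/// Returns the first length at or after `requested` in `ORDER` that is
/// present in `store`, together with its names.
pub fn load_with_fallback<'a>(
    store: &'a HashMap<Length, Vec<&'static str>>,
    requested: Length,
) -> Option<(Length, &'a [&'static str])> {
    let start = ORDER.iter().position(|l| *l == requested)?;
    ORDER[start..]
        .iter()
        .find_map(|len| store.get(len).map(|names| (*len, names.as_slice())))
}
```

If a user slices the data down to only `Abbreviated` and `Wide`, a request for `Short` resolves to the `Abbreviated` names, and a request against an empty store returns `None` so the caller can decide how to report it.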
Listing out aux keys for each thing: (a/n/s/w = abbr/narrow/short/wide, f/s = format/standalone)
Given that standalone is the more rare one I would recommend having key names be stuff like |
Thought: we could use a digit corresponding to the number of symbols in https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table, like:
And standalone could be
or maybe
Makes sense. My instinct is to let format be the "default" because in some cases there is no data for standalone, and we can save space by hardcoding that assumption in ICU4X and datagen (tweaking it in a backward-compatible way if that changes).
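A minimal sketch of treating format as the default, assuming a per-field struct where the standalone names are simply omitted when no separate data exists. `Context` and `FieldNames` are hypothetical names for illustration, not ICU4X types.

```rust
/// Hypothetical context tag for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Context {
    Format,
    Standalone,
}

/// Hypothetical per-field storage: format names are always present,
/// standalone names only when they differ or exist at all.
pub struct FieldNames {
    pub format: Vec<String>,
    pub standalone: Option<Vec<String>>,
}

impl FieldNames {
    /// Standalone requests fall back to the format names when no
    /// standalone data was stored.
    pub fn get(&self, context: Context) -> &[String] {
        match context {
            Context::Format => self.format.as_slice(),
            Context::Standalone => self
                .standalone
                .as_deref()
                .unwrap_or(self.format.as_slice()),
        }
    }
}
```

Under this assumption, datagen would only need to emit a standalone entry when one exists, and the lookup cost of the fallback is a single `Option` check.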
The current design for DTF integration is that we load one of each type of field needed (one month symbol, etc). If a pattern needs multiple fields, we can later add in a |
Some initial postcard size numbers with different fallback modes. The number in parentheses is the offset in the postcard file at which the sorted locale lookup VarZeroVec ends and the data table begins.
The sum of the data only size of the three split keys is 160727, which is smaller than the 184405 in the single combined key. However, since the split keys require more locale lookup tables, the overall size is a bit larger. We are investigating ways to reduce the size of the locale lookup tables (e.g. #2699). Example command line to generate one cell in the table: |
Very initial estimates for the impact of ZeroTrie on the postcard locale lookup table size, based on the strings in the compiled_data files (not the same set of locales as in the previous post):
So the bigger the VZV the bigger the win, with about a 50% win for the larger ones. If we project these ratios back to the full data set above, we stand to save something on the order of 25 kB in the sum of the split keys data size, which would bring the total split key size (runtime fallback mode, including lookup tables) down to just about the same as the combined key size.
I missed something in #3865 (comment). The lookup table is not only a VZV of locale strings; it is also a FZV of a mapping from the VZV index to the data blob index. With ZeroTrie we do not need that extra index-to-index table. If you include the extra table, the total lookup table size is about 15-20% higher than estimated. This means we should be able to cut an additional 5 kB by moving to ZeroTrie.
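For readers unfamiliar with the two-table layout described here, the following is a rough model in plain Rust using `Vec` instead of the actual zerovec types. `LocaleLookup` and its field names are hypothetical, and the real BlobSchema layout may differ in detail; the point is only to show why a second index-to-index table exists alongside the sorted locale strings.

```rust
/// Hypothetical model of the lookup: a sorted list of locale strings
/// (standing in for the VZV) plus a parallel table (standing in for the
/// FZV) that maps each position to an index into the payload table.
pub struct LocaleLookup<'a> {
    /// Locale strings, sorted bytewise so binary search works.
    pub locales: Vec<&'a str>,
    /// `blob_index[i]` is the payload index for `locales[i]`.
    pub blob_index: Vec<u32>,
    /// The payloads themselves; several locales may share one entry.
    pub payloads: Vec<&'a [u8]>,
}

impl<'a> LocaleLookup<'a> {
    /// Two-step lookup: locale string -> position -> payload index.
    pub fn get(&self, locale: &str) -> Option<&'a [u8]> {
        let pos = self.locales.binary_search(&locale).ok()?;
        let idx = *self.blob_index.get(pos)? as usize;
        self.payloads.get(idx).copied()
    }
}
```

A trie that stores the payload index as the value of each locale key folds both steps into one structure, which is why the extra table's bytes can be dropped.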
I implemented a ZeroTrie version of BlobSchema in #4207. Results for Gregorian, runtime fallback, and all locales:
The new keys are 173488 bytes total, now including locale lookup metadata, smaller than the combined key. 😃
Yeah that's a good point
Split out numeric symbols stuff in #4242
This is done in neo.
DateSymbols is giant and has a lot of things inside it, only a fraction of which actually gets used once a formatter has been constructed.
We should split this type along day/month/year lines, as well as along pattern length lines. (And provide a compatibility path for pre-2.0 V1 data, as usual.)