Optimize data size for likely subtags #1488

Closed
sffc opened this issue Jan 9, 2022 · 4 comments
Labels: A-data (Area: Data coverage or quality); C-locale (Component: Locale identifiers, BCP47); help wanted (Issue needs an assignee); R-obsolete (Resolution: This issue is no longer relevant); S-medium (Size: Less than a week, larger bug fix or enhancement); T-core (Type: Required functionality)

sffc commented Jan 9, 2022

As discussed in #1462, I would like to consider making likely subtags a central feature of locale resolution when performing vertical fallback in data loading.

However, the current 40 KB size for the likely subtags data is a rather heavy dependency to take on. We should investigate:

  1. Enabling the data to be sliced by language/region
  2. Reducing the absolute size of the data by, e.g., removing duplicated subtags between input and output, and by removing "und" in places where it does not matter
  3. Adding a mode that does not add the script subtag when the script is the default for the language (e.g., treating "zh-CN" as already maximized, rather than "zh-Hans-CN"), which may let us shrink some mappings
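
To make idea (2) concrete, here is a minimal sketch of storing only the output subtags that the input does not already supply. It is an illustration, not how ICU4X stores or plans to store this data: the helper names `split`, `compress`, and `expand` are hypothetical, and tags are assumed to have the simple `language[-Script][-REGION]` shape.

```python
def split(tag):
    """Split 'language[-Script][-REGION]' into (language, script, region)."""
    parts = tag.split("-")
    language, script, region = parts[0], "", ""
    for part in parts[1:]:
        # Assumption: a four-letter subtag is a script, anything else a region.
        if len(part) == 4 and part.isalpha():
            script = part
        else:
            region = part
    return language, script, region


def compress(src, dst):
    """Keep only the output fields that differ from the input fields."""
    return tuple("" if a == b else b for a, b in zip(split(src), split(dst)))


def expand(src, delta):
    """Rebuild the full output tag from the input tag and the stored delta."""
    fields = (b or a for a, b in zip(split(src), delta))
    return "-".join(f for f in fields if f)


# "zh" -> "zh-Hans-CN" only needs the script and region stored.
print(compress("zh", "zh-Hans-CN"))
# "und-CN" -> "zh-Hans-CN" only needs the language and script stored.
print(compress("und-CN", "zh-Hans-CN"))
print(expand("und-CN", compress("und-CN", "zh-Hans-CN")))
```

Whether this saves space in practice depends on the serialization format, which this sketch does not model.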

CC @zbraniecki

sffc added the T-core, C-locale, S-medium, and A-data labels on Jan 9, 2022

mihnita commented Jan 11, 2022

> if the script is the default for the language

Yes, but then you need that info somewhere :-)
And it might not save that much.

Here are all entries that map from / to zh:

<likelySubtag from="zh" to="zh_Hans_CN"/>
<likelySubtag from="und_030" to="zh_Hans_CN"/>
<likelySubtag from="und_142" to="zh_Hans_CN"/>
<likelySubtag from="und_CN" to="zh_Hans_CN"/>
<likelySubtag from="und_Hans" to="zh_Hans_CN"/>

<likelySubtag from="zh_Bopo" to="zh_Bopo_TW"/>
<likelySubtag from="zh_Hanb" to="zh_Hanb_TW"/>
<likelySubtag from="und_Bopo" to="zh_Bopo_TW"/>
<likelySubtag from="und_Hanb" to="zh_Hanb_TW"/>
<likelySubtag from="und_Hani" to="zh_Hani_CN"/>

<likelySubtag from="zh_AU" to="zh_Hant_AU"/>
<likelySubtag from="zh_BN" to="zh_Hant_BN"/>
<likelySubtag from="zh_GB" to="zh_Hant_GB"/>
<likelySubtag from="zh_GF" to="zh_Hant_GF"/>
<likelySubtag from="zh_HK" to="zh_Hant_HK"/>
<likelySubtag from="zh_ID" to="zh_Hant_ID"/>
<likelySubtag from="zh_MO" to="zh_Hant_MO"/>
<likelySubtag from="zh_PA" to="zh_Hant_PA"/>
<likelySubtag from="zh_PF" to="zh_Hant_PF"/>
<likelySubtag from="zh_PH" to="zh_Hant_PH"/>
<likelySubtag from="zh_SR" to="zh_Hant_SR"/>
<likelySubtag from="zh_TH" to="zh_Hant_TH"/>
<likelySubtag from="zh_TW" to="zh_Hant_TW"/>
<likelySubtag from="zh_US" to="zh_Hant_US"/>
<likelySubtag from="zh_VN" to="zh_Hant_VN"/>
<likelySubtag from="zh_Hant" to="zh_Hant_TW"/>
<likelySubtag from="und_HK" to="zh_Hant_HK"/>
<likelySubtag from="und_MO" to="zh_Hant_MO"/>
<likelySubtag from="und_TW" to="zh_Hant_TW"/>
<likelySubtag from="und_Hant" to="zh_Hant_TW"/>

So we would save more space if we drop Hant, not Hans.
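
The comparison can be checked with a short script. This is a sketch under one assumed reading of "drop": an entry is counted as saved when removing the assumed-default script from its output yields its input, so the mapping becomes an identity. The function name `redundant_if_default` is hypothetical; the entries are the ones listed above.

```python
import re

# Entries copied from the listing above.
ENTRIES_XML = """
<likelySubtag from="zh" to="zh_Hans_CN"/>
<likelySubtag from="und_030" to="zh_Hans_CN"/>
<likelySubtag from="und_142" to="zh_Hans_CN"/>
<likelySubtag from="und_CN" to="zh_Hans_CN"/>
<likelySubtag from="und_Hans" to="zh_Hans_CN"/>
<likelySubtag from="zh_Bopo" to="zh_Bopo_TW"/>
<likelySubtag from="zh_Hanb" to="zh_Hanb_TW"/>
<likelySubtag from="und_Bopo" to="zh_Bopo_TW"/>
<likelySubtag from="und_Hanb" to="zh_Hanb_TW"/>
<likelySubtag from="und_Hani" to="zh_Hani_CN"/>
<likelySubtag from="zh_AU" to="zh_Hant_AU"/>
<likelySubtag from="zh_BN" to="zh_Hant_BN"/>
<likelySubtag from="zh_GB" to="zh_Hant_GB"/>
<likelySubtag from="zh_GF" to="zh_Hant_GF"/>
<likelySubtag from="zh_HK" to="zh_Hant_HK"/>
<likelySubtag from="zh_ID" to="zh_Hant_ID"/>
<likelySubtag from="zh_MO" to="zh_Hant_MO"/>
<likelySubtag from="zh_PA" to="zh_Hant_PA"/>
<likelySubtag from="zh_PF" to="zh_Hant_PF"/>
<likelySubtag from="zh_PH" to="zh_Hant_PH"/>
<likelySubtag from="zh_SR" to="zh_Hant_SR"/>
<likelySubtag from="zh_TH" to="zh_Hant_TH"/>
<likelySubtag from="zh_TW" to="zh_Hant_TW"/>
<likelySubtag from="zh_US" to="zh_Hant_US"/>
<likelySubtag from="zh_VN" to="zh_Hant_VN"/>
<likelySubtag from="zh_Hant" to="zh_Hant_TW"/>
<likelySubtag from="und_HK" to="zh_Hant_HK"/>
<likelySubtag from="und_MO" to="zh_Hant_MO"/>
<likelySubtag from="und_TW" to="zh_Hant_TW"/>
<likelySubtag from="und_Hant" to="zh_Hant_TW"/>
"""

# Extract (from, to) pairs from the XML attributes.
entries = re.findall(r'from="([^"]+)" to="([^"]+)"', ENTRIES_XML)


def redundant_if_default(entries, default_script):
    """Entries that become identity mappings once default_script is implied."""
    saved = []
    for src, dst in entries:
        without_script = "_".join(s for s in dst.split("_") if s != default_script)
        if without_script == src:
            saved.append((src, dst))
    return saved


for script in ("Hans", "Hant"):
    print(script, len(redundant_if_default(entries, script)))
# Prints:
# Hans 0
# Hant 15
```

Under this reading, treating Hant as the implied script removes the 15 `zh_<region>` entries, while treating Hans as implied removes none of the entries listed here.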

robertbastian commented

I'll do (2) as part of #1034

sffc commented Apr 28, 2022

The low-hanging fruit is done. The rest of this issue will require some deeper design.

sffc added this to the Backlog milestone on Dec 22, 2022
sffc removed the backlog label on Dec 22, 2022
sffc self-assigned this on Oct 22, 2023

sffc commented Oct 22, 2023

I don't think there is much left to do here. Our likely subtags data is now very small, especially with the core and extended slicing.

sffc closed this as completed on Oct 22, 2023
sffc added the R-obsolete (Resolution: This issue is no longer relevant) label on Oct 22, 2023