Migrate LocaleCanonicalizer data structs to zero-copy #1034

sffc · 2021-09-01T17:01:44Z

Main issue: #856

The LocaleCanonicalizer data structs currently rely on Vec. They should be migrated to ZeroVec, VarZeroVec, or ZeroMap.

https://github.com/unicode-org/icu4x/blob/main/components/locale_canonicalizer/src/provider.rs

I think in most cases ZeroMap serves the use case fairly well. (This would make this issue depend on #844)

For example, Vec<(TinyStr4, LanguageIdentifier)> could become ZeroMap<TinyStr4, LanguageIdentifier> (where LanguageIdentifier needs to support AsVarULE).
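
To make the goal concrete, here is a minimal std-only sketch of what a zero-copy lookup looks like. It is not the zerovec API: `BorrowedMap`, `RECORD_LEN`, and the fixed 4-byte key and value layout are hypothetical stand-ins, and the sketch assumes the records in the buffer are already sorted by key.

```rust
use std::cmp::Ordering;

/// Hypothetical record layout: 4-byte key followed by 4-byte value.
const RECORD_LEN: usize = 8;

/// Hypothetical stand-in for a zero-copy map. It only borrows the serialized
/// bytes; creating it neither parses nor allocates anything.
pub struct BorrowedMap<'data> {
    records: &'data [u8],
}

impl<'data> BorrowedMap<'data> {
    /// Checks only that the buffer holds a whole number of records.
    pub fn from_bytes(bytes: &'data [u8]) -> Option<Self> {
        if bytes.len() % RECORD_LEN == 0 {
            Some(Self { records: bytes })
        } else {
            None
        }
    }

    /// Binary search over the borrowed records (assumed sorted by key).
    /// The returned value is a slice pointing into the original buffer.
    pub fn get(&self, key: &[u8; 4]) -> Option<&'data [u8]> {
        let records = self.records;
        let (mut lo, mut hi) = (0, records.len() / RECORD_LEN);
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let rec = &records[mid * RECORD_LEN..(mid + 1) * RECORD_LEN];
            match rec[..4].cmp(&key[..]) {
                Ordering::Equal => return Some(&rec[4..]),
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
            }
        }
        None
    }
}
```

With a Vec-based struct, every entry has to be deserialized into owned values before the first lookup; in this sketch the only up-front work is a length check.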

Is there a reason you aren't already using LiteMap in your data structure?

CC @dminor @zbraniecki @Manishearth


dminor · 2021-09-01T19:04:28Z

I think LiteMap wasn't yet written at the time the likely subtags support was implemented, and I kept Vec for consistency while doing the alias part. I don't see any reason not to use ZeroMap once we're able to.

sffc · 2021-09-01T19:19:11Z

I have some questions about how you use LanguageIdentifier in LocaleCanonicalizer data.

LanguageIdentifier is often used as the "value" in the key/value maps in both the AliasesV1 and the LikelySubtagsV1 data structs. When I look in the data file, I see things like:

    [
      "arc",
      "Palm",
      "und-SY"
    ],
...
    [
      "zza",
      "und-Latn-TR"
    ]
...

    [
      "und-hepburn-heploc",
      "und-alalc97"
    ],

My questions are:

1. Would it be consistent with the required data model to represent some of these sets as, e.g., a (TinyStr4, TinyStr4) (for script+region) instead of a LanguageIdentifier?
2. For cases where that is not possible, would storing a string instead of a LanguageIdentifier be consistent with the required data model?

The reason I'm asking is that we would need to design and implement AsVarULE for LanguageIdentifier, which may be non-trivial and may come at a performance cost. If we could represent the data as Copy + 'static types instead, we have more options for optimization when we implement ZeroMap.
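
As a rough illustration of the Copy + 'static idea, the sketch below stores a value such as "und-Latn-TR" as a (script, region) pair of fixed-size ASCII strings. `Tiny4` is a hypothetical stand-in for TinyStr4, not the tinystr crate's API, and `script_region` is a hypothetical helper that only handles the exact und-script-region shape.

```rust
/// Hypothetical stand-in for TinyStr4: 1 to 4 ASCII alphanumeric bytes, zero-padded.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Tiny4([u8; 4]);

impl Tiny4 {
    pub fn new(s: &str) -> Option<Self> {
        let b = s.as_bytes();
        if b.is_empty() || b.len() > 4 || !b.iter().all(|c| c.is_ascii_alphanumeric()) {
            return None;
        }
        let mut out = [0u8; 4];
        out[..b.len()].copy_from_slice(b);
        Some(Self(out))
    }

    pub fn as_str(&self) -> &str {
        let len = self.0.iter().position(|&c| c == 0).unwrap_or(4);
        // The constructor only admits ASCII, so the fallback is never taken.
        std::str::from_utf8(&self.0[..len]).unwrap_or("")
    }
}

/// A (script, region) pair: fixed size, Copy, and free of borrowed data.
pub type ScriptRegion = (Tiny4, Tiny4);

/// Splits a string shaped like "und-Latn-TR" into (script, region).
/// Anything else (other languages, variants, missing subtags) is rejected.
pub fn script_region(s: &str) -> Option<ScriptRegion> {
    let mut parts = s.split('-');
    let (lang, script, region) = (parts.next()?, parts.next()?, parts.next()?);
    if lang != "und" || script.len() != 4 || parts.next().is_some() {
        return None;
    }
    Some((Tiny4::new(script)?, Tiny4::new(region)?))
}
```

In this sketch the pair occupies 8 bytes and needs no parsing when read, whereas a value such as "und-hepburn-heploc" does not fit the shape and would still need a different representation.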

dminor · 2021-09-02T12:42:22Z

This may have improved now, but the last time I tried breaking LanguageIdentifiers up into tuples of TinyStr4, it caused a substantial increase in the disk storage size for the data, in both JSON and bincode.

In general, the algorithms require being able to examine the parts of the LanguageIdentifier. I think a tuple representation would be fine for this, but I'm concerned that a string representation would be quite a bit slower.

This is performance-critical code, and I don't think we should accept a change that regresses performance. Zibi invested a lot of time in making the likely subtags code fast, and the canonicalization code is already substantially slower than the equivalent code in SpiderMonkey, so we don't want to make that even worse. There are benchmarks, so it might be possible to quickly test things out and see whether a change is going to have a big impact on performance.

sffc · 2021-09-05T20:24:39Z

Thanks for the info.

To be clear, the status quo is that all the LanguageIdentifier strings in the data bundle need to be parsed at data loading time. What we need to figure out is how to store the LanguageIdentifiers in ZeroVec. There are several approaches I can think of:

1. Use tuples of TinyStr instead of LanguageIdentifier
   - Pro: Easy
   - Con: Impact on data size?
2. Implement AsVarULE for LanguageIdentifier (keeps the in-file representation as a string, but performs lazy parsing when data is accessed)
   - Pro: Minimal impact on data size
   - Con: Likely to cause measurable performance regressions in the canonicalize function
3. Design a new Locale-like data structure optimized for storage and retrieval in data files (Design architecture around low-cost locale parsing and storage #958)
   - Pro: We can design exactly what we need
   - Con: Solution is not ready yet and is blocked on Design architecture around low-cost locale parsing and storage #958

Either way, whoever takes this bug will need to do some experimentation and come back with a recommended solution that enables zero-copy data while keeping good performance.

dminor · 2021-09-07T11:17:27Z

I was afraid you were proposing a design where we'd have to parse the LanguageIdentifier from a string every time we do a canonicalization or likely subtags operation. Parsing once at data loading time is not going to be a problem.

Manishearth · 2021-09-07T15:39:02Z

Note that solution 3 involves a little bit of parsing each time, but it should not be expensive.

zbraniecki · 2021-09-07T15:45:48Z

I'm intuitively in favor of [1].

sffc · 2021-10-19T17:23:53Z

@pandusonu2 will be working on this.

pandusonu2 · 2021-10-20T17:00:09Z

Steps discussed:

1. Vec<(K, V)> to LiteMap
2. LanguageIdentifier to custom tuple struct
3. LiteMap to ZeroMap
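
For readers unfamiliar with the first step, the sketch below shows the general idea of replacing an unordered Vec<(K, V)> with a map backed by a sorted Vec, so that lookups become binary searches. `SortedVecMap` is a hypothetical, std-only stand-in written for illustration; it is not the litemap crate's API.

```rust
/// Hypothetical stand-in for a LiteMap-like structure: key/value pairs kept in
/// a Vec sorted by key, so lookups are binary searches rather than linear scans.
#[derive(Debug)]
pub struct SortedVecMap<K, V> {
    entries: Vec<(K, V)>,
}

impl<K: Ord, V> SortedVecMap<K, V> {
    /// Builds the map from unsorted pairs; on duplicate keys the last pair wins.
    pub fn from_pairs(pairs: Vec<(K, V)>) -> Self {
        let mut map = Self {
            entries: Vec::with_capacity(pairs.len()),
        };
        for (key, value) in pairs {
            map.insert(key, value);
        }
        map
    }

    /// Inserts or replaces, returning the previous value if the key existed.
    pub fn insert(&mut self, key: K, value: V) -> Option<V> {
        match self.entries.binary_search_by(|(k, _)| k.cmp(&key)) {
            Ok(i) => Some(std::mem::replace(&mut self.entries[i].1, value)),
            Err(i) => {
                // `i` is the position that keeps the Vec sorted.
                self.entries.insert(i, (key, value));
                None
            }
        }
    }

    /// Looks a key up by binary search.
    pub fn get(&self, key: &K) -> Option<&V> {
        self.entries
            .binary_search_by(|(k, _)| k.cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}
```

Because the serialized form is still just a list of pairs, a step like this can change the in-memory lookup cost without changing what is stored.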

pandusonu2 · 2021-12-02T07:26:05Z

From what I understand, the next step is to convert

pub struct LanguageIdentifier {
    pub language: subtags::Language,
    pub script: Option<subtags::Script>,
    pub region: Option<subtags::Region>,
    pub variants: subtags::Variants,
}

to just use this tuple everywhere:

(subtags::Language, Option<Script>, Option<Region>, subtags::Variants)

Manishearth · 2021-12-14T16:44:48Z

@sffc okay, that makes sense.

I think (once this is discussed and approved), the steps for @pandusonu2 would be to first restructure LanguageIdentifier so that variants is a single String, and then write the ULE type (perhaps with help from me on getting the impl right, but it's a good chance to see how good our docs are!)

I'm fine with @pandusonu2 starting work on a prototype as well if you think that's a good move.

sffc · 2021-12-14T17:09:30Z

I don't think we should change LanguageIdentifier as part of this. We should re-parse the variants string to the Vec when converting. It's an edge case when you have more than 1 variant anyway.

Thinking about it, it might be fine to make the ULE be simply the BCP-47 string, and we just do a little simple parsing when accessing a field. We have to do that anyway when using the offset marker. The advantage is that we can have a well-defined as_str().

sffc · 2021-12-14T17:13:06Z

pub struct LanguageIdentifierStr([u8]);

impl LanguageIdentifierStr {
    pub fn as_str(&self) -> &str {
        // SAFETY: the invariants on LanguageIdentifierStr guarantee valid UTF-8.
        unsafe { core::str::from_utf8_unchecked(&self.0) }
    }
}

Manishearth · 2021-12-14T18:13:42Z

I think we can just wrap it around str even, and have the wrapper just guarantee that it's been validated to be in the right format for a variants string

sffc · 2021-12-14T19:06:50Z

Whether we use str or [u8], the ULE impl should check the string for syntax validity.
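
A minimal sketch of the "validate once, parse a little on access" idea follows. It is not the ULE implementation under discussion: `LangIdStr` is a hypothetical borrowed wrapper (so no unsafe code is needed), and the grammar it checks is a deliberately simplified language[-script][-region] subset of BCP-47 that rejects variants.

```rust
/// Hypothetical wrapper around a string that has already been validated
/// against a simplified language[-script][-region] grammar.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LangIdStr<'a>(&'a str);

fn is_language(s: &str) -> bool {
    (2..=3).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_lowercase())
}

fn is_script(s: &str) -> bool {
    s.len() == 4 && s.bytes().all(|b| b.is_ascii_alphabetic())
}

fn is_region(s: &str) -> bool {
    (s.len() == 2 && s.bytes().all(|b| b.is_ascii_uppercase()))
        || (s.len() == 3 && s.bytes().all(|b| b.is_ascii_digit()))
}

impl<'a> LangIdStr<'a> {
    /// Checks syntax once, up front, so accessors can rely on the string's shape.
    pub fn try_new(s: &'a str) -> Option<Self> {
        let mut parts = s.split('-');
        if !is_language(parts.next()?) {
            return None;
        }
        let mut next = parts.next();
        if next.map_or(false, is_script) {
            next = parts.next();
        }
        if next.map_or(false, is_region) {
            next = parts.next();
        }
        // Anything left over (variants, malformed subtags) is rejected.
        if next.is_some() {
            None
        } else {
            Some(Self(s))
        }
    }

    /// The validated string, unchanged.
    pub fn as_str(&self) -> &'a str {
        self.0
    }

    /// Field accessors re-split the string on each call instead of storing offsets.
    pub fn language(&self) -> &'a str {
        self.0.split('-').next().unwrap_or("")
    }

    pub fn script(&self) -> Option<&'a str> {
        self.0.split('-').skip(1).find(|s| is_script(s))
    }

    pub fn region(&self) -> Option<&'a str> {
        self.0.split('-').skip(1).find(|s| is_region(s))
    }
}
```

The accessors stay cheap only because validation has already ruled out malformed input; the cost that remains is one pass over a short string per field access.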

sffc · 2021-12-23T19:15:48Z

Discussion:

@zbraniecki / @dminor: In LocaleCanonicalizer, we don't carry variants for any of the LanguageIdentifiers.
@zbraniecki - The left side of the map is the one that's performance-critical. The right side is less critical.
@zbraniecki - I think we should remove all LanguageIdentifier from all locale canonicalizer and replace them with tuples of TinyStr.
@sffc - I think that's reasonable and in some cases it could reduce data size.

Conclusion:

@pandusonu2 will migrate LikelySubtagsV1 to have only TinyStr tuples
@zbraniecki will continue work on AliasesV1

pandusonu2 · 2021-12-27T07:22:25Z

@sffc, @zbraniecki, a question:

Should und be replaced with tuple of TinyStr as well?
Also, if so, how should #[derive(Default)] be implemented for tuples here?

zbraniecki · 2021-12-28T06:18:52Z

> Should und be replaced with tuple of TinyStr as well?

I'd like to maintain the Option<TinyStr4> for it. Is there a reason not to?

pandusonu2 · 2021-12-28T06:27:36Z

> > Should und be replaced with tuple of TinyStr as well?
>
> I'd like to maintain the Option<TinyStr4> for it. Is there a reason not to?

I was just not sure how #[derive(Default)] works, or whether a plain TinyStr4 would suffice. I will update it with Option<TinyStr4> for now, and we can discuss on the PR whether that's good enough.
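
On the #[derive(Default)] question: the derive works field by field, and the standard library already implements Default for Option (None), for Vec (empty), and for tuples whose elements all implement Default. A small sketch, assuming a hypothetical `Tiny4` stand-in for TinyStr4 and a hypothetical `LikelySubtagsSketch` struct that is not the real LikelySubtagsV1 layout:

```rust
/// Hypothetical stand-in for TinyStr4, present only to keep the example self-contained.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Tiny4([u8; 4]);

/// Hypothetical data struct using tuples and Options of fixed-size strings.
#[derive(Debug, Default, PartialEq)]
pub struct LikelySubtagsSketch {
    /// (language, (script, region)) pairs; defaults to an empty Vec.
    pub language: Vec<(Tiny4, (Tiny4, Tiny4))>,
    /// Entry for `und`; each part defaults to None.
    pub und: (Option<Tiny4>, Option<Tiny4>, Option<Tiny4>),
}

/// Returns the derived default, to show that no manual impl is required.
pub fn empty_sketch() -> LikelySubtagsSketch {
    LikelySubtagsSketch::default()
}
```

So a field of type Option<TinyStr4>, or a tuple of them, needs no hand-written Default impl, provided every element type implements Default itself.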

sffc · 2022-01-27T19:23:48Z

Blocked on TinyStr migration.

pandusonu2 · 2022-03-03T17:39:29Z

Maybe blocked on #831

sffc added T-core Type: Required functionality C-locale Component: Locale identifiers, BCP47 S-small Size: One afternoon (small bug fix or enhancement) A-data Area: Data coverage or quality labels Sep 1, 2021

sffc mentioned this issue Sep 1, 2021

Migrate remaining data structs to fully borrowed #856

Closed

sffc added S-medium Size: Less than a week (larger bug fix or enhancement) and removed S-small Size: One afternoon (small bug fix or enhancement) labels Sep 5, 2021

sffc added good first issue Good for newcomers help wanted Issue needs an assignee labels Sep 30, 2021

sffc added this to the ICU4X 0.4 milestone Sep 30, 2021

sffc assigned sapriyag Oct 19, 2021

sffc removed the help wanted Issue needs an assignee label Oct 19, 2021

sffc modified the milestones: ICU4X 0.4, 2021 Q4 0.5 Sprint A Oct 21, 2021

sffc modified the milestones: 2021 Q4 0.5 Sprint A, 2021 Q4 0.5 Sprint B Nov 5, 2021

pandusonu2 self-assigned this Nov 11, 2021

sffc modified the milestones: 2021 Q4 0.5 Sprint B, 2021 Q4 0.5 Sprint C Nov 18, 2021

pandusonu2 mentioned this issue Nov 24, 2021

Replace Vec with LiteMap in locale canonicalizer #1275

Merged

sffc modified the milestones: 2021 Q4 0.5 Sprint D, 2021 Q4 0.5 Sprint E Dec 23, 2021

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Dec 23, 2021

pandusonu2 mentioned this issue Dec 29, 2021

Remove LanguageIdentifier from locale canonicalizer #1450

Closed

sffc mentioned this issue Jan 27, 2022

Design architecture around low-cost locale parsing and storage #958

Closed

sffc modified the milestones: 2021 Q4 0.5 Sprint E, ICU4X 0.6 Jan 27, 2022

sffc added the blocked A dependency must be resolved before this is actionable label Jan 27, 2022

robertbastian removed the blocked A dependency must be resolved before this is actionable label Feb 23, 2022

sffc mentioned this issue Mar 10, 2022

Using TinyAsciiStr for locale_canonicalizer and locid #1683

Merged

sffc assigned robertbastian and unassigned sapriyag Mar 31, 2022

sffc modified the milestones: ICU4X 0.6, 2022 Q2 0.6 Sprint F Mar 31, 2022

robertbastian mentioned this issue Mar 31, 2022

Optimize data size for likely subtags #1488

Closed

robertbastian removed the good first issue Good for newcomers label Apr 1, 2022

This was referenced Apr 1, 2022

ZeroCopy likely subtags #1760

Merged

ZeroCopy aliases #1777

Merged

robertbastian closed this as completed in #1777 Apr 8, 2022
