-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate LocaleCanonicalizer data structs to zero-copy #1034
Comments
I think |
I have some questions about how you use LanguageIdentifier is often used as the "value" in the key/value maps in both the AliasesV1 and the LikelySubtagsV1 data structs. When I look in the data file, I see things like:
My questions are:
The reason I'm asking is that we would need to design and implement |
This may have improved now, but the last time I tried breaking LanguageIdentifiers up into tuples of TinyStr4, it caused a substantial increase in the disk storage size for the data, both json and bincode. In general, the algorithms require being able to examine the parts of the LanguageIdentifier. I think a tuple representation would be fine for this, but I'm concerned that a string representation would be quite a bit slower. This is performance critical code, I don't think we should accept a change that regresses performance. Zibi invested a lot of time in making the likely subtags code fast and the canonicalization code is already substantially slower than the equivalent code in SpiderMonkey, so we don't want to make that even worse. There are benchmarks, it might be possible to quickly test things out and see if it going to have a big impact on performance or not. |
Thanks for the info. To be clear, the status quo is that all the LanguageIdentifier strings in the data bundle need to be parsed at data loading time. What we need to figure out is how to store the LanguageIdentifiers in ZeroVec. There are several approaches I can think of:
Either way, whoever takes this bug will need to do some experimentation and come back with a recommended solution that enables zero-copy data while keeping good performance. |
I was afraid you were proposing a design where we'd have to parse the LanguageIdentifier from a string every time we do a canonicalization or likely subtags operation. Parsing once at data loading time is not going to be a problem. |
Note that solution 3 involves a little bit of parsing each time, but it should not be expensive |
I'm intuitively in favor of [1]. |
@pandusonu2 will be working on this. |
Steps discussed:
|
From what I understand, the next step is to convert
to just use everywhere:
|
@sffc okay, that makes sense. I think (once this is discussed and approved), the steps for @pandusonu2 would be to first restructure I'm fine with @pandusonu2 starting work on a prototype as well if you think that's a good move. |
I don't think we should change LanguageIdentifier as part of this. We should re-parse the variants string to the Vec when converting. It's an edge case when you have more than 1 variant anyway. Thinking of it, it might be fine to make the ULE be simply the BCP-47 string, and we just do a little simple parsing when accessing a field. We have to do that anyway when using the offset marker. The advantage is that we can have a well-defined as_str(). |
|
I think we can just wrap it around |
Whether we use |
Discussion:
Conclusion:
|
@sffc, @zbraniecki doubt: Should |
I'd like to maintain the |
I was just not sure how #[derive(Default)] works, and that if TinyStr4 would suffice, will update with |
Blocked on TinyStr migration. |
Maybe blocked on #831 |
Main issue: #856
The LocaleCanonicalizer data structs currently rely on
Vec
. They should be migrated toZeroVec
,VarZeroVec
, orZeroMap
.https://github.com/unicode-org/icu4x/blob/main/components/locale_canonicalizer/src/provider.rs
I think in most cases
ZeroMap
serves the use case fairly well. (This would make this issue depend on #844)For example,
Vec<(TinyStr4, LanguageIdentifier)>
could becomeZeroMap<TinyStr4, LanguageIdentifier>
(whereLanguageIdentifier
needs to supportAsVarULE
).Is there a reason you aren't already using
LiteMap
in your data structure?CC @dminor @zbraniecki @Manishearth
The text was updated successfully, but these errors were encountered: