UnicodeSet implementation & API #91
Comments
@kpozin maybe move your crate into ICU4X as UnicodeSet with some changes (add/remove APIs, immutability)? We can extend it over time with unions and intersections, and do more optimization...
I can do that. Here's the external source link, by the way: https://fuchsia.googlesource.com/fuchsia/+/refs/heads/master/src/lib/intl/unicode_utils/char_collection/
You can have private fields, yes. We should mostly be using private fields.
Either works. If they're in one crate, we might want to add features? One solution is to have L3 be a feature because it pulls in additional dependencies.
wrt. |
I've shared this in meetings, but I'll post it here for posterity: I think having a clear delineation between L1, L2, and L3 is a key part of this design.

L1 UnicodeSet
Based on my experience, I think it's safe to say that L1 is by far the most common use case for runtime code. Clients can build sets offline and load/evaluate them at runtime, without needing to ship the builder code with their app. The most common operation is to check whether a character has a certain Unicode binary property, and we can ship pre-built data for that with ICU4X. Therefore, I think the most important problem to solve with ICU4X UnicodeSet is a serialized data format for the UnicodeSet and small code to load and evaluate it.

L2 UnicodeSet
I think it is useful for ICU4X to have a builder, but it's not as critical as L1. The builder should output the serialized format that can be saved to a data file, so that clients can save it and ship it with their app. Eventually (long-term goal), we could wire up a proc macro to build the structure offline during [...]. In the meantime, we should make sure ICU4C can emit the same serialized format, so that we can fully leverage the existing machinery for building ICU4X data files.

L3 UnicodeSet
As others have stated, this is a longer-term goal. I think ICU4X will likely want to have this, but it doesn't need to ship in version 1.

Side note: I think L3 UnicodeSet need not depend on L2 UnicodeSet. If we parse the pattern string, we could emit a structure of L1 UnicodeSets from data matching the pattern string, which might be smaller in code size than shipping L2 UnicodeSet at runtime.
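To make the L1 idea above concrete, here is a minimal sketch of "small code to load and evaluate" a pre-built set. It assumes the serialized form is an inversion list: a sorted slice of code point boundaries `[start0, end0, start1, end1, ...]` with half-open ranges. That layout, the type name `UnicodeSetRef`, and the constructor `from_inversion_list` are all hypothetical and are not the format or API agreed on in this thread.

```rust
/// Hypothetical read-only ("L1") set that borrows pre-built data.
/// Assumed layout: [start0, end0, start1, end1, ...], half-open ranges,
/// strictly increasing, no boundary above 0x110000.
pub struct UnicodeSetRef<'a> {
    inv_list: &'a [u32],
}

impl<'a> UnicodeSetRef<'a> {
    /// Validates the borrowed data; returns `None` if it is malformed.
    pub fn from_inversion_list(inv_list: &'a [u32]) -> Option<Self> {
        let even = inv_list.len() % 2 == 0;
        let strictly_increasing = inv_list.windows(2).all(|w| w[0] < w[1]);
        let in_range = inv_list.last().map_or(true, |&end| end <= 0x11_0000);
        if even && strictly_increasing && in_range {
            Some(Self { inv_list })
        } else {
            None
        }
    }

    /// Membership test by binary search over the boundaries.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        // Number of boundaries <= cp; an odd count means cp lies inside a range.
        let idx = self.inv_list.partition_point(|&b| b <= cp);
        idx % 2 == 1
    }
}
```

With data such as `[0x41, 0x5B, 0x61, 0x7B]` (ASCII letters), `contains('A')` is true and `contains('[')` is false. No builder code is needed at runtime; the slice could come straight from a data file.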
The L2 Builder API as discussed in #91. Provides:
- UnicodeSetBuilder::new()
- UnicodeSetBuilder::build() -> UnicodeSet
- add_char, add_range, add_set
- remove_char, remove_range, remove_set
- retain_char, retain_range, retain_set
- complement, complement_char, complement_range, complement_set
- Cargo Docs
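As a rough illustration of how a few of the builder methods listed above might fit together, here is a self-contained sketch. Only the method names come from the list; the parameter types, the internal `Vec` of half-open ranges, and the inversion-list output of `build()` are assumptions made for this example, not the actual ICU4X implementation. The `*_set` and `retain_*` methods are omitted for brevity.

```rust
/// Immutable result of the builder (assumed to hold an inversion list).
pub struct UnicodeSet {
    inv_list: Vec<u32>,
}

impl UnicodeSet {
    pub fn contains(&self, c: char) -> bool {
        // Odd number of boundaries <= c means c is inside a range.
        self.inv_list.partition_point(|&b| b <= c as u32) % 2 == 1
    }
}

/// Hypothetical builder; ranges are half-open, sorted, and non-adjacent.
#[derive(Default)]
pub struct UnicodeSetBuilder {
    ranges: Vec<(u32, u32)>,
}

impl UnicodeSetBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn add_char(&mut self, c: char) {
        self.add_range(c, c);
    }

    /// Adds the inclusive range first..=last, merging overlapping or adjacent ranges.
    pub fn add_range(&mut self, first: char, last: char) {
        let (mut start, mut end) = (first as u32, last as u32 + 1);
        if start >= end {
            return; // empty range (first > last)
        }
        let mut merged = Vec::with_capacity(self.ranges.len() + 1);
        let mut placed = false;
        for &(s, e) in &self.ranges {
            if e < start {
                merged.push((s, e)); // entirely before the new range
            } else if s > end {
                if !placed {
                    merged.push((start, end));
                    placed = true;
                }
                merged.push((s, e)); // entirely after the new range
            } else {
                start = start.min(s); // overlapping or adjacent: absorb
                end = end.max(e);
            }
        }
        if !placed {
            merged.push((start, end));
        }
        self.ranges = merged;
    }

    pub fn remove_char(&mut self, c: char) {
        self.remove_range(c, c);
    }

    /// Removes the inclusive range first..=last, splitting ranges where needed.
    pub fn remove_range(&mut self, first: char, last: char) {
        let (start, end) = (first as u32, last as u32 + 1);
        if start >= end {
            return;
        }
        let mut kept = Vec::with_capacity(self.ranges.len() + 1);
        for &(s, e) in &self.ranges {
            if s < start {
                kept.push((s, e.min(start))); // part left of the removed span
            }
            if e > end {
                kept.push((s.max(end), e)); // part right of the removed span
            }
        }
        self.ranges = kept;
    }

    /// Complements over the code point values 0..0x110000.
    pub fn complement(&mut self) {
        let mut out = Vec::with_capacity(self.ranges.len() + 1);
        let mut prev = 0u32;
        for &(s, e) in &self.ranges {
            if s > prev {
                out.push((prev, s));
            }
            prev = e;
        }
        if prev < 0x11_0000 {
            out.push((prev, 0x11_0000));
        }
        self.ranges = out;
    }

    /// Consumes the builder and flattens the ranges into an inversion list.
    pub fn build(self) -> UnicodeSet {
        let inv_list = self.ranges.iter().flat_map(|&(s, e)| [s, e]).collect();
        UnicodeSet { inv_list }
    }
}
```

Having `build()` take `self` by value keeps the resulting `UnicodeSet` immutable, which matches the immutability goal raised earlier in the thread. A production builder would likely edit the inversion list in place rather than rebuilding a `Vec` on every call.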
@markusicu and I had a quick chat today about UnicodeSets and how they should be done. Quick notes and some questions are below.
UnicodeSet use cases:
UnicodeSet vs ICU tries:
UnicodeSet Documentation:
Character and string support:
Optimizations:
API design:
Questions for @zbraniecki, @Manishearth, and others:
Another approach is to use ICU4C to build ranges using existing machinery, and then serialize those into the Rust UnicodeSet. I don't like this approach much, since it complicates the toolchain and requires a checkout and build of ICU4C/J.