Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeSet implementation & API #91

Closed
nciric opened this issue May 14, 2020 · 9 comments
Closed

UnicodeSet implementation & API #91

nciric opened this issue May 14, 2020 · 9 comments
Assignees
Labels
C-unicode Component: Props, sets, tries T-core Type: Required functionality
Milestone

Comments

@nciric
Copy link
Contributor

nciric commented May 14, 2020

@markusicu and I had a quick chat about UnicodeSets and how should they be done today. Quick notes and some questions are below.

UnicodeSet use cases:

  • Initial use was support of transliterator logic in ICU.
  • Rule based break iterators
  • Remove white space from end of string - different white space definitions
  • Deep in HTML lexer - is it in word letter set vs interpunction character
  • Every time we have [a-z\p{Cyrillic}...] in regex, we build an UnicodeSet

UnicodeSet vs ICU tries:

  • set: maps code points or strings to boolean (in set or not in set)
  • code point trie: maps code points (only) to integers
  • string trie: maps sequences of 8-bit or 16-bit string units to integers

UnicodeSet Documentation:

Character and string support:

  • Basic implementation can support only characters
  • But short string support is strongly encouraged, esp Slovak zh, emoji etc.

Optimizations:

  • Ranges are represented as start-end values, e.g. ASCII 0-127
    • binary search is required to find if character belongs to a set
  • After set is frozen (ICU4C/J) an extra data structure is created that in most cases avoids binary search requirement - rough size 850 bytes
    • ASCII range uses bytes (direct mapped)
    • Rest of BMP uses 1-2bits per 64 char sets to define in, out, maybe (binary search required)
    • Above BMP we always use ranges and binary search

API design:

  • L1: User visible API should expose membership test, contains, span etc.
  • L2: Builder API is used to create sets - builders, intersects, unions, etc.
  • L3: Pattern API is used to create sets from patterns, like \p{Cyrillic}. This part requires Unicode character properties to match pattern with actual characters and ranges.

Questions for @zbraniecki, @Manishearth and etc:

  • Where do we put core data structures? Components? Utils? Common?
  • Internals of UnicodeSet should be invisible to the user. Should struct be private? (I'll look at Locale implementation to see how it can be done)
  • Should L1, L2 and L3 be in different crates, or in one? What would be the best way to organize the code/API - to maximally reduce deps.
    • icu-unicodeset-query, icu-unicodeset-builder, icu-unicodeset-pattern
    • pattern (L3) obviously depends on L2

Another approach is to use ICU4C to build ranges using existing machinery, and then serialize those into Rust UnicodeSet. I don't like this approach much since it complicates the tool chain, and requires checkout/build of ICU4C/J.

@nciric nciric added question Unresolved questions; type unclear T-core Type: Required functionality discuss Discuss at a future ICU4X-SC meeting labels May 14, 2020
@nciric nciric self-assigned this May 14, 2020
@nciric
Copy link
Contributor Author

nciric commented May 14, 2020

Notes from @kpozin :
I've written something similar for Fuchsia test cases, though it's not very optimized:
char_collection: docs, source

@nciric
Copy link
Contributor Author

nciric commented May 14, 2020

@kpozin maybe move your crate into ICU4X as UnicodeSet with some changes (add/remove APIs, immutability)? We can extend it over time with unions, intersects and do more optimization...

@kpozin
Copy link
Contributor

kpozin commented May 15, 2020

I can do that. Here's the external source link, by the way: https://fuchsia.googlesource.com/fuchsia/+/refs/heads/master/src/lib/intl/unicode_utils/char_collection/

@Manishearth
Copy link
Member

Internals of UnicodeSet should be invisible to the user. Should struct be private? (I'll look at Locale implementation to see how it can be done)

You can have private fields, yes. We should mostly be using private fields.

Should L1, L2 and L3 be in different crates, or in one? What would be the best way to organize the code/API - to maximally reduce deps.

Either works, if they're in one crate we might want to add features?

One solution is to have L3 be a feature because it pulls in additional dependencies.

@sffc sffc added the C-unicode Component: Props, sets, tries label May 15, 2020
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jun 4, 2020
@zbraniecki
Copy link
Member

wrt. builder pattern, I filed - #112

@sffc
Copy link
Member

sffc commented Jun 8, 2020

I've shared this in meetings, but I'll post it here for posterity:

I think having a clear delineation between L1, L2, and L3 is a key part of this design.

L1 UnicodeSet

Based on my experience, I think it's safe to say that L1 is by far the most common use case for runtime code. Clients can build sets offline, and load/evaluate them at runtime, without needing to ship the builder code with their app.

The most common operation is to check if a character is in a certain Unicode binary property, and we can ship pre-built data for that with ICU4X.

Therefore, I think the most important problem to solve with ICU4X UnicodeSet is a serialized data format for the UnicodeSet and small code to load and evaluate it.

L2 UnicodeSet

I think it is useful for ICU4X to have a builder, but it's not as critical as L1.

The builder should output the serialized format that can be saved to a data file, so that clients can save it and ship it with their app. Eventually (long-term goal), we could wire up a proc macro to build the structure offline during cargo build.

In the mean time, we should make sure ICU4C can emit the same serialized format, such that we can fully leverage the existing machinery for building ICU4X data files.

L3 UnicodeSet

As others have stated, this is a longer term goal. I think ICU4X will likely want to have this, but it doesn't need to ship in version 1.

Side note: I think L3 UnicodeSet need not depend on L2 UnicodeSet. If we parse the pattern string, we could emit a structure of L1 UnicodeSets from data matching the pattern string, which might be smaller code size than shipping L2 UnicodeSet at runtime.

@sffc sffc assigned echeran and unassigned nciric Jun 8, 2020
@sffc
Copy link
Member

sffc commented Jun 8, 2020

Assigning to @echeran since @EvanJP isn't in unicode-org yet.

@sffc sffc added this to the 2020 Q3 milestone Jun 17, 2020
@sffc sffc assigned EvanJP and unassigned echeran Jul 23, 2020
@EvanJP EvanJP linked a pull request Jul 23, 2020 that will close this issue
@EvanJP EvanJP removed a link to a pull request Jul 23, 2020
EvanJP added a commit that referenced this issue Aug 25, 2020
The L2 Builder API as discussed in #91.

Provides:

- UnicodeSetBuilder::new()
- UnicodeSetBuilder::build() -> UnicodeSet
- add_char, add_range, add_set
- remove_char, remove_range, remove_set
- retain_char, retain_range, retain_set
- complement, complement_char, complement_range, complement_set
- Cargo Docs
@sffc
Copy link
Member

sffc commented Oct 22, 2020

L1 and L2 are implemented. L3 is being tracked in #168 and #217. I will therefore close this meta-issue.

@sffc sffc closed this as completed Oct 22, 2020
@sffc sffc removed the question Unresolved questions; type unclear label Oct 22, 2020
@sffc
Copy link
Member

sffc commented Oct 24, 2020

L1 and L2 are done in #122 and #195, but I'm having trouble retroactively linking the PRs. Awaiting a response from GitHub support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

7 participants