UnicodeSet implementation & API #91

nciric · 2020-05-14T22:35:18Z

@markusicu and I had a quick chat about UnicodeSets and how should they be done today. Quick notes and some questions are below.

UnicodeSet use cases:

Initial use was support of transliterator logic in ICU.
Rule based break iterators
Remove white space from end of string - different white space definitions
Deep in HTML lexer - is it in word letter set vs interpunction character
Every time we have [a-z\p{Cyrillic}...] in regex, we build an UnicodeSet

UnicodeSet vs ICU tries:

set: maps code points or strings to boolean (in set or not in set)
code point trie: maps code points (only) to integers
string trie: maps sequences of 8-bit or 16-bit string units to integers

UnicodeSet Documentation:

UnicodeSet
Properties
ICU 67.1: icu::UnicodeSet Class Reference
UnicodeSet ICU4J 67.1

Character and string support:

Basic implementation can support only characters
But short string support is strongly encouraged, esp Slovak zh, emoji etc.

Optimizations:

Ranges are represented as start-end values, e.g. ASCII 0-127
- binary search is required to find if character belongs to a set
After set is frozen (ICU4C/J) an extra data structure is created that in most cases avoids binary search requirement - rough size 850 bytes
- ASCII range uses bytes (direct mapped)
- Rest of BMP uses 1-2bits per 64 char sets to define in, out, maybe (binary search required)
- Above BMP we always use ranges and binary search

API design:

L1: User visible API should expose membership test, contains, span etc.
L2: Builder API is used to create sets - builders, intersects, unions, etc.
L3: Pattern API is used to create sets from patterns, like \p{Cyrillic}. This part requires Unicode character properties to match pattern with actual characters and ranges.

Questions for @zbraniecki, @Manishearth and etc:

Where do we put core data structures? Components? Utils? Common?
Internals of UnicodeSet should be invisible to the user. Should struct be private? (I'll look at Locale implementation to see how it can be done)
Should L1, L2 and L3 be in different crates, or in one? What would be the best way to organize the code/API - to maximally reduce deps.
- icu-unicodeset-query, icu-unicodeset-builder, icu-unicodeset-pattern
- pattern (L3) obviously depends on L2

Another approach is to use ICU4C to build ranges using existing machinery, and then serialize those into Rust UnicodeSet. I don't like this approach much since it complicates the tool chain, and requires checkout/build of ICU4C/J.

nciric · 2020-05-14T23:51:32Z

Notes from @kpozin :
I've written something similar for Fuchsia test cases, though it's not very optimized:
char_collection: docs, source

nciric · 2020-05-14T23:56:19Z

@kpozin maybe move your crate into ICU4X as UnicodeSet with some changes (add/remove APIs, immutability)? We can extend it over time with unions, intersects and do more optimization...

kpozin · 2020-05-15T00:08:16Z

I can do that. Here's the external source link, by the way: https://fuchsia.googlesource.com/fuchsia/+/refs/heads/master/src/lib/intl/unicode_utils/char_collection/

Manishearth · 2020-05-15T04:25:52Z

Internals of UnicodeSet should be invisible to the user. Should struct be private? (I'll look at Locale implementation to see how it can be done)

You can have private fields, yes. We should mostly be using private fields.

Should L1, L2 and L3 be in different crates, or in one? What would be the best way to organize the code/API - to maximally reduce deps.

Either works, if they're in one crate we might want to add features?

One solution is to have L3 be a feature because it pulls in additional dependencies.

zbraniecki · 2020-06-04T18:21:09Z

wrt. builder pattern, I filed - #112

sffc · 2020-06-08T23:15:41Z

I've shared this in meetings, but I'll post it here for posterity:

I think having a clear delineation between L1, L2, and L3 is a key part of this design.

L1 UnicodeSet

Based on my experience, I think it's safe to say that L1 is by far the most common use case for runtime code. Clients can build sets offline, and load/evaluate them at runtime, without needing to ship the builder code with their app.

The most common operation is to check if a character is in a certain Unicode binary property, and we can ship pre-built data for that with ICU4X.

Therefore, I think the most important problem to solve with ICU4X UnicodeSet is a serialized data format for the UnicodeSet and small code to load and evaluate it.

L2 UnicodeSet

I think it is useful for ICU4X to have a builder, but it's not as critical as L1.

The builder should output the serialized format that can be saved to a data file, so that clients can save it and ship it with their app. Eventually (long-term goal), we could wire up a proc macro to build the structure offline during cargo build.

In the mean time, we should make sure ICU4C can emit the same serialized format, such that we can fully leverage the existing machinery for building ICU4X data files.

L3 UnicodeSet

As others have stated, this is a longer term goal. I think ICU4X will likely want to have this, but it doesn't need to ship in version 1.

Side note: I think L3 UnicodeSet need not depend on L2 UnicodeSet. If we parse the pattern string, we could emit a structure of L1 UnicodeSets from data matching the pattern string, which might be smaller code size than shipping L2 UnicodeSet at runtime.

sffc · 2020-06-08T23:17:09Z

Assigning to @echeran since @EvanJP isn't in unicode-org yet.

The L2 Builder API as discussed in #91. Provides: - UnicodeSetBuilder::new() - UnicodeSetBuilder::build() -> UnicodeSet - add_char, add_range, add_set - remove_char, remove_range, remove_set - retain_char, retain_range, retain_set - complement, complement_char, complement_range, complement_set - Cargo Docs

sffc · 2020-10-22T17:57:14Z

L1 and L2 are implemented. L3 is being tracked in #168 and #217. I will therefore close this meta-issue.

sffc · 2020-10-24T08:04:25Z

L1 and L2 are done in #122 and #195, but I'm having trouble retroactively linking the PRs. Awaiting a response from GitHub support.

nciric added question Unresolved questions; type unclear T-core Type: Required functionality discuss Discuss at a future ICU4X-SC meeting labels May 14, 2020

nciric self-assigned this May 14, 2020

sffc added the C-unicode Component: Props, sets, tries label May 15, 2020

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jun 4, 2020

sffc assigned echeran and unassigned nciric Jun 8, 2020

sffc added this to the 2020 Q3 milestone Jun 17, 2020

sffc mentioned this issue Jun 17, 2020

What are the key low-level data structures we need to support ECMA-402? #17

Closed

sffc mentioned this issue Jun 24, 2020

API for accessing Unicode properties #148

Closed

6 tasks

sffc assigned EvanJP and unassigned echeran Jul 23, 2020

EvanJP linked a pull request Jul 23, 2020 that will close this issue

UnicodeSet L1 Work #122

Merged

EvanJP removed a link to a pull request Jul 23, 2020

UnicodeSet L1 Work #122

Merged

EvanJP mentioned this issue Aug 4, 2020

UnicodeSet Builder L2 #195

Merged

sffc closed this as completed Oct 22, 2020

sffc removed the question Unresolved questions; type unclear label Oct 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeSet implementation & API #91

UnicodeSet implementation & API #91

nciric commented May 14, 2020

nciric commented May 14, 2020

nciric commented May 14, 2020

kpozin commented May 15, 2020

Manishearth commented May 15, 2020

zbraniecki commented Jun 4, 2020

sffc commented Jun 8, 2020

sffc commented Jun 8, 2020

sffc commented Oct 22, 2020

sffc commented Oct 24, 2020

UnicodeSet implementation & API #91

UnicodeSet implementation & API #91

Comments

nciric commented May 14, 2020

nciric commented May 14, 2020

nciric commented May 14, 2020

kpozin commented May 15, 2020

Manishearth commented May 15, 2020

zbraniecki commented Jun 4, 2020

sffc commented Jun 8, 2020

L1 UnicodeSet

L2 UnicodeSet

L3 UnicodeSet

sffc commented Jun 8, 2020

sffc commented Oct 22, 2020

sffc commented Oct 24, 2020