UnicodeSet Properties from PPUCD (L3a) - Binary Properties #217

echeran · 2020-08-28T01:59:19Z

This issue represents the first part ("L3a") of the functionality we want from instantiating UnicodeSets that represent Unicode Properties (#168 - "L3").

Starting from the L3a design doc that @EvanJP created, this issue covers a smaller, better-defined subset of user input than the full L3 functionality. For a specified short list of binary and enum properties, whose data is collected and regularized in the Pre-Parsed Unicode Character Database (PPUCD), we want to create a separate static constructor-like function that creates the UnicodeSet for each property.

sffc · 2020-09-03T17:54:26Z

Discussion from 2020-09-03:

Functions or Enums?

@mihnita: Can we use an enum instead of a function?
@sffc: I prefer functions because (1) programmers know at build time which properties they want to query, so by having different functions, we can do code slicing; and (2) different properties have different return types (UnicodeSet or UCPTrie).
@mihnita: At runtime, programmers might not know whether a character is lowercase or uppercase, for example. They will want to loop over multiple properties to test.
@EvanJP: Could we have both?
@sffc: For the uppercase/lowercase use case, we can have multiple functions: one for getting the set of lowercase characters via UnicodeSet ([:Ll:]), and one for getting the type of letter via UCPTrie.
@sffc: It sounds like this is an ergonomic versus functional API argument.
@echeran: We should also have the pattern parsing for the more sophisticated operations. But we should still have the individual functions for code slicing purposes.
@sffc: Since the functional API (individual functions) is the building block, we should start with that.
@mihnita: OK
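
To make the two API shapes concrete, here is a minimal sketch of per-property functions with an enum-keyed lookup layered on top. Every name in it (`CodePointSet`, `ascii_hex_digit`, `join_control`, `BinaryProperty`, `binary_property`, `properties_of`) is hypothetical and is not the ICU4X API; `CodePointSet` is only a stand-in for UnicodeSet. The range data is hard-coded for two small properties rather than read from PPUCD.

```rust
/// Stand-in for UnicodeSet: a list of inclusive code point ranges.
#[derive(Debug, PartialEq)]
pub struct CodePointSet {
    ranges: Vec<(u32, u32)>,
}

impl CodePointSet {
    /// Returns true if `c` falls inside any of the ranges.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        self.ranges.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
    }
}

// Functional API: one function per property, so a caller that needs a
// single property only references that property's function and data.

/// ASCII_Hex_Digit: 0-9, A-F, a-f.
pub fn ascii_hex_digit() -> CodePointSet {
    CodePointSet { ranges: vec![(0x30, 0x39), (0x41, 0x46), (0x61, 0x66)] }
}

/// Join_Control: U+200C and U+200D.
pub fn join_control() -> CodePointSet {
    CodePointSet { ranges: vec![(0x200C, 0x200D)] }
}

// Ergonomic API: an enum-keyed lookup built on the functions above.
// It references every property function, so it gives up the slicing benefit.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BinaryProperty {
    AsciiHexDigit,
    JoinControl,
}

pub fn binary_property(p: BinaryProperty) -> CodePointSet {
    match p {
        BinaryProperty::AsciiHexDigit => ascii_hex_digit(),
        BinaryProperty::JoinControl => join_control(),
    }
}

/// Runtime loop over several properties, as in the use case raised above.
pub fn properties_of(c: char) -> Vec<BinaryProperty> {
    [BinaryProperty::AsciiHexDigit, BinaryProperty::JoinControl]
        .into_iter()
        .filter(|&p| binary_property(p).contains(c))
        .collect()
}
```

A caller that knows its property at build time would call `ascii_hex_digit()` directly, while a caller that needs to test several properties at runtime would go through `binary_property` or `properties_of`.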

Code Generation?

@sffc: I think manual would be OK because (1) properties don't get added that often, and when they do, we can have a human check for consistency; and (2) because we may want to write our own documentation for each individual function, and code generation may not be able to generate docs.
@mihnita: Manual maintenance happens so rarely that I'm not convinced it's worth the trouble to automate.
@EvanJP: The concern I have is that if we write these functions manually, how do we test them? With code generation, we could also generate tests.
@sffc: The rustdoc tests could have a positive example and a negative example, and that might be sufficient test coverage.
@EvanJP: Properties are complex, and I feel that it would be useful to generate more robust tests.
@sffc: For data-driven testing, we could implement that on top of L3b.
@sffc: Code generation is messy. ICU has a lot of it. If we have code generation, we shouldn't check in the artifacts.
@zbraniecki: For Rust code generation, you can have a "cargo regenerate data". And since code generation is in Rust, dependencies are easier to resolve. So code generation in Rust is a bit nicer than C++. But I can't comment on long-term maintainability.
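
As an illustration of the rustdoc-test idea, a hand-written property function could carry one positive and one negative assertion in its documentation. This is a sketch only: the crate name `props_sketch` and the functions `join_control` and `contains` are hypothetical, and the return type is a plain slice of ranges rather than UnicodeSet.

```rust
/// Returns the inclusive code point ranges for the `Join_Control`
/// binary property.
///
/// # Examples
///
/// ```
/// // `props_sketch` is a hypothetical crate name used for illustration.
/// use props_sketch::{contains, join_control};
///
/// // Positive example: U+200D ZERO WIDTH JOINER has Join_Control.
/// assert!(contains(join_control(), '\u{200D}'));
/// // Negative example: U+200B ZERO WIDTH SPACE does not.
/// assert!(!contains(join_control(), '\u{200B}'));
/// ```
pub fn join_control() -> &'static [(u32, u32)] {
    // Join_Control consists of U+200C and U+200D only.
    &[(0x200C, 0x200D)]
}

/// Returns true if `c` falls inside any of the inclusive `ranges`.
pub fn contains(ranges: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    ranges.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}
```

With this layout, `cargo test` runs the documented example as a doctest, so the documentation and the minimal test coverage live in the same place.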

Next steps

@EvanJP: I can take a shot at it this weekend. I need to read more about the ppucd file.
@sffc: A good first step would be generating an inversion list for your favorite binary property from ppucd.
@echeran: I'm happy to help you out with this too.
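
For reference, the inversion-list step could look roughly like the sketch below. It assumes the code points that have the property were already extracted from the ppucd file; parsing that file is not shown, and the function names `build_inversion_list` and `contains` are hypothetical. The list stores alternating boundaries: an inclusive range start followed by an exclusive range end.

```rust
/// Builds an inversion list from an unordered collection of code points.
/// The result is [start0, end0, start1, end1, ...], where each start is
/// inclusive and each end is exclusive.
pub fn build_inversion_list(code_points: &[u32]) -> Vec<u32> {
    let mut sorted = code_points.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut inv: Vec<u32> = Vec::new();
    for cp in sorted {
        if inv.last() == Some(&cp) {
            // `cp` is adjacent to the current range: extend its exclusive end.
            let end = inv.len() - 1;
            inv[end] = cp + 1;
        } else {
            // Start a new single-code-point range [cp, cp + 1).
            inv.push(cp);
            inv.push(cp + 1);
        }
    }
    inv
}

/// Membership test by binary search: `cp` is in the set when an odd
/// number of boundaries are less than or equal to it.
pub fn contains(inv: &[u32], cp: u32) -> bool {
    inv.partition_point(|&boundary| boundary <= cp) % 2 == 1
}
```

Feeding in the 22 code points of ASCII_Hex_Digit (`0`-`9`, `A`-`F`, `a`-`f`) yields the six boundaries `[0x30, 0x3A, 0x41, 0x47, 0x61, 0x67]`.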

echeran · 2020-11-19T04:57:35Z

I finished converting the L3 design doc into an L3a-specific design doc, at the same place:
https://docs.google.com/document/d/10z0RK7WC7pVOIP9fCZQfcGKCoMvsOcXZIpr86UN9M1Q/edit#

I hope that captures the scope of what we're currently considering and what the main tradeoffs are. I've marked which items are open questions with alternative options, even if some have more obvious conclusions than others.

echeran self-assigned this Aug 28, 2020

sffc added the discuss Discuss at a future ICU4X-SC meeting label Aug 28, 2020

zbraniecki mentioned this issue Sep 3, 2020

ICU4X 0.1 #204

Closed

sffc added T-core Type: Required functionality C-unicode Component: Props, sets, tries and removed discuss Discuss at a future ICU4X-SC meeting labels Sep 3, 2020

sffc mentioned this issue Sep 6, 2020

Segmenter #109

Closed

sffc added this to the ICU4X 0.1 milestone Sep 11, 2020

echeran mentioned this issue Sep 14, 2020

Manually-defined UnicodeSets for binary Unicode properties #242

Merged

sffc mentioned this issue Sep 24, 2020

[uniset] Should UnicodeSetBuilder::add_range accept owned RangeBounds? #267

Closed

zbraniecki mentioned this issue Oct 9, 2020

ICU4X 0.2 #239

Closed

16 tasks

zbraniecki modified the milestones: ICU4X 0.1, ICU4X 0.2 Oct 9, 2020

sffc mentioned this issue Oct 22, 2020

UnicodeSet implementation & API #91

Closed

sffc modified the milestones: ICU4X 0.2, 2020 Q4 Nov 19, 2020

echeran modified the milestones: 2020 Q4, 2021-Q1-m1 Jan 15, 2021

sffc changed the title ~~UnicodeSet Properties from PPUCD (L3a)~~ UnicodeSet Properties from PPUCD (L3a) - Binary Properties Feb 4, 2021

sffc closed this as completed Feb 4, 2021

sffc linked a pull request Feb 18, 2021 that will close this issue

Manually-defined UnicodeSets for binary Unicode properties #242

Merged
