Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeSet Properties from PPUCD (L3a) - Binary Properties #217

Closed
echeran opened this issue Aug 28, 2020 · 2 comments · Fixed by #242
Closed

UnicodeSet Properties from PPUCD (L3a) - Binary Properties #217

echeran opened this issue Aug 28, 2020 · 2 comments · Fixed by #242
Assignees
Labels
C-unicode Component: Props, sets, tries T-core Type: Required functionality
Milestone

Comments

@echeran
Copy link
Contributor

echeran commented Aug 28, 2020

This issue represents the first part ("L3a") of the functionality we want from instantiating UnicodeSets that represent Unicode Properties (#168 - "L3").

Starting from the L3a design doc that that @EvanJP created, this issue represents supporting only a more well-defined subset of user input than in full L3 functionality. For a specified short list of binary and enum properties, whose data is collected and regularized in the Pre-Parsed Unicode Character Database (PPUCD), we want to create a separate static constructor-like function that can create the UnicodeSet for that property.

@echeran echeran self-assigned this Aug 28, 2020
@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Aug 28, 2020
@zbraniecki zbraniecki mentioned this issue Sep 3, 2020
@sffc
Copy link
Member

sffc commented Sep 3, 2020

Discussion from 2020-09-03:

Functions or Enums?

  • @mihnita: Can we use an enum instead of a function?
  • @sffc: I prefer functions because (1) programmers know at build time which properties they want to query, so by having different functions, we can do code slicing; and (2) different properties have different return types (UnicodeSet or UCPTrie).
  • @mihnita: At runtime, programmers might not know whether a character is lowercase or uppercase, for example. They will want to loop over multiple properties to test.
  • @EvanJP: Could we have both?
  • @sffc: For the uppercase/lowercase use, we can have multiple functions: one for getting the set of lowercase characters via UnicodeSet ([:Lu:]), and one for getting the type of letter via UCPTrie.
  • @sffc: It sounds like this is an ergonomic versus functional API argument.
  • @echeran: We should also have the pattern parsing for the more sophisticated operations. But we should still have the individual functions for code slicing purposes.
  • @sffc: Since the functional API (individual functions) is the building block, we should start with that.
  • @mihnita: OK

Code Generation?

  • @sffc: I think manual would be OK because (1) properties don't get added that often, and when they do, we can have a human check for consistency; and (2) because we may want to write our own documentation for each individual function, and code generation may not be able to generate docs.
  • @mihnita: Manual maintenance happens so rarely that I'm not convinced it's worth the trouble to automate.
  • @EvanJP: The concern I have is that if we manually generate these functions, how do we test them? With code generation, we could also generate tests.
  • @sffc: The rustdoc tests could have a positive example and a negative example, and that might be sufficient test coverage.
  • @EvanJP: Properties are complex, and I feel that it would be useful to generate more robust tests.
  • @sffc: For data-driven testing, we could implement that on top of L3b.
  • @sffc: Code generation is messy. ICU has a lot of it. If we have code generation, we shouldn't check in the artifacts.
  • @zbraniecki: For Rust code generation, you can have a "cargo regenerate data". And since code generation is in Rust, dependencies are easier to resolve. So code generation in Rust is a bit nicer than C++. But I can't comment on long-term maintainability.

Next steps

  • @EvanJP: I can take a shot at it this weekend. I need to read more about the ppucd file.
  • @sffc: A good first step would be generating an inversion list for your favorite binary property from ppucd.
  • @echeran: I'm happy to help you out with this too.

@sffc sffc added T-core Type: Required functionality C-unicode Component: Props, sets, tries and removed discuss Discuss at a future ICU4X-SC meeting labels Sep 3, 2020
@sffc sffc mentioned this issue Sep 6, 2020
@sffc sffc added this to the ICU4X 0.1 milestone Sep 11, 2020
@zbraniecki zbraniecki mentioned this issue Oct 9, 2020
16 tasks
@zbraniecki zbraniecki modified the milestones: ICU4X 0.1, ICU4X 0.2 Oct 9, 2020
@echeran
Copy link
Contributor Author

echeran commented Nov 19, 2020

I finished converting the L3 design doc into an L3a-specific design doc, at the same place:
https://docs.google.com/document/d/10z0RK7WC7pVOIP9fCZQfcGKCoMvsOcXZIpr86UN9M1Q/edit#

I hope that captures the scope of what we're currently considering, and what are the main tradeoffs. I've marked off which are questions with alternate options, even if some might have more obvious conclusions than others.

@sffc sffc modified the milestones: ICU4X 0.2, 2020 Q4 Nov 19, 2020
@echeran echeran modified the milestones: 2020 Q4, 2021-Q1-m1 Jan 15, 2021
@sffc sffc changed the title UnicodeSet Properties from PPUCD (L3a) UnicodeSet Properties from PPUCD (L3a) - Binary Properties Feb 4, 2021
@sffc sffc closed this as completed Feb 4, 2021
@sffc sffc linked a pull request Feb 18, 2021 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants