Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeSet L1 Work #122

Merged
merged 30 commits into from
Jul 23, 2020
Merged

UnicodeSet L1 Work #122

merged 30 commits into from
Jul 23, 2020

Conversation

EvanJP
Copy link
Contributor

@EvanJP EvanJP commented Jun 12, 2020

  • Main file is uniset.rs
  • Construct UnicodeSet from serialized ICU4C/J UnicodeSet data
    • Serialization of UnicodeSet data will be handled in separate ICU4J driver program
    • Takes in a Vec<u32> to create the UnicodeSet or RangeBounds<Char>
  • UnicodeSet Data Structure and Membership Functions
    • Data Structure
      • Inversion list
    • Membership Functions
      • contains(char) and contains_range(RangeBounds<char>)
      • size()
      • ranges()
      • iter()
      • isEmpty()

@coveralls
Copy link

coveralls commented Jun 16, 2020

Pull Request Test Coverage Report for Build c8c149675ccc01e08099abf610a5d8cee960de86-PR-122

  • 269 of 273 (98.53%) changed or added relevant lines in 4 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+2.9%) to 90.113%

Changes Missing Coverage Covered Lines Changed/Added Lines %
components/uniset/src/utils.rs 58 59 98.31%
components/uniset/src/uniset.rs 139 142 97.89%
Files with Coverage Reduction New Missed Lines %
components/data-provider/src/decimal.rs 1 87.5%
components/data-provider-json/src/lib.rs 2 89.66%
Totals Coverage Status
Change from base Build a0d0cfec033b6db6a500c927cca7bc7e5beff280: 2.9%
Covered Lines: 957
Relevant Lines: 1062

💛 - Coveralls

components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
@zbraniecki zbraniecki requested a review from Manishearth June 23, 2020 18:52
components/char_collection/BUILD.gn Outdated Show resolved Hide resolved
components/char_collection/Cargo.toml Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
components/char_collection/src/conversions.rs Outdated Show resolved Hide resolved
components/char_collection/src/conversions.rs Outdated Show resolved Hide resolved
components/char_collection/src/lib.rs Outdated Show resolved Hide resolved
components/char_collection/src/char_collection.rs Outdated Show resolved Hide resolved
@Manishearth
Copy link
Member

It occurs to me if we use ranges here then this can actually be made into a generic crate that has nothing to do with chars and applies to any Ord+Eq ranges. Not something we have to do, but something we can consider!

components/char_collection/src/uniset.rs Outdated Show resolved Hide resolved
components/char_collection/src/uniset.rs Outdated Show resolved Hide resolved
components/char_collection/src/uniset.rs Outdated Show resolved Hide resolved
components/char_collection/src/uniset.rs Outdated Show resolved Hide resolved
@EvanJP
Copy link
Contributor Author

EvanJP commented Jun 23, 2020

@Manishearth @zbraniecki @nciric

Thank you all for the comments! Unfortunately, I wasn't expecting to have my code reviewed so quickly, and failed to mention that most of char_collection.rs, conversions.rs, and macros.rs was being used a reference and is going to be removed and not used. This is an oversight on my part, especially since the folder is named char_collection and I apologize. While I wasn't expecting to have it reviewed so quickly, I greatly appreciated it and I hope that you are all willing to review the uniset.rs file in turn as that is the major file for UnicodeSet L1.

I still took all of your feedback into consideration for uniset.rs, and the latest commit hopefully reflects that. There are implementation details that differ from char_collection.rs quite a bit so please look at the doc here: https://docs.google.com/document/d/1usXoudAnwCx19pQ18qoNMLTAcHW7IuAE6CuosnmtS5s/edit?usp=sharing and the code in uniset.rs as a reference.

@EvanJP EvanJP changed the title WIP: UnicodeSet L1 Work UnicodeSet L1 Work Jul 15, 2020
zbraniecki
zbraniecki previously approved these changes Jul 15, 2020
Copy link
Member

@zbraniecki zbraniecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good! I left some nits/suggestions and request for benchmarks.

I see a number of tests that test for 'A'..'z' or 'A'.. or 'a'..'z' which I recognize from higher level uses like regexps, parsers, etc.
Some of those would work really well as "real world" benchmarks, but I don't know which.

If you think we should iron out benchmarks after we land it, I'm comfortable with an issue filed to kick off the conversation and not blocking landing of this PR any longer.

components/uniset/benches/inv_list.rs Outdated Show resolved Hide resolved
components/uniset/benches/inv_list.rs Outdated Show resolved Hide resolved
components/uniset/src/conversions.rs Outdated Show resolved Hide resolved
components/uniset/src/utils.rs Outdated Show resolved Hide resolved
components/uniset/src/utils.rs Show resolved Hide resolved
components/uniset/src/uniset.rs Outdated Show resolved Hide resolved
components/uniset/src/uniset.rs Show resolved Hide resolved
zbraniecki
zbraniecki previously approved these changes Jul 15, 2020
Copy link
Member

@zbraniecki zbraniecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

One more optional nit: you include std::char::MAX in multiple places and then reference it as MAX.
It requires the reader to jump to the top of the file to find if MAX is a local constant, or reference and which MAX is it (char::MAX, f64::MAX, etc.).

When dealing with such generic variables (MAX, MIN, Error) we usually keep its module name to make it easier to read: use std::char; char::MAX.

@sffc
Copy link
Member

sffc commented Jul 16, 2020

I took the liberty to mark resolved or obsolete threads from 20+ days ago from @filmil and @nciric as resolved.

There are still a few open comments from @zbraniecki, like using unreachable!(). You can see them in the Conversation view on the PR.

I also agree with using char::MAX instead of MAX.

filmil
filmil previously approved these changes Jul 16, 2020
@sffc sffc removed the request for review from Manishearth July 16, 2020 17:21
@EvanJP EvanJP dismissed stale reviews from filmil and zbraniecki via 845bc35 July 16, 2020 17:21
@sffc sffc requested review from Manishearth and removed request for Manishearth July 16, 2020 17:22
components/uniset/src/lib.rs Outdated Show resolved Hide resolved
components/uniset/src/uniset.rs Outdated Show resolved Hide resolved
components/uniset/src/uniset.rs Outdated Show resolved Hide resolved
components/uniset/src/uniset.rs Show resolved Hide resolved
components/uniset/src/uniset.rs Outdated Show resolved Hide resolved
components/uniset/src/uniset.rs Outdated Show resolved Hide resolved
components/uniset/src/uniset.rs Show resolved Hide resolved
sffc
sffc previously approved these changes Jul 17, 2020
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! All of my concerns are addressed.

components/uniset/src/uniset.rs Show resolved Hide resolved
@zbraniecki
Copy link
Member

Advantages of a flat array:

Serialization: a flat array is simpler and smaller than nested arrays in JSON and other serialization formats.

We can have serializer/deserializer that reads [0, 10, 30, 50] into Vec<Range<u32>>.

I don't understand arguments about semantics being cleaner. Clients should never have to touch this structure; it's internal to ICU4X, except for serializing to JSON.

You have to continuously remember that the list has to be even length, you need to always know if you're on an even or odd number to understand in the start or end of a range etc.
Vec<Range<usize>> makes Rust ensure that you have the right structure, making it impossible to make errors, and allowing you to easily work with each range knowing you're at the start or end.

@zbraniecki
Copy link
Member

To be clear - I'm not blocking on the Range issue. I'm relaying the decision to @EvanJP and @sffc and am comfortable with their position. If we'll find more reasons to reconsider later, we always can.

@sffc
Copy link
Member

sffc commented Jul 18, 2020

We can have serializer/deserializer that reads [0, 10, 30, 50] into Vec<Range>.

Yeah, but why complicate things when currently we can take ownership of the Vec<u32> directly from data, and not perform any other logic besides validation?

You have to continuously remember that the list has to be even length

Technically the list doesn't actually need to be an even length. A missing endpoint just means that the final range has no upper bound. I don't actually know why we're enforcing an even length.

@EvanJP EvanJP merged commit 1e3b00e into unicode-org:master Jul 23, 2020
@EvanJP EvanJP linked an issue Jul 23, 2020 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants