Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Unicode case folding support #14

Merged
merged 3 commits into from
Jul 7, 2016
Merged

add Unicode case folding support #14

merged 3 commits into from
Jul 7, 2016

Conversation

seanmonstar
Copy link
Owner

No description provided.

@@ -110,7 +171,7 @@ macro_rules! from_impl {
($from:ty => $to:ty; $by:ident) => (
impl<'a> From<$from> for UniCase<$to> {
fn from(s: $from) -> Self {
UniCase(s.$by())
UniCase::new(s.$by())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be quite a hidden cost in generic code if it will incur a full scan for a potentially large string (compared to how it should be cheap for failed equality tests). I think it would be better to require the user to opt-in to the potential ASCII optimisation.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the From impls could just jump straight to Encoding::Unicode, is that what you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep!

@seanmonstar
Copy link
Owner Author

@Ryman do you find value in there being UniCase::ascii("foo") and UniCase::unicode("foo") constructors, that skip the is_ascii check of the new constructor?

Then someone could choose to skip the check if they were certain is was one or the other.

@Ryman
Copy link
Contributor

Ryman commented Jul 1, 2016

UniCase::ascii(..) seems nice compared to having to import the Ascii type!

Perhaps it's overkill, but if you make the Ascii constructor private except through that then you could add a debug_assert!(input.is_ascii()) to the constructor.

@seanmonstar
Copy link
Owner Author

I had added the Ascii type because knowing beforehand that both are ASCII can be a noticeable speed improvement. The match in UniCase.eq made the ASCII benchmarks 2x slower (probably less noticeable on bigger inputs).

So I created the Ascii type so something like hyper can use Ascii<Cow<'static, str>> and skip the if both_are_ascii branch.

@Ryman
Copy link
Contributor

Ryman commented Jul 1, 2016

I'm surprised that it would take a 2x perf hit (unless you mean single char inputs perhaps) but if the type is useful, perhaps limiting it's construction to a function as you've done is the right thing to do.

@seanmonstar
Copy link
Owner Author

Here's the benchmarks, to show what I mean:

test ascii::tests::bench_ascii_eq              ... bench:           6 ns/iter (+/- 0) = 1000 MB/s
test tests::bench_unicase_ascii                ... bench:          11 ns/iter (+/- 9) = 545 MB/s
test unicode::tests::bench_ascii_folding       ... bench:         204 ns/iter (+/- 3) = 34 MB/s
test unicode::tests::bench_simple_case_folding ... bench:         267 ns/iter (+/- 8) = 52 MB/s
  1. Ascii, which goes straight to eq_ignore_case.
  2. UniCase::new("foo bar"), which internally becomes UniCase(Encoding::Ascii(Ascii("foo bar"))), means that before it can eq_ignore_case, it has to check that both itself and the other are Encoding::Ascii(..) variants, hence the 11ns/iter instead of 6ns/iter.
  3. UniCase::unicode("foo bar") shows the cost of the bigger lookup table and dealing with chars.
  4. UniCase::new(some_unicode_str) as another comparison.

@Ryman
Copy link
Contributor

Ryman commented Jul 6, 2016

Sorry that I wasn't clear enough, but I was trying to say that if there is a benchmarked difference for useful inputs then having Ascii::new available seems like a legitimately good thing.

For curiosity's sake, I checked out and expanded the benchmarks a bit though (assuming they're the ones on the branch):

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_unicase_ascii(b: &mut ::test::Bencher) {
        b.bytes = b"foobar".len() as u64;
        let x = UniCase::new("foobar");
        let y = UniCase::new("FOOBAR");
        b.iter(|| assert_eq!(x, y));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_unicase_ascii_header(b: &mut ::test::Bencher) {
        let subject = "Content-Type";
        b.bytes = subject.len() as u64;
        let x = UniCase::new(subject);
        let y = UniCase::new(subject.to_lowercase());
        b.iter(|| assert_eq!(x, y));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_unicase_ascii_long(b: &mut ::test::Bencher) {
        let subject = ::std::str::from_utf8(SUBJECT).unwrap();
        b.bytes = subject.len() as u64;
        let x = UniCase::new(subject);
        let y = UniCase::new(subject.to_lowercase());
        b.iter(|| assert_eq!(x, y));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_ascii_eq(b: &mut ::test::Bencher) {
        use Ascii;
        b.bytes = b"foobar".len() as u64;
        b.iter(|| assert_eq!(Ascii("foobar"), Ascii("FOOBAR")));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_ascii_eq_header(b: &mut ::test::Bencher) {
        use Ascii;
        let left = "Content-Type";
        let right = &left.to_lowercase();
        b.bytes = left.len() as u64;
        b.iter(|| assert_eq!(Ascii(left), Ascii(right)));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_ascii_eq_long(b: &mut ::test::Bencher) {
        use Ascii;
        let left = ::std::str::from_utf8(SUBJECT).unwrap();
        let right = &left.to_lowercase();
        b.bytes = left.len() as u64;
        b.iter(|| assert_eq!(Ascii(left), Ascii(right)));
    }

    #[cfg(feature = "nightly")]
    static SUBJECT: &'static [u8] = b"ffoo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz oo bar baz quux herp derp";
test tests::bench_ascii_eq                     ... bench:           9 ns/iter (+/- 6) = 666 MB/s
test tests::bench_ascii_eq_header              ... bench:          19 ns/iter (+/- 16) = 631 MB/s
test tests::bench_ascii_eq_long                ... bench:         672 ns/iter (+/- 551) = 1056 MB/s
...
test tests::bench_unicase_ascii                ... bench:          15 ns/iter (+/- 12) = 400 MB/s
test tests::bench_unicase_ascii_header         ... bench:          19 ns/iter (+/- 7) = 631 MB/s
test tests::bench_unicase_ascii_long           ... bench:         699 ns/iter (+/- 187) = 1015 MB/s

The difference tends to evaporate even in smaller header strings ("Content-Type") and the faster of the long case is often interchanging. Basically, they're close enough that it's just a matter of noise. It's worth remembering that these are for complete matches too, I'm sure things are much closer for a != b. (Stuff like this reminds me that computers are pretty damn amazing these days, I assume it's the branch predictor and prefetcher that play factors in the header and long cases)

But yeah, to be super clear again, I think Ascii::new is worth having (with a debug_assert in the constructor!)

@nox
Copy link
Contributor

nox commented Nov 21, 2017

This PR made phf unable to support unicase 2 because there is no way to use UniCase in static variables anymore.

@seanmonstar seanmonstar restored the unicode branch October 18, 2024 18:34
@seanmonstar seanmonstar deleted the unicode branch October 18, 2024 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants