add Unicode case folding support #14

seanmonstar · 2016-07-01T00:21:30Z

No description provided.

Ryman · 2016-07-01T00:37:24Z

src/lib.rs

@@ -110,7 +171,7 @@ macro_rules! from_impl {
    ($from:ty => $to:ty; $by:ident) => (
        impl<'a> From<$from> for UniCase<$to> {
            fn from(s: $from) -> Self {
-                UniCase(s.$by())
+                UniCase::new(s.$by())


This could be quite a hidden cost in generic code, since it incurs a full scan of a potentially large string (whereas a failed equality test should be cheap). I think it would be better to require the user to opt in to the potential ASCII optimisation.

I suppose the From impls could just jump straight to Encoding::Unicode, is that what you mean?

seanmonstar · 2016-07-01T00:56:28Z

@Ryman do you find value in there being UniCase::ascii("foo") and UniCase::unicode("foo") constructors, that skip the is_ascii check of the new constructor?

Then someone could choose to skip the check if they were certain it was one or the other.

Ryman · 2016-07-01T01:01:36Z

UniCase::ascii(..) seems nice compared to having to import the Ascii type!

Perhaps it's overkill, but if you make the Ascii constructor private except through that then you could add a debug_assert!(input.is_ascii()) to the constructor.
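A minimal sketch of that suggestion, assuming a stand-in `Ascii` wrapper with a private field rather than the crate's actual definition: the only way to build the value is through `Ascii::new`, which checks the input in debug builds and costs nothing extra in release builds.

```rust
/// Hypothetical stand-in for the crate's `Ascii` type; the field is private,
/// so callers outside this module must go through `Ascii::new`.
#[derive(Debug)]
pub struct Ascii<S>(S);

impl<S: AsRef<str>> Ascii<S> {
    /// Wraps `s` without scanning it in release builds.
    /// In debug builds, a non-ASCII input triggers a panic.
    pub fn new(s: S) -> Self {
        debug_assert!(
            s.as_ref().is_ascii(),
            "Ascii::new called with non-ASCII input"
        );
        Ascii(s)
    }
}

impl<S1: AsRef<str>, S2: AsRef<str>> PartialEq<Ascii<S2>> for Ascii<S1> {
    /// Both sides are known to be ASCII, so no variant check is needed
    /// before the case-insensitive comparison.
    fn eq(&self, other: &Ascii<S2>) -> bool {
        self.0.as_ref().eq_ignore_ascii_case(other.0.as_ref())
    }
}
```

With this shape, `Ascii::new("foobar") == Ascii::new("FOOBAR")` holds, and a mistaken non-ASCII input is caught during testing rather than silently compared bytewise.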

seanmonstar · 2016-07-01T01:04:41Z

I had added the Ascii type because knowing beforehand that both are ASCII can be a noticeable speed improvement. The match in UniCase.eq made the ASCII benchmarks 2x slower (probably less noticeable on bigger inputs).

So I created the Ascii type so something like hyper can use Ascii<Cow<'static, str>> and skip the if both_are_ascii branch.

Ryman · 2016-07-01T01:20:14Z

I'm surprised that it would take a 2x perf hit (unless you mean single-char inputs, perhaps), but if the type is useful, perhaps limiting its construction to a function as you've done is the right thing to do.

seanmonstar · 2016-07-06T01:54:08Z

Here are the benchmarks, to show what I mean:

test ascii::tests::bench_ascii_eq              ... bench:           6 ns/iter (+/- 0) = 1000 MB/s
test tests::bench_unicase_ascii                ... bench:          11 ns/iter (+/- 9) = 545 MB/s
test unicode::tests::bench_ascii_folding       ... bench:         204 ns/iter (+/- 3) = 34 MB/s
test unicode::tests::bench_simple_case_folding ... bench:         267 ns/iter (+/- 8) = 52 MB/s

1. Ascii, which goes straight to eq_ignore_case.
2. UniCase::new("foo bar"), which internally becomes UniCase(Encoding::Ascii(Ascii("foo bar"))), means that before it can eq_ignore_case, it has to check that both itself and the other are Encoding::Ascii(..) variants, hence the 11ns/iter instead of 6ns/iter.
3. UniCase::unicode("foo bar") shows the cost of the bigger lookup table and dealing with chars.
4. UniCase::new(some_unicode_str) as another comparison.
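The variant check described above can be sketched roughly as follows. This is an illustration under assumptions, not the crate's code: the `Encoding` enum here holds the string directly, and the non-ASCII path uses std's `char::to_lowercase` as a stand-in, which is not the same as Unicode case folding.

```rust
/// Hypothetical two-variant encoding tag chosen at construction time.
enum Encoding<S> {
    Ascii(S),
    Unicode(S),
}

struct UniCase<S>(Encoding<S>);

impl<S: AsRef<str>> UniCase<S> {
    /// Scans the input once to decide which variant to use.
    fn new(s: S) -> Self {
        if s.as_ref().is_ascii() {
            UniCase(Encoding::Ascii(s))
        } else {
            UniCase(Encoding::Unicode(s))
        }
    }

    fn as_str(&self) -> &str {
        match &self.0 {
            Encoding::Ascii(s) | Encoding::Unicode(s) => s.as_ref(),
        }
    }
}

impl<S: AsRef<str>> PartialEq for UniCase<S> {
    fn eq(&self, other: &Self) -> bool {
        match (&self.0, &other.0) {
            // Fast path: only taken after checking that BOTH sides are ASCII.
            // This extra branch is what a bare `Ascii` wrapper avoids.
            (Encoding::Ascii(a), Encoding::Ascii(b)) => {
                a.as_ref().eq_ignore_ascii_case(b.as_ref())
            }
            // Slow path: compare char by char. `to_lowercase` is only a
            // placeholder for a real case-folding table.
            _ => {
                let a = self.as_str().chars().flat_map(char::to_lowercase);
                let b = other.as_str().chars().flat_map(char::to_lowercase);
                a.eq(b)
            }
        }
    }
}
```

The ASCII fast path is the same comparison in both designs; the difference being measured is the `match` on two enum tags that precedes it.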

Ryman · 2016-07-06T04:35:34Z

Sorry that I wasn't clear enough, but I was trying to say that if there is a benchmarked difference for useful inputs then having Ascii::new available seems like a legitimately good thing.

For curiosity's sake, I checked out and expanded the benchmarks a bit though (assuming they're the ones on the branch):

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_unicase_ascii(b: &mut ::test::Bencher) {
        b.bytes = b"foobar".len() as u64;
        let x = UniCase::new("foobar");
        let y = UniCase::new("FOOBAR");
        b.iter(|| assert_eq!(x, y));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_unicase_ascii_header(b: &mut ::test::Bencher) {
        let subject = "Content-Type";
        b.bytes = subject.len() as u64;
        let x = UniCase::new(subject);
        let y = UniCase::new(subject.to_lowercase());
        b.iter(|| assert_eq!(x, y));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_unicase_ascii_long(b: &mut ::test::Bencher) {
        let subject = ::std::str::from_utf8(SUBJECT).unwrap();
        b.bytes = subject.len() as u64;
        let x = UniCase::new(subject);
        let y = UniCase::new(subject.to_lowercase());
        b.iter(|| assert_eq!(x, y));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_ascii_eq(b: &mut ::test::Bencher) {
        use Ascii;
        b.bytes = b"foobar".len() as u64;
        b.iter(|| assert_eq!(Ascii("foobar"), Ascii("FOOBAR")));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_ascii_eq_header(b: &mut ::test::Bencher) {
        use Ascii;
        let left = "Content-Type";
        let right = &left.to_lowercase();
        b.bytes = left.len() as u64;
        b.iter(|| assert_eq!(Ascii(left), Ascii(right)));
    }

    #[cfg(feature = "nightly")]
    #[bench]
    fn bench_ascii_eq_long(b: &mut ::test::Bencher) {
        use Ascii;
        let left = ::std::str::from_utf8(SUBJECT).unwrap();
        let right = &left.to_lowercase();
        b.bytes = left.len() as u64;
        b.iter(|| assert_eq!(Ascii(left), Ascii(right)));
    }

    #[cfg(feature = "nightly")]
    static SUBJECT: &'static [u8] = b"ffoo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz oo bar baz quux herp derp";

test tests::bench_ascii_eq                     ... bench:           9 ns/iter (+/- 6) = 666 MB/s
test tests::bench_ascii_eq_header              ... bench:          19 ns/iter (+/- 16) = 631 MB/s
test tests::bench_ascii_eq_long                ... bench:         672 ns/iter (+/- 551) = 1056 MB/s
...
test tests::bench_unicase_ascii                ... bench:          15 ns/iter (+/- 12) = 400 MB/s
test tests::bench_unicase_ascii_header         ... bench:          19 ns/iter (+/- 7) = 631 MB/s
test tests::bench_unicase_ascii_long           ... bench:         699 ns/iter (+/- 187) = 1015 MB/s

The difference tends to evaporate even for small header strings ("Content-Type"), and which of the two long cases is faster often flips between runs. Basically, they're close enough that it's just a matter of noise. It's worth remembering that these are for complete matches too; I'm sure things are much closer for a != b. (Stuff like this reminds me that computers are pretty damn amazing these days. I assume the branch predictor and prefetcher play a part in the header and long cases.)

But yeah, to be super clear again, I think Ascii::new is worth having (with a debug_assert in the constructor!)

nox · 2017-11-21T11:53:00Z

This PR made phf unable to support unicase 2 because there is no way to use UniCase in static variables anymore.
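To illustrate the constraint: a `static` initializer may only call `const` functions, and a constructor that scans its input at runtime (as `UniCase::new` does with the is_ascii check) is not one. The sketch below is hypothetical and is not the crate's API at the time of this comment, when `const fn` was not yet stable; it assumes a current stable compiler and shows how a constructor that skips the scan could be made usable in a static.

```rust
/// Hypothetical encoding tag, private to the wrapper.
enum Encoding<S> {
    Ascii(S),
    Unicode(S),
}

/// Hypothetical wrapper with a private field, so it cannot be built
/// with a tuple-struct literal from outside its module.
struct UniCase<S>(Encoding<S>);

impl<S> UniCase<S> {
    /// Skips the ASCII scan entirely, so it can be a `const fn`
    /// and therefore be called in a `static` initializer.
    const fn unicode(s: S) -> Self {
        UniCase(Encoding::Unicode(s))
    }

    fn is_ascii_variant(&self) -> bool {
        matches!(self.0, Encoding::Ascii(_))
    }
}

impl<S: AsRef<str>> UniCase<S> {
    /// Runtime constructor: scans the input, so it cannot initialize a static.
    fn new(s: S) -> Self {
        if s.as_ref().is_ascii() {
            UniCase(Encoding::Ascii(s))
        } else {
            UniCase(Encoding::Unicode(s))
        }
    }
}

// Allowed: `unicode` is const. Replacing it with `UniCase::new(..)` here
// would be rejected by the compiler.
static KEY: UniCase<&'static str> = UniCase::unicode("Content-Type");
```

The trade-off in this sketch is that a statically constructed key always takes the Unicode comparison path, even when its contents happen to be ASCII.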

seanmonstar added 2 commits June 30, 2016 17:12

add Unicode case folding support

196e655

fix deploying docs

7f5a18d

Ryman reviewed Jul 1, 2016

review updates

5c23917

seanmonstar merged commit 0e6745b into master Jul 7, 2016

seanmonstar deleted the unicode branch July 7, 2016 18:12

dashed mentioned this pull request Feb 19, 2017

Make it crystal clear in documentation what this crate is supposed to do #8

Open

Bobo1239 mentioned this pull request Oct 10, 2017

Upgrade UniCase to 2.0 [BREAKING CHANGE] rust-phf/rust-phf#109

Closed

seanmonstar restored the unicode branch October 18, 2024 18:34

seanmonstar deleted the unicode branch October 18, 2024 18:34
