-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add Unicode case folding support #14
Conversation
@@ -110,7 +171,7 @@ macro_rules! from_impl { | |||
($from:ty => $to:ty; $by:ident) => ( | |||
impl<'a> From<$from> for UniCase<$to> { | |||
fn from(s: $from) -> Self { | |||
UniCase(s.$by()) | |||
UniCase::new(s.$by()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be quite a hidden cost in generic code if it will incur a full scan for a potentially large string (compared to how it should be cheap for failed equality tests). I think it would be better to require the user to opt-in to the potential ASCII optimisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the From
impls could just jump straight to Encoding::Unicode
, is that what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
@Ryman do you find value in there being Then someone could choose to skip the check if they were certain is was one or the other. |
Perhaps it's overkill, but if you make the |
I had added the So I created the |
I'm surprised that it would take a 2x perf hit (unless you mean single char inputs perhaps) but if the type is useful, perhaps limiting it's construction to a function as you've done is the right thing to do. |
Here's the benchmarks, to show what I mean:
|
Sorry that I wasn't clear enough, but I was trying to say that if there is a benchmarked difference for useful inputs then having For curiosity's sake, I checked out and expanded the benchmarks a bit though (assuming they're the ones on the branch): #[cfg(feature = "nightly")]
#[bench]
fn bench_unicase_ascii(b: &mut ::test::Bencher) {
b.bytes = b"foobar".len() as u64;
let x = UniCase::new("foobar");
let y = UniCase::new("FOOBAR");
b.iter(|| assert_eq!(x, y));
}
#[cfg(feature = "nightly")]
#[bench]
fn bench_unicase_ascii_header(b: &mut ::test::Bencher) {
let subject = "Content-Type";
b.bytes = subject.len() as u64;
let x = UniCase::new(subject);
let y = UniCase::new(subject.to_lowercase());
b.iter(|| assert_eq!(x, y));
}
#[cfg(feature = "nightly")]
#[bench]
fn bench_unicase_ascii_long(b: &mut ::test::Bencher) {
let subject = ::std::str::from_utf8(SUBJECT).unwrap();
b.bytes = subject.len() as u64;
let x = UniCase::new(subject);
let y = UniCase::new(subject.to_lowercase());
b.iter(|| assert_eq!(x, y));
}
#[cfg(feature = "nightly")]
#[bench]
fn bench_ascii_eq(b: &mut ::test::Bencher) {
use Ascii;
b.bytes = b"foobar".len() as u64;
b.iter(|| assert_eq!(Ascii("foobar"), Ascii("FOOBAR")));
}
#[cfg(feature = "nightly")]
#[bench]
fn bench_ascii_eq_header(b: &mut ::test::Bencher) {
use Ascii;
let left = "Content-Type";
let right = &left.to_lowercase();
b.bytes = left.len() as u64;
b.iter(|| assert_eq!(Ascii(left), Ascii(right)));
}
#[cfg(feature = "nightly")]
#[bench]
fn bench_ascii_eq_long(b: &mut ::test::Bencher) {
use Ascii;
let left = ::std::str::from_utf8(SUBJECT).unwrap();
let right = &left.to_lowercase();
b.bytes = left.len() as u64;
b.iter(|| assert_eq!(Ascii(left), Ascii(right)));
}
#[cfg(feature = "nightly")]
static SUBJECT: &'static [u8] = b"ffoo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz foo bar baz oo bar baz quux herp derp"; test tests::bench_ascii_eq ... bench: 9 ns/iter (+/- 6) = 666 MB/s
test tests::bench_ascii_eq_header ... bench: 19 ns/iter (+/- 16) = 631 MB/s
test tests::bench_ascii_eq_long ... bench: 672 ns/iter (+/- 551) = 1056 MB/s
...
test tests::bench_unicase_ascii ... bench: 15 ns/iter (+/- 12) = 400 MB/s
test tests::bench_unicase_ascii_header ... bench: 19 ns/iter (+/- 7) = 631 MB/s
test tests::bench_unicase_ascii_long ... bench: 699 ns/iter (+/- 187) = 1015 MB/s The difference tends to evaporate even in smaller header strings ("Content-Type") and the faster of the But yeah, to be super clear again, I think |
This PR made phf unable to support unicase 2 because there is no way to use |
No description provided.