Unexpected match failure #234

Closed
SeanRBurton opened this issue May 18, 2016 · 6 comments

@SeanRBurton
Contributor

I'm not sure if I'm doing something wrong, but I think this should match. The issue seems to be that '\u{f1}' is not valid ASCII. The docs don't make clear what the behaviour should be here, but in re_builder.rs the byte builder is defined such that 'only_utf8' is false.

extern crate regex;

fn main() {
    let re = regex::bytes::Regex::new("(?su:.)(?s-u:.)(?su:.)").unwrap();
    let s = "\u{decb1}\u{f1}\u{d70ab}";
    // prints 'None'
    println!("{:?}", re.captures(s.as_bytes()));
}

Thanks

@BurntSushi
Member

That looks correct to me. I haven't actually tried it, but since \u{F1} isn't ASCII, its UTF-8 encoding is more than 1 byte (probably 2). Therefore, your middle (?s-u:.) would match the first byte in \u{F1}'s UTF-8 encoding, and your last (?su:.) would try to match the next byte, which would of course be invalid UTF-8 as it would be a lone continuation byte.
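
The byte layout behind this explanation can be checked with the standard library alone. The following is a minimal sketch, not code from the regex crate; the helper names `utf8_bytes` and `is_continuation_byte` are made up for illustration.

```rust
/// Hypothetical helper: the UTF-8 encoding of a single char as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: UTF-8 continuation bytes have the bit pattern 10xxxxxx.
fn is_continuation_byte(b: u8) -> bool {
    b & 0b1100_0000 == 0b1000_0000
}

fn main() {
    // '\u{f1}' is outside ASCII, so it needs more than one byte in UTF-8.
    let encoded = utf8_bytes('\u{f1}');
    println!("encoding of U+00F1: {:02X?}", encoded);

    // If a byte-oriented `.` consumes only the first byte, this is what the
    // next Unicode-aware `.` would have to start matching at.
    let rest = &encoded[1..];
    println!("next byte is a continuation byte: {}", is_continuation_byte(rest[0]));
    println!("remainder is invalid UTF-8: {}", std::str::from_utf8(rest).is_err());
}
```

U+00F1 encodes as the two bytes 0xC3 0xB1, so once `(?s-u:.)` has consumed 0xC3, the final `(?su:.)` is left facing the lone continuation byte 0xB1 and cannot match.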

@SeanRBurton
Contributor Author

SeanRBurton commented May 19, 2016

Thanks, I should have probably been able to figure that out. ;)

How about running "(?u:\u{705f}){1,}" against the string "\u{705f}"?

That should match, right?

@BurntSushi
Member

Yup, it should, and it looks like it doesn't in 0.1.70 but it does in 0.1.69, which means 0.1.70 contains a regression.

BurntSushi added the bug label May 19, 2016

@SeanRBurton
Contributor Author

The responsible commit is:

[37b6d31] Reintroduce the reverse suffix literal optimization.

@BurntSushi
Member

@SeanRBurton Thank you for tracking that down! Should be much easier to fix now.

@BurntSushi
Member

OK, I think the cause is actually in regex-syntax. It is computing suffix literals incorrectly in some cases. I should have a fix out by end of today.

SeanRBurton pushed a commit to SeanRBurton/regex that referenced this issue May 20, 2016
It turns out that we weren't computing suffix literals correctly in all
cases. In particular, the bytes from a Unicode character were being
reversed.
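
To illustrate why reversed bytes would make "(?u:\u{705f}){1,}" fail against "\u{705f}", here is a rough sketch using only the standard library. It is not the regex-syntax code and does not reproduce the actual optimization; `suffix_prefilter` is a hypothetical stand-in for a check that rejects any haystack that does not contain the literal a match is required to end with.

```rust
/// Hypothetical stand-in for a suffix-literal prefilter: returns true if the
/// haystack contains the required suffix literal somewhere. An empty literal
/// rules nothing out.
fn suffix_prefilter(haystack: &[u8], suffix: &[u8]) -> bool {
    suffix.is_empty() || haystack.windows(suffix.len()).any(|w| w == suffix)
}

fn main() {
    let haystack = "\u{705f}".as_bytes();

    // The literal as it should be extracted: the char's UTF-8 bytes in order.
    let correct: Vec<u8> = haystack.to_vec();
    // The literal as described in the commit message: the same bytes reversed.
    let reversed: Vec<u8> = haystack.iter().rev().cloned().collect();

    println!("haystack bytes:   {:02X?}", haystack);
    println!("correct literal:  {:02X?} -> candidate: {}", correct, suffix_prefilter(haystack, &correct));
    println!("reversed literal: {:02X?} -> candidate: {}", reversed, suffix_prefilter(haystack, &reversed));
}
```

With the bytes in order (0xE7 0x81 0x9F) the haystack passes the check; with them reversed (0x9F 0x81 0xE7) it is rejected before the regex itself is ever run, which is consistent with the spurious non-match reported above.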