Unexpected match failure #234

SeanRBurton · 2016-05-18T23:37:02Z

I'm not sure if I'm doing something wrong, but I think this should match. It seems the issue is that '\u{f1}' is not valid ASCII, and it's unclear from the docs what the behaviour should be here, but in re_builder.rs the byte builder is defined such that 'only_utf8' is false.

extern crate regex;

fn main() {
    let re = regex::bytes::Regex::new("(?su:.)(?s-u:.)(?su:.)").unwrap();
    let s = "\u{decb1}\u{f1}\u{d70ab}";
    // prints 'None'
    println!("{:?}", re.captures(s.as_bytes()));
}

Thanks

BurntSushi · 2016-05-19T00:43:16Z

That looks correct to me. I haven't actually tried it, but since \u{F1} isn't ASCII, its UTF-8 encoding is more than 1 byte (probably 2). Therefore, your middle (?s-u:.) would match the first byte in \u{F1}'s UTF-8 encoding, and your last (?su:.) would try to match the next byte, which would of course be invalid UTF-8 as it would be a lone continuation byte.
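
To make the byte layout concrete, here is a minimal sketch using only the standard library (no regex crate). The helper names `utf8_lens` and `is_continuation` are made up for illustration. It checks that \u{F1} encodes as the two bytes C3 B1, so once the first `(?su:.)` has consumed four bytes and `(?s-u:.)` has consumed C3, the next byte is a lone continuation byte and no Unicode character can start there.

```rust
/// Byte length of each character's UTF-8 encoding (illustrative helper).
fn utf8_lens(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_continuation(b: u8) -> bool {
    b & 0xC0 == 0x80
}

fn main() {
    let s = "\u{decb1}\u{f1}\u{d70ab}";
    let bytes = s.as_bytes();

    // The three characters take 4, 2 and 4 bytes respectively.
    assert_eq!(utf8_lens(s), vec![4, 2, 4]);

    // U+00F1 sits at offsets 4..6 and encodes as C3 B1.
    assert_eq!(&bytes[4..6], &[0xC3u8, 0xB1][..]);

    // After 4 bytes (first character) plus 1 raw byte (0xC3),
    // the next position holds 0xB1, a continuation byte.
    assert!(is_continuation(bytes[5]));

    // A continuation byte on its own is not valid UTF-8.
    assert!(std::str::from_utf8(&bytes[5..6]).is_err());

    println!("no valid character starts at offset 5");
}
```

The same reasoning applies to the other possible start offsets: a match starting at \u{F1} would have its byte-wise dot consume the first byte of the last character, again leaving continuation bytes for the final `(?su:.)`.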

SeanRBurton · 2016-05-19T12:55:54Z

Thanks, I should have probably been able to figure that out. ;)

How about running "(?u:\u{705f}){1,}" against the string "\u{705f}"?

That should match right?

BurntSushi · 2016-05-19T13:14:58Z

Yup, it should, and it looks like it doesn't in 0.1.70 but it does in 0.1.69, which means 0.1.70 contains a regression.

SeanRBurton · 2016-05-19T14:23:08Z

The responsible commit is:

[37b6d31] Reintroduce the reverse suffix literal optimization.

BurntSushi · 2016-05-19T14:24:16Z

@SeanRBurton Thank you for tracking that down! Should be much easier to fix now.

BurntSushi · 2016-05-19T17:25:37Z

OK, I think the cause is actually in regex-syntax. It is computing suffix literals incorrectly in some cases. I should have a fix out by end of today.

Fixes #234.

It turns out that we weren't computing suffix literals correctly in all cases. In particular, the bytes from a Unicode character were being reversed.
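
As a rough illustration of why reversed bytes would break matching (this is not the regex-syntax code; `suffix_literal` and `reversed_suffix_literal` are hypothetical helpers), the sketch below builds the UTF-8 bytes of \u{705f} in order and in reverse. Only the in-order bytes are an actual suffix of the haystack from the report, so a search that looks for the reversed literal first would presumably never find a candidate position.

```rust
/// Hypothetical helper: the suffix literal for a single character,
/// i.e. its UTF-8 bytes in their natural order.
fn suffix_literal(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical buggy variant: the same bytes, but reversed.
fn reversed_suffix_literal(c: char) -> Vec<u8> {
    let mut lit = suffix_literal(c);
    lit.reverse();
    lit
}

fn main() {
    let haystack = "\u{705f}".as_bytes();
    let good = suffix_literal('\u{705f}');
    let bad = reversed_suffix_literal('\u{705f}');

    // U+705F encodes as the three bytes E7 81 9F.
    assert_eq!(good, vec![0xE7, 0x81, 0x9F]);
    assert_eq!(bad, vec![0x9F, 0x81, 0xE7]);

    // The in-order literal is a suffix of the haystack; the reversed one is not.
    assert!(haystack.ends_with(&good));
    assert!(!haystack.ends_with(&bad));

    println!("good: {:02X?}, bad: {:02X?}", good, bad);
}
```

For ASCII-only literals each character is a single byte, so this kind of per-character reversal would go unnoticed there; it only shows up with multi-byte characters such as the one in the report.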

BurntSushi added the bug label May 19, 2016

BurntSushi closed this as completed in f9af58c May 20, 2016

BurntSushi added a commit that referenced this issue May 20, 2016

Merge pull request #236 from rust-lang-nursery/fix-234

792bb30

Fixes #234.

SeanRBurton pushed a commit to SeanRBurton/regex that referenced this issue May 20, 2016

Fixes rust-lang#234.

42322bb

It turns out that we weren't computing suffix literals correctly in all cases. In particular, the bytes from a Unicode character were being reversed.

SeanRBurton pushed a commit to SeanRBurton/regex that referenced this issue May 20, 2016

Fixes rust-lang#234.

310ae46

It turns out that we weren't computing suffix literals correctly in all cases. In particular, the bytes from a Unicode character were being reversed.
