Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panics parsing regex with whitespace in extended mode #349

Conversation

robinst
Copy link
Contributor

@robinst robinst commented Mar 20, 2017

The added tests fail without the fix like this:

---- parser::tests::ignore_space_escape_hex2 stdout ----
	thread 'parser::tests::ignore_space_escape_hex2' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 10, surround: "x 5 3", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_escape_hex stdout ----
	thread 'parser::tests::ignore_space_escape_hex' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 12, surround: "{ 5 3 }", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_ascii_classes stdout ----
	thread 'parser::tests::ignore_space_ascii_classes' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 5, surround: "(?x)[ [ : ", kind: UnsupportedClassChar('[') }', src/libcore/result.rs:860
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- parser::tests::ignore_space_escape_octal stdout ----
	thread 'parser::tests::ignore_space_escape_octal' panicked at 'valid octal number', src/libcore/option.rs:785

---- parser::tests::ignore_space_escape_unicode_name stdout ----
	thread 'parser::tests::ignore_space_escape_unicode_name' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: "Y i }", kind: UnrecognizedUnicodeClass(" Y i") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_repeat_counted stdout ----
	thread 'parser::tests::ignore_space_repeat_counted' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: ", 1 0 }", kind: InvalidBase10("1 0") }', src/libcore/result.rs:860

The reason for the panics is that bump_get would ignore space when
walking the characters, but then keep the spaces in the returned String.

Found using cargo-fuzz.

The added tests fail without the fix like this:

    ---- parser::tests::ignore_space_escape_hex2 stdout ----
    	thread 'parser::tests::ignore_space_escape_hex2' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 10, surround: "x 5 3", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

    ---- parser::tests::ignore_space_escape_hex stdout ----
    	thread 'parser::tests::ignore_space_escape_hex' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 12, surround: "{ 5 3 }", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

    ---- parser::tests::ignore_space_ascii_classes stdout ----
    	thread 'parser::tests::ignore_space_ascii_classes' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 5, surround: "(?x)[ [ : ", kind: UnsupportedClassChar('[') }', src/libcore/result.rs:860
    note: Run with `RUST_BACKTRACE=1` for a backtrace.

    ---- parser::tests::ignore_space_escape_octal stdout ----
    	thread 'parser::tests::ignore_space_escape_octal' panicked at 'valid octal number', src/libcore/option.rs:785

    ---- parser::tests::ignore_space_escape_unicode_name stdout ----
    	thread 'parser::tests::ignore_space_escape_unicode_name' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: "Y i }", kind: UnrecognizedUnicodeClass(" Y i") }', src/libcore/result.rs:860

    ---- parser::tests::ignore_space_repeat_counted stdout ----
    	thread 'parser::tests::ignore_space_repeat_counted' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: ", 1 0 }", kind: InvalidBase10("1 0") }', src/libcore/result.rs:860

The reason for the panics is that `bump_get` would ignore space when
walking the characters, but then keep the spaces in the returned String.

Found using cargo-fuzz.
@robinst
Copy link
Contributor Author

robinst commented Mar 20, 2017

The fuzz script is here (not sure if you would want to merge that or not): master...robinst:add-cargo-fuzz-script

You can run it using cargo install cargo-fuzz && cargo fuzz run fuzzer_script_parse.

The artifact that it returned was this: m:(?xxxxxxxxxxxxxms)mmm\x00\x01\x00.\x00+@-\x00\x0a\x00\x10\x10\x10\x10\x10\x10\x10\x10'\x10\x02[--\x0a[\x00\x00$\\\x0a3\x03[\x00:6\x03D\x00. Reduced that and added tests for other cases.


#[test]
fn ignore_space_escape_octal() {
assert_eq!(p(r"(?x)\ 1 2 3"), lit('S'));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit weird that it's allowed to add space between digits of a number, but that seems to be the closest to the current behavior.

@BurntSushi
Copy link
Member

@robinst Thanks for finding this! Sorry it slipped out of my queue, but your blog post caught my attention. :-) Nice work!

I'm not sure the fix is right either. Does this also apply to thinks like \p{G r e ek}? We should probably look at how other regex engines handle verbose mode.

@robinst
Copy link
Contributor Author

robinst commented Mar 30, 2017

I'm not sure the fix is right either. Does this also apply to thinks like \p{G r e ek}?

Yes, and things like [ [ : u p p e r : ] ], see the test cases. I'm not sure about it either.

Maybe whitespace should only be allowed between logical groups of characters. For example, it should not be allowed within a number or within a text identifier. Here's what other engines do:

Oniguruma: (?x) \ x53, (?x) \x 53, (?x) \x5 3 all compile but don't match S

Perl behaves the same way, checked with perl -e 'print "matches" if "S" =~ /(?x) \ x53/'

So at least for \x they don't seem to allow any space in the whole sequence. Even \ d doesn't match a digit, whereas regex-syntax does.

@BurntSushi
Copy link
Member

Thinking about this a bit more, it feels like we shouldn't allow arbitrary whitespace in arbitrary syntax. Maybe things like \p{G r e e k} should produce a syntax error?

robinst added a commit to robinst/regex that referenced this pull request Apr 7, 2017
Instead of ignoring space in all the bump/peek methods (as proposed in
pull request rust-lang#349), have an explicit `ignore_space` method that can be
used in places where space/comments should be allowed.

This makes parsing a bit stricter than before as well.
@robinst
Copy link
Contributor Author

robinst commented Apr 7, 2017

Thinking about this a bit more, it feels like we shouldn't allow arbitrary whitespace in arbitrary syntax.

Agreed. I've prepared a different pull request here: #354

bors added a commit that referenced this pull request May 20, 2017
…e-strict, r=BurntSushi

Fix panics with whitespace in extended mode by being more strict

Instead of ignoring space in all the bump/peek methods (as proposed in
pull request #349), have an explicit `ignore_space` method that can be
used in places where space/comments should be allowed.

This makes parsing a bit stricter than before as well.
@BurntSushi
Copy link
Member

I decided to go with #354 over this one. Thanks so much!

@BurntSushi BurntSushi closed this May 20, 2017
@robinst robinst deleted the fix-panics-parsing-regex-with-extended-mode branch May 22, 2017 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants