Skip to content

Commit

Permalink
automata: fix word boundary bug
Browse files Browse the repository at this point in the history
This fixes a bug that can occur when:

1. The regex has a Unicode word boundary.
2. The haystack contains some non-ASCII Unicode scalar value.
3. An inner or suffix literal optimization is in play.

Specifically, this provokes a case where a match is detected in one of
the meta engine's ad hoc DFA search routines, but before the match
reaches its correct endpoint, a quit state is entered. (Because DFAs
can't deal with Unicode word boundaries on non-ASCII haystacks.) The
correct thing to do is to return a quit error and let the higher level
logic divert to a different engine, but it was returning the match that
it had found up until that point instead. The match returned is not
technically incorrect in the sense that a match does indeed exist, but
the offsets it reports may be shorter than what the true match actually
is.

So... if a quit state is entered, return an error regardless of whether
a match has been found.

Fixes #1046
  • Loading branch information
BurntSushi committed Oct 3, 2023
1 parent c12a7df commit 9c8796a
Show file tree
Hide file tree
Showing 4 changed files with 26 additions and 24 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,11 @@
TBD
===

* [BUG #1046](https://github.com/rust-lang/regex/issues/1046):
Fix a bug that could result in incorrect match spans when using a Unicode word
boundary and searching non-ASCII strings.


1.9.6 (2023-09-30)
==================
This is a patch release that fixes a panic that can occur when the default
Expand Down
12 changes: 0 additions & 12 deletions regex-automata/src/meta/limited.rs
Original file line number Diff line number Diff line change
Expand Up @@ -69,9 +69,6 @@ pub(crate) fn dfa_try_search_half_rev(
} else if dfa.is_dead_state(sid) {
return Ok(mat);
} else if dfa.is_quit_state(sid) {
if mat.is_some() {
return Ok(mat);
}
return Err(MatchError::quit(input.haystack()[at], at).into());
}
}
Expand Down Expand Up @@ -155,9 +152,6 @@ pub(crate) fn hybrid_try_search_half_rev(
} else if sid.is_dead() {
return Ok(mat);
} else if sid.is_quit() {
if mat.is_some() {
return Ok(mat);
}
return Err(MatchError::quit(input.haystack()[at], at).into());
}
}
Expand Down Expand Up @@ -209,9 +203,6 @@ fn dfa_eoi_rev(
let pattern = dfa.match_pattern(*sid, 0);
*mat = Some(HalfMatch::new(pattern, sp.start));
} else if dfa.is_quit_state(*sid) {
if mat.is_some() {
return Ok(());
}
return Err(MatchError::quit(byte, sp.start - 1));
}
} else {
Expand Down Expand Up @@ -246,9 +237,6 @@ fn hybrid_eoi_rev(
let pattern = dfa.match_pattern(cache, *sid, 0);
*mat = Some(HalfMatch::new(pattern, sp.start));
} else if sid.is_quit() {
if mat.is_some() {
return Ok(());
}
return Err(MatchError::quit(byte, sp.start - 1));
}
} else {
Expand Down
12 changes: 0 additions & 12 deletions regex-automata/src/meta/stopat.rs
Original file line number Diff line number Diff line change
Expand Up @@ -81,9 +81,6 @@ pub(crate) fn dfa_try_search_half_fwd(
} else if dfa.is_dead_state(sid) {
return Ok(mat.ok_or(at));
} else if dfa.is_quit_state(sid) {
if mat.is_some() {
return Ok(mat.ok_or(at));
}
return Err(MatchError::quit(input.haystack()[at], at).into());
} else {
// Ideally we wouldn't use a DFA that specialized start states
Expand Down Expand Up @@ -122,9 +119,6 @@ pub(crate) fn hybrid_try_search_half_fwd(
} else if sid.is_dead() {
return Ok(mat.ok_or(at));
} else if sid.is_quit() {
if mat.is_some() {
return Ok(mat.ok_or(at));
}
return Err(MatchError::quit(input.haystack()[at], at).into());
} else {
// We should NEVER get an unknown state ID back from
Expand Down Expand Up @@ -162,9 +156,6 @@ fn dfa_eoi_fwd(
let pattern = dfa.match_pattern(*sid, 0);
*mat = Some(HalfMatch::new(pattern, sp.end));
} else if dfa.is_quit_state(*sid) {
if mat.is_some() {
return Ok(());
}
return Err(MatchError::quit(b, sp.end));
}
}
Expand Down Expand Up @@ -201,9 +192,6 @@ fn hybrid_eoi_fwd(
let pattern = dfa.match_pattern(cache, *sid, 0);
*mat = Some(HalfMatch::new(pattern, sp.end));
} else if sid.is_quit() {
if mat.is_some() {
return Ok(());
}
return Err(MatchError::quit(b, sp.end));
}
}
Expand Down
18 changes: 18 additions & 0 deletions testdata/regression.toml
Original file line number Diff line number Diff line change
Expand Up @@ -782,3 +782,21 @@ match-kind = "all"
search-kind = "overlapping"
unicode = true
utf8 = true

# This tests that the PikeVM and the meta regex agree on a particular regex.
# This test previously failed when the ad hoc engines inside the meta engine
# did not handle quit states correctly. Namely, the Unicode word boundary here
# combined with a non-ASCII codepoint provokes the quit state. The ad hoc
# engines were previously returning a match even after entering the quit state
# if a match had been previously detected, but this is incorrect. The reason
# is that if a quit state is found, then the search must give up *immediately*
# because it prevents the search from finding the "proper" leftmost-first
# match. If it instead returns a match that has been found, it risks reporting
# an improper match, as it did in this case.
#
# See: https://github.com/rust-lang/regex/issues/1046
[[test]]
name = "non-prefix-literal-quit-state"
regex = '.+\b\n'
haystack = "β77\n"
matches = [[0, 5]]

0 comments on commit 9c8796a

Please sign in to comment.