automata: fix subtle DFA performance bug

This commit fixes a subtle *performance* bug in the start state computation. The issue here is rather tricky, but it boils down to the fact that the way the look-behind assertions are computed in the start state is not quite precisely equivalent to how they're computed during normal state generation. Namely, in normal state generation, we only compute look-behind assertions if the NFA actually has one (or one similar to it) in its graph somewhere. If it doesn't, then there's no point in saving whether the assertion is satisfied or not. Logically speaking, this doesn't matter too much, because if the look-around assertions don't match up with how they're computed in the start state, a new state will simply be created. Not a huge deal, but wasteful. The real problem is that the new state will no longer be considered a start state. It will just be like any other normal state. We rely on being able to detect start states at search time to know when to trigger the prefilter. So if we re-generate start states as non-start states, then we may end up not triggering the prefilter. That's bad. rebar actually caught this bug via the `imported/sherlock/line-boundary-sherlock-holmes` benchmark, which recorded a 20x slowdown due to the prefilter not running. Owch! This specifically was caused by the start states unconditionally attaching half-starting word boundary assertions whenever they were satisfied, where as normal state generation only does this when there is actually a half-starting word boundary assertion in the NFA. So this led to re-generating start states needlessly. Interestingly, the start state computation was unconditionally attaching all different types of look-behind assertions, and thus in theory, this problem already existed under different circumstances. My hypothesis is that it wasn't "as bad" because it was mostly limited to line terminators. But the half-starting word boundary assertion is much more broadly applicable. We remedy this not only for the half-starting word boundary assertion, but for all others as well. I also did manual mutation testing in this start state computation and found a few branches not covered by tests. We add those tests here. Thanks rebar!
rust-lang · Oct 9, 2023 · 2f824f8 · 2f824f8
1 parent a448a56
commit 2f824f8
Show file tree

Hide file tree

Showing 3 changed files with 111 additions and 37 deletions.
diff --git a/regex-automata/src/util/determinize/mod.rs b/regex-automata/src/util/determinize/mod.rs
@@ -587,67 +587,95 @@ pub(crate) fn set_lookbehind_from_start(
 ) {
     let rev = nfa.is_reverse();
     let lineterm = nfa.look_matcher().get_line_terminator();
+    let lookset = nfa.look_set_any();
     match *start {
         Start::NonWordByte => {
-            builder.set_look_have(|have| {
-                have.insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::WordByte => {
-            builder.set_is_from_word();
+            if lookset.contains_word() {
+                builder.set_is_from_word();
+            }
         }
         Start::Text => {
-            builder.set_look_have(|have| {
-                have.insert(Look::Start)
-                    .insert(Look::StartLF)
-                    .insert(Look::StartCRLF)
-                    .insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_anchor_haystack() {
+                builder.set_look_have(|have| have.insert(Look::Start));
+            }
+            if lookset.contains_anchor_line() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::StartLF).insert(Look::StartCRLF)
+                });
+            }
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::LineLF => {
             if rev {
-                builder.set_is_half_crlf();
-                builder.set_look_have(|have| have.insert(Look::StartLF));
+                if lookset.contains_anchor_crlf() {
+                    builder.set_is_half_crlf();
+                }
+                if lookset.contains_anchor_line() {
+                    builder.set_look_have(|have| have.insert(Look::StartLF));
+                }
             } else {
-                builder.set_look_have(|have| have.insert(Look::StartCRLF));
+                if lookset.contains_anchor_line() {
+                    builder.set_look_have(|have| have.insert(Look::StartCRLF));
+                }
             }
-            if lineterm == b'\n' {
+            if lookset.contains_anchor_line() && lineterm == b'\n' {
                 builder.set_look_have(|have| have.insert(Look::StartLF));
             }
-            builder.set_look_have(|have| {
-                have.insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::LineCR => {
-            if rev {
-                builder.set_look_have(|have| have.insert(Look::StartCRLF));
-            } else {
-                builder.set_is_half_crlf();
+            if lookset.contains_anchor_crlf() {
+                if rev {
+                    builder.set_look_have(|have| have.insert(Look::StartCRLF));
+                } else {
+                    builder.set_is_half_crlf();
+                }
             }
-            if lineterm == b'\r' {
+            if lookset.contains_anchor_line() && lineterm == b'\r' {
                 builder.set_look_have(|have| have.insert(Look::StartLF));
             }
-            builder.set_look_have(|have| {
-                have.insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::CustomLineTerminator => {
-            builder.set_look_have(|have| have.insert(Look::StartLF));
+            if lookset.contains_anchor_line() {
+                builder.set_look_have(|have| have.insert(Look::StartLF));
+            }
             // This is a bit of a tricky case, but if the line terminator was
             // set to a word byte, then we also need to behave as if the start
             // configuration is Start::WordByte. That is, we need to mark our
             // state as having come from a word byte.
-            if utf8::is_word_byte(lineterm) {
-                builder.set_is_from_word();
-            } else {
-                builder.set_look_have(|have| {
-                    have.insert(Look::WordStartHalfAscii)
-                        .insert(Look::WordStartHalfUnicode)
-                });
+            if lookset.contains_word() {
+                if utf8::is_word_byte(lineterm) {
+                    builder.set_is_from_word();
+                } else {
+                    builder.set_look_have(|have| {
+                        have.insert(Look::WordStartHalfAscii)
+                            .insert(Look::WordStartHalfUnicode)
+                    });
+                }
             }
         }
     }

diff --git a/testdata/line-terminator.toml b/testdata/line-terminator.toml
@@ -38,6 +38,18 @@ unescape = true
 line-terminator = '\xFF'
 utf8 = false
 
+# This tests a tricky case where the line terminator is set to \r. This ensures
+# that the StartLF look-behind assertion is tracked when computing the start
+# state.
+[[test]]
+name = "carriage"
+regex = '(?m)^[a-z]+'
+haystack = 'ABC\rabc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+line-terminator = '\r'
+
 # This tests that we can set the line terminator to a byte corresponding to a
 # word character, and things work as expected.
 [[test]]

diff --git a/testdata/word-boundary-special.toml b/testdata/word-boundary-special.toml
@@ -651,3 +651,37 @@ regex = '\b{end-half}'
 haystack = "b𝛃"
 matches = [[5, 5]]
 unicode = true
+
+# Specialty tests.
+
+# Since \r is special cased in the start state computation (to deal with CRLF
+# mode), this test ensures that the correct start state is computed when the
+# pattern starts with a half word boundary assertion.
+[[test]]
+name = "word-start-half-ascii-carriage"
+regex = '\b{start-half}[a-z]+'
+haystack = 'ABC\rabc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+
+# Since \n is also special cased in the start state computation, this test
+# ensures that the correct start state is computed when the pattern starts with
+# a half word boundary assertion.
+[[test]]
+name = "word-start-half-ascii-linefeed"
+regex = '\b{start-half}[a-z]+'
+haystack = 'ABC\nabc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+
+# Like the carriage return test above, but with a custom line terminator.
+[[test]]
+name = "word-start-half-ascii-customlineterm"
+regex = '\b{start-half}[a-z]+'
+haystack = 'ABC!abc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+line-terminator = '!'