Commit
automata: fix subtle DFA performance bug
This commit fixes a subtle *performance* bug in the start state computation. The issue here is rather tricky, but it boils down to the fact that the way look-behind assertions are computed in the start state is not quite equivalent to how they're computed during normal state generation. Namely, in normal state generation, we only compute a look-behind assertion if the NFA actually has one (or one similar to it) somewhere in its graph. If it doesn't, then there's no point in saving whether the assertion is satisfied or not.

Logically speaking, this doesn't matter too much, because if the look-behind assertions don't match up with how they're computed in the start state, a new state will simply be created. Not a huge deal, but wasteful.

The real problem is that the new state will no longer be considered a start state. It will just be like any other normal state. We rely on being able to detect start states at search time to know when to trigger the prefilter. So if we re-generate start states as non-start states, then we may end up not triggering the prefilter. That's bad.

rebar actually caught this bug via the `imported/sherlock/line-boundary-sherlock-holmes` benchmark, which recorded a 20x slowdown due to the prefilter not running. Ouch!

This specifically was caused by the start states unconditionally attaching half-starting word boundary assertions whenever they were satisfied, whereas normal state generation only does this when there is actually a half-starting word boundary assertion in the NFA. So this led to re-generating start states needlessly.

Interestingly, the start state computation was unconditionally attaching all of the different types of look-behind assertions, and thus in theory, this problem already existed under different circumstances. My hypothesis is that it wasn't "as bad" because it was mostly limited to line terminators. But the half-starting word boundary assertion is much more broadly applicable.

We remedy this not only for the half-starting word boundary assertion, but for all others as well. I also did manual mutation testing on this start state computation and found a few branches not covered by tests. We add those tests here. Thanks rebar!
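
The mismatch described above can be sketched with a toy model. This is not the code from this commit: `LookSet`, `StateKey`, `Cache` and every function below are hypothetical names invented for illustration, and the only assumption taken from the message is that a determinized state is identified by its NFA states together with the look-behind assertions recorded as satisfied.

```rust
use std::collections::HashMap;

/// A tiny bitset of look-behind assertions (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct LookSet(u8);

impl LookSet {
    const EMPTY: LookSet = LookSet(0);
    const WORD_START_HALF: LookSet = LookSet(1 << 0);

    fn intersect(self, other: LookSet) -> LookSet {
        LookSet(self.0 & other.0)
    }
}

/// Identity of a determinized state in this toy model: the NFA states it
/// contains plus the look-behind assertions recorded as satisfied.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct StateKey {
    nfa_states: Vec<u32>,
    look_have: LookSet,
}

/// Maps state keys to IDs and remembers which IDs were created as start states.
#[derive(Default)]
struct Cache {
    ids: HashMap<StateKey, usize>,
    is_start: Vec<bool>,
}

impl Cache {
    /// Returns the existing ID for `key`, or creates a new state for it.
    /// The `start` flag only takes effect when a new state is created.
    fn get_or_add(&mut self, key: StateKey, start: bool) -> usize {
        if let Some(&id) = self.ids.get(&key) {
            return id;
        }
        let id = self.is_start.len();
        self.ids.insert(key, id);
        self.is_start.push(start);
        id
    }
}

/// Normal state generation: keep only the satisfied assertions that the NFA
/// actually contains (`look_need`).
fn normal_key(nfa_states: &[u32], satisfied: LookSet, look_need: LookSet) -> StateKey {
    StateKey { nfa_states: nfa_states.to_vec(), look_have: satisfied.intersect(look_need) }
}

/// Start state computation. With `filter == false` it records every satisfied
/// assertion unconditionally (the behavior the message describes as the bug).
fn start_key(nfa_states: &[u32], satisfied: LookSet, look_need: LookSet, filter: bool) -> StateKey {
    let look_have = if filter { satisfied.intersect(look_need) } else { satisfied };
    StateKey { nfa_states: nfa_states.to_vec(), look_have }
}

/// Builds a start state, then reaches the same NFA states again through
/// normal generation, and reports whether the resulting state is still
/// flagged as a start state.
fn regenerated_state_is_start(filter_start_looks: bool) -> bool {
    let look_need = LookSet::EMPTY; // the NFA has no word boundary assertion
    let satisfied = LookSet::WORD_START_HALF; // but one happens to be satisfied
    let nfa_states = [0, 1];

    let mut cache = Cache::default();
    let start = start_key(&nfa_states, satisfied, look_need, filter_start_looks);
    cache.get_or_add(start, true);

    let again = normal_key(&nfa_states, satisfied, look_need);
    let id = cache.get_or_add(again, false);
    cache.is_start[id]
}
```

In this sketch, `regenerated_state_is_start(false)` yields a second, non-start state because the two keys differ, while `regenerated_state_is_start(true)` lands back on the original start state. A search that only runs its prefilter when it sees a start state would skip it in the first case.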
1 parent a448a56, commit 2f824f8
Showing 3 changed files with 111 additions and 37 deletions.