move to byte based automata #146
This is proving to be quite tricky. I do have this working in my branch (including reducing the UTF-8 automata by reusing common suffixes), but it has introduced some weird problems. Byte based automata are considerably slower when used in the NFA or backtracking engines. For example, matching a […]. I did this by simply adding a […].

Since there are a few unknowns, I'm going to mush on to building the DFA before submitting a PR. Perhaps the right way to make instructions polymorphic will emerge from that.
A lazy DFA is much faster than executing an NFA because it doesn't repeat the work of following epsilon transitions over and over. Instead, it computes states during search and caches them for reuse. We avoid exponential state blow-up by bounding the cache in size. (A rough sketch of this caching idea follows below.) When the DFA isn't powerful enough to fulfill the caller's request (e.g., return sub-capture locations), it still runs to find the boundaries of the match and then falls back to NFA execution on the matched region. The lazy DFA can otherwise execute on every regular expression *except* for regular expressions that contain word boundary assertions (`\b` or `\B`). (They are tricky to implement in the lazy DFA because they are Unicode aware and therefore require multi-byte look-behind/ahead.) The implementation in this PR is based on the implementation in Google's RE2 library.

Adding a lazy DFA was a substantial change and required several modifications:

1. The compiler can now produce both Unicode based programs (still used by the NFA engines) and byte based programs (required by the lazy DFA, but possible to use in the NFA engines too). In byte based programs, UTF-8 decoding is built into the automaton.
2. A new `Exec` type was introduced to implement the logic for compiling and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully track the impact of various optimizations.
5. A new `HACKING.md` guide has been added that gives a high-level design overview of this crate.

Other changes in this commit include:

1. Protection against stack overflows. All places that once required recursion have now either acquired a bound or have been converted to using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes `memchr2` and `memchr3` optimizations.
3. Add PCRE benchmarks using the Rust `pcre` bindings.

Closes #66, #146.
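To make the caching idea concrete, here is a minimal, self-contained sketch. All names and types here are hypothetical, not the crate's real internals: a DFA state is an interned set of NFA states, transitions are computed per byte on first use and memoized, and the number of cached states is bounded.

```rust
use std::collections::HashMap;

// Hypothetical sketch of a lazy DFA's state cache (not the regex crate's
// actual implementation).

type NfaStateSet = Vec<usize>; // sorted, deduped set of NFA state ids
type DfaStateId = usize;

struct LazyDfa {
    cache: HashMap<NfaStateSet, DfaStateId>, // interned DFA states
    sets: Vec<NfaStateSet>,                  // DfaStateId -> NFA state set
    trans: Vec<[Option<DfaStateId>; 256]>,   // filled in lazily, per byte
    max_states: usize,                       // cache bound
}

impl LazyDfa {
    fn new(max_states: usize) -> LazyDfa {
        LazyDfa {
            cache: HashMap::new(),
            sets: Vec::new(),
            trans: Vec::new(),
            max_states,
        }
    }

    /// Intern a set of NFA states as a DFA state. Returns None when the
    /// cache bound is hit; a real engine would wipe the cache and resume
    /// (or fall back to the NFA if it keeps thrashing).
    fn intern(&mut self, set: NfaStateSet) -> Option<DfaStateId> {
        if let Some(&id) = self.cache.get(&set) {
            return Some(id);
        }
        if self.sets.len() >= self.max_states {
            return None;
        }
        let id = self.sets.len();
        self.cache.insert(set.clone(), id);
        self.sets.push(set);
        self.trans.push([None; 256]);
        Some(id)
    }

    /// Follow the transition out of `state` on `byte`. `step` computes the
    /// NFA states reachable on that byte (epsilon closure included); it runs
    /// at most once per (state, byte) pair -- exactly the work a plain NFA
    /// simulation would otherwise repeat over and over.
    fn next<F>(&mut self, state: DfaStateId, byte: u8, step: F) -> Option<DfaStateId>
    where
        F: Fn(&NfaStateSet, u8) -> NfaStateSet,
    {
        if let Some(id) = self.trans[state][byte as usize] {
            return Some(id); // cached: no epsilon-closure work repeated
        }
        let next_set = step(&self.sets[state], byte);
        let id = self.intern(next_set)?;
        self.trans[state][byte as usize] = Some(id);
        Some(id)
    }
}

fn main() {
    let mut dfa = LazyDfa::new(1024);
    // Toy "NFA" for demonstration: from state s on byte b, go to (s + b) % 4.
    let step = |set: &NfaStateSet, b: u8| {
        let mut next: NfaStateSet = set.iter().map(|&s| (s + b as usize) % 4).collect();
        next.sort_unstable();
        next.dedup();
        next
    };
    let mut cur = dfa.intern(vec![0]).unwrap();
    for &b in b"abc" {
        cur = dfa.next(cur, b, &step).expect("cache bound hit");
    }
    println!("ended in DFA state {}", cur);
}
```

The key property the sketch illustrates is that after a state's transition on a given byte has been computed once, every subsequent visit is a single table lookup, while the bounded `max_states` keeps the worst-case memory use under control.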
Done in 2aa1727
A prerequisite to a faster regex engine and a DFA (see #66) is compiling UTF-8 decoding into the regex automaton. The reason this is necessary, especially for a DFA, is so that state transitions can proceed one byte at a time, avoiding large character maps in each state transition. We preserve the invariant that all match boundaries returned fall on UTF-8 code unit sequence boundaries by virtue of the fact that the automata exclusively match valid UTF-8 encodings.
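To make the contrast concrete, here is a hypothetical pair of instruction shapes (illustrative only; the crate's actual instruction types differ): a Unicode program can carry large per-state character maps, while a byte program steps through one small byte range at a time, which is what makes dense DFA transition tables practical.

```rust
#![allow(dead_code)]

// Hypothetical instruction shapes, for illustration only.

enum UnicodeInst {
    /// Match any codepoint in one of these (possibly many) ranges --
    /// a potentially large character map attached to a single state.
    Ranges(Vec<(char, char)>),
    Split(usize, usize),
    Match,
}

enum ByteInst {
    /// Match one byte in [start, end] and jump to `goto` -- a single
    /// small range per transition, indexable by the input byte.
    Bytes { start: u8, end: u8, goto: usize },
    Split(usize, usize),
    Match,
}

fn main() {
    // `Я` (U+042F) encodes as the two bytes D0 AF, so a byte program
    // matches it with two chained Bytes instructions; UTF-8 decoding is
    // thereby built into the automaton itself.
    let program = vec![
        ByteInst::Bytes { start: 0xD0, end: 0xD0, goto: 1 },
        ByteInst::Bytes { start: 0xAF, end: 0xAF, goto: 2 },
        ByteInst::Match,
    ];
    assert_eq!(program.len(), 3);
}
```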
This is also a prerequisite to providing byte-based regexes that can match on a `&[u8]` instead of requiring a `&str`. See #85.

There are a few tricky components to this, chief among them converting ranges of Unicode codepoints into ranges of UTF-8 byte sequences, which is handled by the `utf8_ranges` crate.