-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement a DFA matcher #66
Comments
I'll do it! @BurntSushi recommends that I look into an existing DFA engine by @carllerche. I'll be starting by getting familiar with the code in this repo and RE2. |
Feel free to ping me on IRC or wherever if you want to talk about carllerche/automaton |
Here's a summary of some research I've done and my intended next steps. TL;DRThe algorithms that @carllerche uses in carllerche/automaton make a different time/space tradeoff than that made by RE2 and rust-lang/regex so far. For now, I recommend that we keep modeling rust-lang/regex after RE2 and use the same method of lazy DFA construction used in RE2 rather than incorporate Carl's work. However, using Carl's work to construct a DFA at compile time with the A Comparison of rust-lang/regex, RE2, and carllerche/automatonI consulted The Dragon Book and Russ Cox's blog series on implementing RE2 for background on algorithms for constructing and simulating automata. I'll refer to algorithms by numbers given to them in the Dragon Book for anyone who wants to take a closer look.
Next StepsI say we keep modeling rust-lang/regex after RE2 and implement lazy DFA construction rather than ahead-of-time DFA construction. I see this as taking two big steps:
Compilation of a regex into VM instructions would be unchanged. However, the ahead-of-time approach could still be useful. Constructing a DFA at compile time with the Questions@carllerche: I know you and @BurntSushi were talking about folding in carllerche/automaton. Am I misunderstanding anything? Does this seem reasonable to you? Also, many thanks to @BurntSushi for being so helpful to this Rust newbie. |
@WillEngler Sounds good to me! w.r.t. to (My new branch has some basic support for picking a matching engine on-the-fly, but the DFA is a bit more complicated since it can be beneficial to run the DFA first and then the NFA to find submatch boundaries. But this doesn't have to be in the first pass! A simple match-only DFA seems like a good initial goal to shoot for.) |
@WillEngler sounds accurate. My goal was to target the case of optimizing runtime at the expense of compile time with the assumption that it is easy to build the regex once at boot so trading off compile time vs. execution time is desirable. |
Great! Then I'll get to work on a match-only DFA. |
DFA SitrepI don't have anything to show for this week. If last week was about learning the theory of DFA construction, this week was about learning the implementation details. I traced and annotated Russ Cox's educational bounded-memory DFA implementation and his production implementation. I also followed @BurntSushi's major refactoring (#91) as I learned more about rustlang/regex's internals. Expect to see some action in WillEngler/regex this week. (Is this proper Github etiquette? I thought it would be helpful to give an update because I haven't shipped anything this week.) |
@WillEngler Updates are really good, particularly when you're still getting your bearing. If you share info with others, then you can benefit from their experience! In any case, sounds like you're on the right track. :D |
@WillEngler Oh, and please don't feel pressure to "ship" stuff! |
@BurntSushi Thanks, I needed that 😄. I saw you contribute so many great additions so quickly in the refactor and I got a little intimidated. But as a beginner, it's unreasonable to expect myself to move that fast. Onwards! |
@BurntSushi I'm going down the rabbit hole of UTF-8 encoding and would like to make sure I'm going in the right direction. As I understand it, each DFA state will need a collection of pointers to other DFA states in order to form a graph. We want to know the number of pointers when we allocate the DFA state. Obviously, one pointer per UTF-8 codepoint isn't viable. RE2's approach is to maintain a 256-bit bitmap during compilation that records borders between character classes that are treated identically. Then, when constructing the DFA engine, it creates a mapping from [0, 255] -> [0, num_character_classes]. This map is used to pick the right pointer when walking/lazily constructing the graph. Here's an example from Russ Cox:
RE2 gets away with only caring about 256 values by compiling multi-byte UTF-8 codepoints into mini-automata. (I think. I haven't scrutinized that part of the code enough to fully grasp it.) Given Rust's approach to Unicode, it doesn't seem natural to compile codepoints down to their bytes. So here is the approach I'm imagining:
I have one major reservation. In RE2, the number of possible character ranges is bounded above at 256. In the approach I described, the upper bound is the number of Unicode codepoints. Are there use cases that would have many (say, > 1000) character classes and cause the size of the pointer array to blow up? And overall, does this approach seem reasonable? |
@WillEngler I was actually thinking about this problem a couple mornings ago, but I hadn't gotten a chance to write up my thoughts. Thank you for bringing it up! My first thought is this: we shouldn't let the way Rust handles strings influence how matching engines are written. In particular, if it is easier to write a faster engine byte-by-byte instead of character-by-character, then we should do that. If we expose the same interface that we do today, then we can assume UTF-8 input (which is rather convenient). Doing this will, as you pointed out, require changing a fair bit so that UTF-8 decoding is built into the automaton generated by the compiler. The RE2 code to do this doesn't look that bad. I think this also means that we could collapse the My second thought is that this could potentially introduce some problems in corner cases. For example, consider the My third thought is that you should not take my thoughts for gospel. I am not saying, "to get a DFA implemented, we must switch to byte matching engines." :-) For example, if it is easier to ignore OK, with that out of the way, I'll actually respond to your thoughts. :-)
Yes! One thing I would consider if I were you is the choice of whether to actually use a pointer or not. My suspicion is that using pointers might require use of On an unrelated note... When I first published
This is my understanding as well. I understand it at a high level. I think I'd need to work through an example or two before I'd really get it.
Hmm. So I think terminology is getting a bit conflated here. The term "character class" has a well-defined meaning in regex land, but here, I think you're using it to refer to a specific implementation detail. (How about, "state character class"?) If so, I think that is a problem. e.g., Consider I haven't been able to come up with a way to work around this without either moving to byte matching or making unacceptable sacrifices in memory usage. :-/ (Go's |
@BurntSushi A lot to chew over here!
I hadn't considered that. The Unicode-unaware behavior seems really unintuitive. I had to go look at RE2 again to convince myself that it would act so strangely.
I'm in complete agreement.
Overloading "character class" is definitely a bad idea :) .
Of course :)
On that note, here is how I think I ought to go forward.
It will be instructive for me to make an experimental feature in the compiler so that I can understand that part of the crate better. Also, it helps that this will be a task with a comparatively low surface area. I can have a small victory to feel chuffed about before tackling the DFA engine proper. Sound reasonable? Miscellaneous thoughts:
Thanks for paying that forward! I always appreciate being corrected on terminology.
Interesting. I'll have to take a look. The Go implementation might be easier for me to use as a reference than the C++ implementation.
Thanks for that suggestion! Tips on how to write effective Rust are always welcome. |
I think everything sounds good. If you wanted to try out your state transition character class idea before byte based matching engines, then I'd support that too. (This will get you to DFA-land faster.) OK, so here are my thoughts on byte based matching engines.
You might choose to ignore (3) and hope that I can do it in a timely manner. :-) |
Sounds good! Ideally, I'll handle case insensitivity myself. But as for doing things in a timely manner, I'm more likely to be the slow one :) |
Btw. have you checked the "Regen" Regex engine? It uses JIT and parallerization, but according to this paper[1], even WITHOUT them, it manages to outspeed RE2 with a considerable marginal. [1] http://www.is.titech.ac.jp/~sassa/papers-written/shinya-regen-postergenkou-PROSYM12.pdf (Sorry, Japanese only, but the abstract, graphs and numbers are understandable.) |
@golddranks No, this is the first I've heard of it. Thanks for the links! I would need to take a lot of time to understand their approach because I don't know the first thing about JIT compilation. So I'm going to make a note to myself to revisit this later. |
@golddranks Thanks! But I fear that without an English description of the work they've done, it will be hard to benefit from it. JITing and parallelization seem like optimizations that come after initial work on a DFA is done. :-) |
Progress ReportI'm chugging away at the byte matching engine. The trickiest part has been deciphering laconic bit-twiddling code like this to figure out what approach RE2 took. I've made some decent progress. I have pieces of a module that can take in a character range and spit out byte ranges that are ready to become instructions. So far, that means splitting ranges along the length of their UTF-8 encoding and further splitting them so that each range shares the same leading bytes. It doesn't work yet. I still need to write up tests and do some serious debugging. And some design changes are probably in order. There is much left to do after that! Here is the order I want to tackle things:
|
@WillEngler Looks great so far! Case folding could get tricky, as you'll need to make modifications to Otherwise, yes, putting off optimizations like caching is good. Make it work. Make it right. Make it fast. :-) |
@WillEngler It looks like rsc also wrote a DFA in Go with byte instructions like in RE2. Here's the UTF-8 automaton code: https://github.com/google/codesearch/blob/master/regexp/utf.go |
@BurntSushi Thanks for bringing that up. I've been banging my head against the C++ version, so it'll be helpful to see it from a different perspective. |
@BurntSushi I said I was hoping to get
working by this Sunday. That was pure hubris. Here's my true status: In RE2, rune* compilation is mostly handled in a big, recursive function with a lot of side effects that mutate the compiler. I was following that approach, but I had a hard time reasoning about all of the state transformations. And when I thought about how I would write tests for it, I cried a little. So I'm taking a different tack now. The module I'm working on exposes one main function: As a result, the rune compiler module consists mostly of pure and almost-pure functions that should (hopefully) be straightforward for me to test and debug. Also, I like how it keeps the innards of compiler.rs away from rune compilation. The downside, of course, is that the indirection of Now I'm working to get rune -> BRTree working and presentable. When I finish, I think that will be a nice milestone for a code review (if you would be so kind @BurntSushi). (*) Ken Thompson and friends like calling Unicode scalar codepoints "runes." I like it, but it isn't commonly used. If you think it will cause more confusion than it's worth, I can not use "rune" in the codebase. |
Sounds great. I've also found RE2's compiler difficult to follow. Pulling out the byte range stuff so it can be tested sounds fantastic. As long as performance stays reasonable (even on large Unicode classes), then I'm happy. Make sure to test it on debug builds too. When I first introduced RE rune: I like rune too (being a gopher myself), but the Rust ecosystem unfortunately doesn't use that terminology, so it's probably best to avoid it in the code. Something like |
Thanks, I wouldn't have thought to give special attention to debug builds. And roger, I'll get rid of |
@BurntSushi: Update: getting really close to having a working The structure of the new module is about where I want it. The one big roadblock left is in some helper functions I've written to test it. As a QuickCheck property, I assert that the transformation from a |
@WillEngler Awesome! For Hmm, looking at the code, the |
@WillEngler Do you have a sense for how big |
@BurntSushi Thanks for the feedback.
Yep, I got that done this afternoon. Now Tree::to_scalar_range() is big and ugly, but it works.
Hehehe
I intended it as an intermediate data structure. Trying to look at it with fresh eyes, I see what you mean about it being heavyweight. I like the idea of encoding it directly in terms of the existing instructions rather than defining an entirely new type that the compiler needs to worry about and then break down. I'm pretty sure you suggested that before, but it's only really clicking now that I'm knee-deep in the problem. I would much rather make big changes than go for a solution that is merely "liveable". So if it sounds good by you, I'd like to gut
It can get real big. More detailed answer forthcoming, but for now I'll say that suffix caching will help a lot. |
I see. I think that's enough of a reason to say, "Yeah, don't use it as an intermediate data structure." (Unless suffix caching significantly bounds the size of the tree.) With that said, you may find it more satisfying to mush on, get the instruction set updated and work on the matching engines. Then circle back around and fix up the |
I'm with you. I got wrapped up and couldn't see the forest for the... well, you know.
As much as I love a good mush, it would drive me nuts to leave this module in a sloppy state. So I'll put off matching for a bit longer. |
@WillEngler As a heads up, I've made some progress on this issue. In particular, a crate for generating UTF-8 ranges from scalar values: http://burntsushi.net/rustdoc/utf8_ranges/ I also have a basic DFA construction working (using utf8-ranges), although it doesn't have zero-width assertions or match priority (because they aren't really necessary in this particular context): https://github.com/BurntSushi/fst/blob/master/src/regex/dfa.rs --- Still lots of work to do before it's in good enough shape for Also, I know @jneem has been doing a lot of work in this space, including possibly getting word boundaries working in a DFA. I'm not sure what their plans are though! |
Yes, I do have word boundaries working in a DFA (and my proof-of-concept fork integrating it into this crate passes all the tests). But I'm taking the offline-compilation approach (maybe similar to the one of @carllerche), and my compilation times are pretty terrible right now, especially after incorporating the |
Thanks for the heads-up @BurntSushi! (And good timing: I've settled in after some big life changes and I'm able to carve out time for Rust again) I'll check out the progress that you and @jneem have made and try to figure out where I can be of help. |
Nope, I was wrong. I still don't have the time to make an impact here. Thanks a million @BurntSushi for guiding me. I regret I didn't contribute much in return. I will return to Rust some day! |
@WillEngler No regrets, please. :-) You actually made quite a large impact. At least for me personally, I learned a lot from our exchanges because you forced me to give more thought to DFAs than I had previously! I look forward to working with you in the future! |
A lazy DFA is much faster than executing an NFA because it doesn't repeat the work of following epsilon transitions over and and over. Instead, it computes states during search and caches them for reuse. We avoid exponential state blow up by bounding the cache in size. When the DFA isn't powerful enough to fulfill the caller's request (e.g., return sub-capture locations), it still runs to find the boundaries of the match and then falls back to NFA execution on the matched region. The lazy DFA can otherwise execute on every regular expression *except* for regular expressions that contain word boundary assertions (`\b` or `\B`). (They are tricky to implement in the lazy DFA because they are Unicode aware and therefore require multi-byte look-behind/ahead.) The implementation in this PR is based on the implementation in Google's RE2 library. Adding a lazy DFA was a substantial change and required several modifications: 1. The compiler can now produce both Unicode based programs (still used by the NFA engines) and byte based programs (required by the lazy DFA, but possible to use in the NFA engines too). In byte based programs, UTF-8 decoding is built into the automaton. 2. A new `Exec` type was introduced to implement the logic for compiling and choosing the right engine to use on each search. 3. Prefix literal detection was rewritten to work on bytes. 4. Benchmarks were overhauled and new ones were added to more carefully track the impact of various optimizations. 5. A new `HACKING.md` guide has been added that gives a high-level design overview of this crate. Other changes in this commit include: 1. Protection against stack overflows. All places that once required recursion have now either acquired a bound or have been converted to using a stack on the heap. 2. Update the Aho-Corasick dependency, which includes `memchr2` and `memchr3` optimizations. 3. Add PCRE benchmarks using the Rust `pcre` bindings. Closes #66, #146.
A lazy DFA is much faster than executing an NFA because it doesn't repeat the work of following epsilon transitions over and and over. Instead, it computes states during search and caches them for reuse. We avoid exponential state blow up by bounding the cache in size. When the DFA isn't powerful enough to fulfill the caller's request (e.g., return sub-capture locations), it still runs to find the boundaries of the match and then falls back to NFA execution on the matched region. The lazy DFA can otherwise execute on every regular expression *except* for regular expressions that contain word boundary assertions (`\b` or `\B`). (They are tricky to implement in the lazy DFA because they are Unicode aware and therefore require multi-byte look-behind/ahead.) The implementation in this PR is based on the implementation in Google's RE2 library. Adding a lazy DFA was a substantial change and required several modifications: 1. The compiler can now produce both Unicode based programs (still used by the NFA engines) and byte based programs (required by the lazy DFA, but possible to use in the NFA engines too). In byte based programs, UTF-8 decoding is built into the automaton. 2. A new `Exec` type was introduced to implement the logic for compiling and choosing the right engine to use on each search. 3. Prefix literal detection was rewritten to work on bytes. 4. Benchmarks were overhauled and new ones were added to more carefully track the impact of various optimizations. 5. A new `HACKING.md` guide has been added that gives a high-level design overview of this crate. Other changes in this commit include: 1. Protection against stack overflows. All places that once required recursion have now either acquired a bound or have been converted to using a stack on the heap. 2. Update the Aho-Corasick dependency, which includes `memchr2` and `memchr3` optimizations. 3. Add PCRE benchmarks using the Rust `pcre` bindings. Closes #66, #146.
A lazy DFA is much faster than executing an NFA because it doesn't repeat the work of following epsilon transitions over and and over. Instead, it computes states during search and caches them for reuse. We avoid exponential state blow up by bounding the cache in size. When the DFA isn't powerful enough to fulfill the caller's request (e.g., return sub-capture locations), it still runs to find the boundaries of the match and then falls back to NFA execution on the matched region. The lazy DFA can otherwise execute on every regular expression *except* for regular expressions that contain word boundary assertions (`\b` or `\B`). (They are tricky to implement in the lazy DFA because they are Unicode aware and therefore require multi-byte look-behind/ahead.) The implementation in this PR is based on the implementation in Google's RE2 library. Adding a lazy DFA was a substantial change and required several modifications: 1. The compiler can now produce both Unicode based programs (still used by the NFA engines) and byte based programs (required by the lazy DFA, but possible to use in the NFA engines too). In byte based programs, UTF-8 decoding is built into the automaton. 2. A new `Exec` type was introduced to implement the logic for compiling and choosing the right engine to use on each search. 3. Prefix literal detection was rewritten to work on bytes. 4. Benchmarks were overhauled and new ones were added to more carefully track the impact of various optimizations. 5. A new `HACKING.md` guide has been added that gives a high-level design overview of this crate. Other changes in this commit include: 1. Protection against stack overflows. All places that once required recursion have now either acquired a bound or have been converted to using a stack on the heap. 2. Update the Aho-Corasick dependency, which includes `memchr2` and `memchr3` optimizations. 3. Add PCRE benchmarks using the Rust `pcre` bindings. Closes #66, #146.
One of the reasons why RE2/C++ is so fast is because it has two implementations of regex matching: a limited DFA matcher (no sub-capture support) and a full NFA simulation. This crate has the latter, but not the former.
Adding a DFA matcher should be an implementation detail and shouldn't require any public facing changes.
This is a pretty involved project. I hope to find time to do this some day, but if someone else wants to tackle it, I'd be happy to help mentor it. (Part of this will be figuring out how to handle the
regex!
macro. Do we replicate the DFA there too like we do the NFA?)The text was updated successfully, but these errors were encountered: