Analyze complexity before actual compile? #11
It would be helpful to have a real example. It could be the case that a plain `Regex` […]. At some point in the future, I plan to use […].

Interestingly, Becchi and Crowley published "A Hybrid Finite Automaton for Practical Deep Packet Inspection", which tries to answer precisely this question: can we tell if a given regex is going to cause exponential state explosion? It's not quite the same as "will the DFA be too big to be practically useful," but maybe some insight can be gleaned from it. It has been years since I've read the paper, and since I haven't tried to tackle this problem yet, I haven't re-read it and can't remember much about it. When I do try to tackle this problem (I expect it will be a year or more before I do), I plan to absorb that paper and see what can be done with it.

Failing all of that, my plan is to just use heuristics. E.g., if the regex has no large bounded repetitions, has no large Unicode character classes and is otherwise under a certain size, then try to build the DFA for it. I suspect this would be coupled with an option in this crate (which I plan to add) that would allow the construction of the DFA to fail once it got beyond a certain size (configured by the caller). My guess is that such heuristics would probably work well and might even be necessary, especially since we care about non-exponentially large DFAs. For example, there is nothing exponential about […].

Note that for the past several months, I've been working on a ground-up rewrite of this crate in the […].
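For a rough idea of what such a heuristic could look like, here is a toy sketch. The function name, the thresholds, and the brace-scanning approach are all invented for illustration; a real implementation would parse the pattern with a proper regex parser (e.g. `regex-syntax`) instead of scanning for `{...}` textually.

```rust
/// Hypothetical pre-compile triage: guess whether eagerly building a
/// full DFA for `pattern` is likely to be cheap. The thresholds are
/// made up; real values would need benchmarking on real workloads.
fn looks_cheap_to_determinize(pattern: &str) -> bool {
    const MAX_PATTERN_LEN: usize = 256;
    const MAX_BOUNDED_REP: u32 = 32;

    // Very long patterns: don't bother with a full DFA.
    if pattern.len() > MAX_PATTERN_LEN {
        return false;
    }
    // Reject large bounded repetitions like `a{1000}` or `a{10,5000}`,
    // which blow up the number of DFA states without being exponential.
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'{' {
            let end = match pattern[i + 1..].find('}') {
                Some(j) => i + 1 + j,
                // No closing brace: treat it as a literal `{`.
                None => break,
            };
            for part in pattern[i + 1..end].split(',') {
                if let Ok(n) = part.trim().parse::<u32>() {
                    if n > MAX_BOUNDED_REP {
                        return false;
                    }
                }
            }
            i = end;
        }
        i += 1;
    }
    true
}
```

A caller could then build the full DFA only when this returns true, and fall back to a plain `Regex` (or a size-limited DFA construction) otherwise.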
Thanks for the feedback, very interesting. Some examples, just grabbed out of a test in my debug build:

Regex: […] (I've used a "latin range unicode trick", […])

Input string: […]

However, I also have a test for a much more complex pattern: […]

Input string: […]

I found this a bit interesting because the second pattern was an order of magnitude cheaper to match using […].

edit: I tested the first […]

edit 2: I have verified that the input patterns to Regex and Automata are equivalent; I have two distinct functions generating them, so the "clauses" end up in a different order. (Also I'll mention that the patterns for the […])
@audunhalland Thanks for the examples! Would it be possible to share your benchmark program in full? I'd love to look at precisely the code you're writing. If so, I can try to take a closer look at the performance profile here and either try to fix things or at least explain them. The performance profile of a […].

I am also quite interested in your precise program because, for example, your bigger regex takes a lot longer to compile in my branch than it does for you. And I'd like to make sure I am comparing apples-to-apples here, but I can't do that without seeing precisely how you've compiled your pattern. In my branch, there is a […]
So in this case, it took 285ms to compile the DFA. And it is 7.5MB in size! Yikes.
Note that in my rewrite, the crate supports all the same regex features (except Unicode word boundaries). So you can write anchor assertions. When I do that, the DFA gets much smaller:
And notice that its compile time is snappier. I made some improvements in that department in my rewrite as well. :-) This made me realize that you're probably compiling your DFAs in anchored mode (and yeah... you said that, I missed it, sorry!). Once I do that with your original regex, compilation time goes way down:
But at least with my branch, you don't need two separate regexes. You can use the same one for both. Anyway, I would still love to see your benchmark program if that's feasible, so I can explore your problems more precisely.
The code is part of a "Highlighter"-module in a search engine application using https://github.com/tantivy-search/tantivy. It's not open source and I'm developing for a client, but sure I can extract just this module to a separate repo. I can get to it tonight, probably. |
@audunhalland Yeah, I mean the benchmarking aspect. E.g., you mention running something 400 times; I assumed that was a small benchmark program. But yeah, the important bit is making sure I know how you're compiling things and so on.
Here's the code:
@audunhalland Thank you! That's really useful data. I don't see anything to be done at the moment other than to perhaps invent your own heuristics based on the patterns you're seeing. Once I get to the point where I need to do that, I'll be sure to include these regexes in the benchmark suite. :-)
We were discussing this recently over at the Helix project, actually. We're planning to switch over to […].

A stop-gap solution we're considering is to abuse the […].

(Of course, ultimately we probably want to use something like a lazy DFA or an NFA. But in practice this would be a good solution for the time being, especially given the typical searches done in a text editor.)
Yeah, that is a creative use. regex-automata 0.2 will have an API that gives a more explicit way to error out if the DFA gets too large. And as of about a week ago, I do have the lazy DFA implemented. That will probably give you what you want, although there are some restrictions on how the state IDs are used. Namely, I believe the ultimate restriction will be: "the only state ID you can feed to the transition function is the state ID previously returned by the transition function." (Those restrictions can be given up in exchange for allowing unbounded memory growth.)
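To illustrate why that restriction on state IDs arises, here is a deliberately tiny toy (not regex-automata's actual implementation): DFA states are determinized from an epsilon-free, byte-labelled NFA on demand and memoized, so a state ID is only meaningful once `next` has handed it out. The names and the NFA encoding are invented for this sketch, and there is no cache eviction (which is where the "unbounded memory growth" trade-off comes in).

```rust
use std::collections::HashMap;

/// Toy lazy DFA built by on-the-fly subset construction. NFA start
/// state is assumed to be 0; `nfa[s]` lists `(byte, target)` edges.
struct LazyDfa {
    nfa: Vec<Vec<(u8, usize)>>,
    nfa_match: usize,                   // the NFA's single match state
    states: Vec<Vec<usize>>,            // DFA id -> sorted NFA state set
    cache: HashMap<(usize, u8), usize>, // (DFA id, byte) -> DFA id
    seen: HashMap<Vec<usize>, usize>,   // NFA state set -> DFA id
}

impl LazyDfa {
    fn new(nfa: Vec<Vec<(u8, usize)>>, nfa_match: usize) -> LazyDfa {
        let start = vec![0];
        let mut seen = HashMap::new();
        seen.insert(start.clone(), 0); // DFA state 0 = {NFA state 0}
        LazyDfa { nfa, nfa_match, states: vec![start], cache: HashMap::new(), seen }
    }

    /// The transition function. `sid` must be 0 (the start state) or
    /// an id previously returned by this function -- exactly the
    /// restriction described above.
    fn next(&mut self, sid: usize, byte: u8) -> usize {
        if let Some(&t) = self.cache.get(&(sid, byte)) {
            return t;
        }
        // Subset construction for this one transition only.
        let mut set = Vec::new();
        for &s in &self.states[sid] {
            for &(b, t) in &self.nfa[s] {
                if b == byte {
                    set.push(t);
                }
            }
        }
        set.sort();
        set.dedup();
        let tid = match self.seen.get(&set) {
            Some(&t) => t,
            None => {
                // First time we see this set: mint a fresh DFA state.
                let t = self.states.len();
                self.states.push(set.clone());
                self.seen.insert(set, t);
                t
            }
        };
        self.cache.insert((sid, byte), tid);
        tid
    }

    fn is_match(&self, sid: usize) -> bool {
        self.states[sid].contains(&self.nfa_match)
    }

    /// Anchored full-haystack match, driving the lazy DFA byte by byte.
    fn matches(&mut self, haystack: &[u8]) -> bool {
        let mut sid = 0;
        for &b in haystack {
            sid = self.next(sid, b);
        }
        self.is_match(sid)
    }
}
```

For example, an NFA for `a*b` (state 0 loops on `a` and steps to match state 1 on `b`) gets its DFA states created only as input bytes demand them.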
I think this is perhaps an interesting future direction, but I don't think there is anything actionable at present here. |
Hi, I'm wondering if there exists a theoretical way to cheaply (ish?) analyze the resulting automaton complexity ahead of time, without actually building it.
My use case is that I have an opportunity to choose a pattern implementation dynamically, between a plain `Regex`, which is faster to compile but (in some cases) much slower to match, and an automaton, which seems to always be much faster to match against. It would be helpful to have such a measurement up front, in order to help pick the tradeoff. I know that the docs say no untrusted patterns, but the user does not write the pattern directly: the patterns are generated based on "wildcards" (`*`) in search words.

If this sounds impossible, then so be it :)
Otherwise, thanks for a great lib!
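For context, wildcard-to-pattern generation along these lines might look like the sketch below. The helper is hypothetical (not the actual application code): it anchors the pattern, turns each `*` into `.*`, and backslash-escapes ASCII punctuation so it matches literally. Anchoring also ties in with the discussion above, since anchored patterns tend to produce much smaller DFAs.

```rust
/// Hypothetical sketch: turn a search word containing `*` wildcards
/// into an anchored regex pattern string.
fn wildcard_to_regex(word: &str) -> String {
    let mut pattern = String::from("^");
    for ch in word.chars() {
        if ch == '*' {
            // A wildcard matches any (possibly empty) run of characters.
            pattern.push_str(".*");
        } else if ch.is_alphanumeric() {
            pattern.push(ch);
        } else {
            // Escape punctuation so it is matched literally. (A real
            // implementation would use a proper escaping routine.)
            pattern.push('\\');
            pattern.push(ch);
        }
    }
    pattern.push('$');
    pattern
}
```

So a search word like `foo*bar` would become the pattern `^foo.*bar$`, which can then be handed to either a plain `Regex` or a DFA builder.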