Analyze complexity before actual compile? #11
It would be helpful to have a real example. It could be the case that a plain `Regex` […]. At some point in the future, I plan to use […].

Interestingly, Becchi and Crowley published "A Hybrid Finite Automaton for Practical Deep Packet Inspection", which tries to answer precisely this question: can we tell if a given regex is going to cause exponential state explosion? It's not quite the same as "will the DFA be too big to be practically useful," but maybe some insight can be gleaned from it. It has been years since I've read the paper, and since I haven't tried to tackle this problem yet, I haven't re-read it and can't remember much about it. When I do try to tackle this problem (I expect it will be a year or more before I do), I plan to absorb that paper and see what can be done with it.

Failing all of that, my plan is to just use heuristics. E.g., if the regex has no large bounded repetitions, has no large Unicode character classes and is otherwise under a certain size, then try to build the DFA for it. I suspect this would be coupled with an option in this crate (which I plan to add) that would allow the construction of the DFA to fail once it got beyond a certain size (configured by the caller). My guess is that such heuristics would probably work well and might even be necessary, especially since we care about non-exponentially large DFAs. For example, there is nothing exponential about […].

Note that for the past several months, I've been working on a ground-up rewrite of this crate in the […].
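For a rough idea of what such a heuristic could look like, here is a toy sketch. The function name, the thresholds, and the brace-scanning approach are all invented for illustration; a real implementation would parse the pattern with a proper regex parser (e.g. `regex-syntax`) instead of scanning for `{...}` textually.

```rust
/// Hypothetical pre-compile triage: guess whether eagerly building a
/// full DFA for `pattern` is likely to be cheap. The thresholds are
/// made up; real values would need benchmarking on real workloads.
fn looks_cheap_to_determinize(pattern: &str) -> bool {
    const MAX_PATTERN_LEN: usize = 256;
    const MAX_BOUNDED_REP: u32 = 32;

    // Very long patterns: don't bother with a full DFA.
    if pattern.len() > MAX_PATTERN_LEN {
        return false;
    }
    // Reject large bounded repetitions like `a{1000}` or `a{10,5000}`,
    // which blow up the number of DFA states without being exponential.
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'{' {
            let end = match pattern[i + 1..].find('}') {
                Some(j) => i + 1 + j,
                // No closing brace: treat it as a literal `{`.
                None => break,
            };
            for part in pattern[i + 1..end].split(',') {
                if let Ok(n) = part.trim().parse::<u32>() {
                    if n > MAX_BOUNDED_REP {
                        return false;
                    }
                }
            }
            i = end;
        }
        i += 1;
    }
    true
}
```

A caller could then build the full DFA only when this returns true, and fall back to a plain `Regex` (or a size-limited DFA construction) otherwise.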
Thanks for the feedback, very interesting. Some examples, just grabbed out of a test in my debug build:

Regex: […] (I've used a "latin range unicode trick", […])

Input string: […]

However, I also have a test for a much more complex pattern: […]

Input string: […]

I found this a bit interesting because the second pattern was an order of magnitude cheaper to match using […].

edit: I tested the first […]

edit 2: I have verified that the input patterns to Regex and Automata are equivalent; I have two distinct functions generating them, so the "clauses" end up in a different order. (Also I'll mention that the patterns for the […])
@audunhalland Thanks for the examples! Would it be possible to share your benchmark program in full? I'd love to look at precisely the code you're writing. If so, I can try to take a closer look at the performance profile here and either try to fix things or at least explain them. The performance profile of a […].

I am also quite interested in your precise program because, for example, your bigger regex takes a lot longer to compile in my branch than it does for you. And I'd like to make sure I am comparing apples-to-apples here, but I can't do that without seeing precisely how you've compiled your pattern. In my branch, there is a […]
So in this case, it took 285ms to compile the DFA. And it is 7.5MB in size! Yikes.
Note that in my rewrite, the crate supports all the same regex features (except Unicode word boundaries). So you can write anchor assertions. When I do that, the DFA gets much smaller:
And notice that its compile time is snappier. I made some improvements in that department in my rewrite as well. :-) This made me realize that you're probably compiling your DFAs in anchored mode (and yeah... you said that, I missed it, sorry!). Once I do that with your original regex, compilation time goes way down:
But at least with my branch, you don't need two separate regexes. You can use the same one for both. Anyway, I would still love to see your benchmark program if that's feasible, so I can explore your problems more precisely.
The code is part of a "Highlighter"-module in a search engine application using https://github.com/tantivy-search/tantivy. It's not open source and I'm developing for a client, but sure I can extract just this module to a separate repo. I can get to it tonight, probably. |
@audunhalland Yeah, I mean the benchmarking aspect. E.g., you mention running something 400 times; I assumed that was a small benchmark program. But yeah, the important bit is making sure I know how you're compiling things and so on.
Here's the code:
@audunhalland Thank you! That's really useful data. I don't see anything to be done at the moment other than to perhaps invent your own heuristics based on the patterns you're seeing. Once I get to the point where I need to do that, I'll be sure to include these regexes in the benchmark suite. :-)
We were discussing this recently over at the Helix project, actually. We're planning to switch over to […].

A stop-gap solution we're considering is to abuse the […].

(Of course, ultimately we probably want to use something like a lazy DFA or an NFA. But in practice this would be a good solution for the time being, especially given the typical searches done in a text editor.)
Yeah, that is a creative use. regex-automata 0.2 will have an API that gives a more explicit way to error out if the DFA gets too large. And as of about a week ago, I do have the lazy DFA implemented. That will probably give you what you want, although there are some restrictions on how the state IDs are used. Namely, I believe the ultimate restriction will be: "the only state ID you can feed to the transition function is the state ID previously returned by the transition function." (Those restrictions can be given up in exchange for allowing unbounded memory growth.)
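To illustrate why that restriction on state IDs arises, here is a deliberately tiny toy (not regex-automata's actual implementation): DFA states are determinized from an epsilon-free, byte-labelled NFA on demand and memoized, so a state ID is only meaningful once `next` has handed it out. The names and the NFA encoding are invented for this sketch, and there is no cache eviction (which is where the "unbounded memory growth" trade-off comes in).

```rust
use std::collections::HashMap;

/// Toy lazy DFA built by on-the-fly subset construction. NFA start
/// state is assumed to be 0; `nfa[s]` lists `(byte, target)` edges.
struct LazyDfa {
    nfa: Vec<Vec<(u8, usize)>>,
    nfa_match: usize,                   // the NFA's single match state
    states: Vec<Vec<usize>>,            // DFA id -> sorted NFA state set
    cache: HashMap<(usize, u8), usize>, // (DFA id, byte) -> DFA id
    seen: HashMap<Vec<usize>, usize>,   // NFA state set -> DFA id
}

impl LazyDfa {
    fn new(nfa: Vec<Vec<(u8, usize)>>, nfa_match: usize) -> LazyDfa {
        let start = vec![0];
        let mut seen = HashMap::new();
        seen.insert(start.clone(), 0); // DFA state 0 = {NFA state 0}
        LazyDfa { nfa, nfa_match, states: vec![start], cache: HashMap::new(), seen }
    }

    /// The transition function. `sid` must be 0 (the start state) or
    /// an id previously returned by this function -- exactly the
    /// restriction described above.
    fn next(&mut self, sid: usize, byte: u8) -> usize {
        if let Some(&t) = self.cache.get(&(sid, byte)) {
            return t;
        }
        // Subset construction for this one transition only.
        let mut set = Vec::new();
        for &s in &self.states[sid] {
            for &(b, t) in &self.nfa[s] {
                if b == byte {
                    set.push(t);
                }
            }
        }
        set.sort();
        set.dedup();
        let tid = match self.seen.get(&set) {
            Some(&t) => t,
            None => {
                // First time we see this set: mint a fresh DFA state.
                let t = self.states.len();
                self.states.push(set.clone());
                self.seen.insert(set, t);
                t
            }
        };
        self.cache.insert((sid, byte), tid);
        tid
    }

    fn is_match(&self, sid: usize) -> bool {
        self.states[sid].contains(&self.nfa_match)
    }

    /// Anchored full-haystack match, driving the lazy DFA byte by byte.
    fn matches(&mut self, haystack: &[u8]) -> bool {
        let mut sid = 0;
        for &b in haystack {
            sid = self.next(sid, b);
        }
        self.is_match(sid)
    }
}
```

For example, an NFA for `a*b` (state 0 loops on `a` and steps to match state 1 on `b`) gets its DFA states created only as input bytes demand them.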
I think this is perhaps an interesting future direction, but I don't think there is anything actionable at present here. |
Hi, I'm wondering if there exists a theoretical way to cheaply (ish?) analyze the resulting automaton complexity ahead of time, without actually building it.
My use case is that I have an opportunity to choose a pattern implementation dynamically, between a plain `Regex`, which is faster to compile but (in some cases) much slower to match, and an automaton, which seems to always be much faster to match against. It would be helpful to have such a measurement up front, in order to help pick the tradeoff. I know that the docs say no untrusted patterns, but the user does not write the pattern directly: the patterns are generated based on "wildcards" (`*`) in search words.

If this sounds impossible, then so be it :)
Otherwise, thanks for a great lib!
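For context, wildcard-to-pattern generation along these lines might look like the sketch below. The helper is hypothetical (not the actual application code): it anchors the pattern, turns each `*` into `.*`, and backslash-escapes ASCII punctuation so it matches literally. Anchoring also ties in with the discussion above, since anchored patterns tend to produce much smaller DFAs.

```rust
/// Hypothetical sketch: turn a search word containing `*` wildcards
/// into an anchored regex pattern string.
fn wildcard_to_regex(word: &str) -> String {
    let mut pattern = String::from("^");
    for ch in word.chars() {
        if ch == '*' {
            // A wildcard matches any (possibly empty) run of characters.
            pattern.push_str(".*");
        } else if ch.is_alphanumeric() {
            pattern.push(ch);
        } else {
            // Escape punctuation so it is matched literally. (A real
            // implementation would use a proper escaping routine.)
            pattern.push('\\');
            pattern.push(ch);
        }
    }
    pattern.push('$');
    pattern
}
```

So a search word like `foo*bar` would become the pattern `^foo.*bar$`, which can then be handed to either a plain `Regex` or a DFA builder.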