-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missed optimization for alterated word-lists — patterns like \b(foo|bar|baz|quux|...)\b
#787
Comments
The main problem here is, I think, that the assumption that finite automata means this problem doesn't exist. That only happens in theory, where you can play in pure-DFA land. And indeed, if you use ASCII word boundaries instead of Unicode word boundaries, the perf difference between your two benchmarks not only disappears, but becomes a lot faster:
The latter was taken after simply adding The issue here is the presence of the Unicode word boundary. It causes the Pike VM to be used because the lazy DFA can't handle it. (It is the only syntactic construct that the lazy DFA can't handle.) The Pike VM spends a lot of its time here shuffling the bytecode split instructions to handle the huge alternation. By applying your optimization, it spends less time doing that because more of its time will be spent inside a single (or fewer) alternations because you factored out common prefixes. Another approach here is to remove the word boundaries altogether and the parens. This triggers a naive detection algorithm that diverts the regex to Aho-Corasick:
Regretably, this makes things slower, because Aho-Corasick says, "oh, there aren't that many patterns, we can use Teddy on this instead of a DFA!" And for whatever reason, Teddy winds up being slower here than a good ol' DFA. (It usually isn't. This is one of the first cases I've seen where the difference is so stark.) Arguably, the regex crate should probably factor out common prefixes inside of alternations like you've done here, at the very least, to make the Pike VM and bounded backtracker run more quickly. |
Thanks for the quick and thorough response!
Ah, wonderful. Those numbers are much better. This is essentially what I was hoping for from the beginning. Admittedly, losing Unicode there is at least a tiny bit unfortunate — The ASCII boundary isn't quite right: it will consider something like That said, it's hard for me to think getting edge cases like (↑ admittedly dubious rationalization)
If nothing else, I'm glad I was able to shed light on a usage pattern which might be useful for benchmarking in the future. |
Yes, the situation with Unicode word boundaries is regrettable. I wish I had a good fix for it. The only trick that's employed is that the lazy DFA will attempt to treat a Unicode word boundary as if it were an ASCII boundary, and if it ever sees a non-ASCII byte, the lazy DFA quits and falls over to the Pike VM. The input you gave to the benchmark is definitely not all ASCII, so this trick didn't apply. But yes, I think your justification for using ASCII word boundaries is reasonably solid IMO.
Indeed. Next time I'm in that code, I'll check this test case out in more detail. It takes a lot of time to page Teddy into context. :-/ I suspect the problem is some combination of 1) high false positive rate in the prefilter, 2) validation being slow and 3) frequency of matches. |
Mostly just leaving myself a note here, but the regex optimization code you linked is not quite correct for leftmost-first or "preference order" matching. (Which is what this crate uses and also what all backtracking regex engines use.) For example,
You'll get the same result with the So either pygments is using a regex engine where this optimization is correct, or none of their use cases trigger this bug, or they specifically work around it somehow or pygments has a bug somewhere. I write this because I was thinking about how they did this without fouling things up, and admiring the simplicity of the I think this optimization is still doable somehow, but it is trickier than it looks. Alternatively, we only apply it if the result wouldn't alter match semantics. I think the rule for that would be: for any two strings I'm not 100% certain that rule is correct, but it's what popped out of my head. |
See also: #891 |
OK, so I think I'll be able to call this fixed once #656 lands. In particular, I've come up with a way to optimize alternations of literals that effectively moves it closer to a DFA as opposed to a naive NFA via removing most epsilon transitions. The fundamental problem with large alternations is that they get compiled into an NFA where a single state has an epsilon transition to each of the branches in the alternation. So every time you hit that NFA state, you wind up having to follow the epsilon transitions and "try" every branch. (A lot like a backtracker...) A DFA effectively removes this by getting rid of all epsilon transitions and is thus much faster. If your alternation is just a bunch of literals, one might imagine just shoving them into a trie and then compiling that down to an NFA. That works nicely because a trie is a DFA, and so something like The problem is that this crate (and all backtrackers) use leftmost-first or "preference order" for matching. As I described above, for example, Instead, what I came up with was to treat empty strings in the alternation as "barriers" that split up the alternation into chunks of branches. Branches within each chunk can be re-ordered freely, but you can't move a branch from one chunk to another chunk. It's a little hard to explain, but here's a before-and-after of what an NFA looks like for the regex
And now the after:
You can see that in the former case, there is a Once the clog is removed, this leads to substantial gains:
Note the The
So in this case, the rewritten "optimized" variant does worse than the "unoptimized" variant. The trie optimization doesn't apply to the "optimized" variant because the optimized variant is not a simple alternation of literals. There is nesting happening that inhibits the somewhat simplified analysis I added. So I guess it turns out that the automatic trie optimization ends up doing better here. Indeed, looking at the compiled NFA, the "optimized" variant has several largeish The main downside of this optimization is that it's easily broken. Even something like The "next level" version of this optimization is basically going to be a matter of selectively/smartly converting parts of an NFA into a DFA via epsilon removal. But I think we need a richer NFA data structure (like a full graph) to enable those. |
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #​1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #​244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #​259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #​476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #​644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #​675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #​824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #​961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #​1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #​68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #​510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@​\w+` will scan for `@`. - [PERF #​787](rust-lang/regex#787), [PERF #​891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #​429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #​517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #​579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #​779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #​850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #​921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #​976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #​1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9--> Co-authored-by: cabr2-bot <[email protected]> Co-authored-by: crapStone <[email protected]> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <[email protected]> Co-authored-by: Calciumdibromid Bot <[email protected]> Co-committed-by: Calciumdibromid Bot <[email protected]>
Describe your feature request
I'm unsure if I should report this as a feature request for a currently-missing "optimization", or a performance bug due to a missed optimization. The feature request template is easier to fill out, so... Here we are.
TLDR: Matching against a long list of strings is slower than it should be. Manually (or automatically) optimizing it in ways that a computer should be able to easily perform (see the giant gross regexes in code blocks) seems to have up to a ~2x performance benefit in some cases, such as the provided benchmark.
I'm working on a crate that performs syntax highlighting of source code (something like a rust equivalent to pygments).
As you may or may not imagine, it performs a lot of regex matching. One of the most situations in this kind of thing is "match any of the words in this medium-sized list" (usually on the order of 10s of items — probably very rarely more than 100, if ever).
That is, this is trying to match a regex which conceptually looks something like
\b(foo|bar|baz|quux|...)\b
, and it's typically used for all kinds of things: keywords, builtin functions/types, well-known stuff from a language's stdlib, special constants, ...Apparently from what I gather, historically this sort of regex has been a big pain point in terms of performance for other regex-based highlighting engines. Most of them seem to have some comments or guidance around it (often "avoid these"). Interestingly, pygments has a function that compiles an optimized regex from a list of matching words. The code for this is here: https://github.com/pygments/pygments/blob/faf69c028f0efa7798a938163e8432d838acfda2/pygments/regexopt.py
I had hoped that this would be unnecessary for the
regex
crate. Even saying something in a fit of hubris like "It'll be fine..." and reassuring myself that "some sort of aho-corasick-ey thing" will save the day, and that "the python code seems like it's an ad-hoc finite automata simplification", which I don't have to worry about because the regex crate is so good at that stuff. (A combination of: making things up to sound smart and stating things as facts because I hope that they are true).Anyway, I wasn't going to completely handwave it, so I did a smoke test with a benchmark, and (you can probably guess the results already, since I'm here and I wrote them in the TLDR above)... It seems like it's in fact not entirely fine.
The concrete numbers on my machine were (... you can tell from the benchmark names that this didn't quite turn out how I had expected):
Specifically, I get a 2x speedup from using this (generated using the python
pygments.regex_opt
algorithm linked above):Compared to this (generated by putting some clothes around the result of
words.join('|')
:(These match something vaguely shaped like the list of Rust keywords and builtin types)
The runnable (after a small amount of legwork) benchmark is available here: https://gist.github.com/thomcc/eeb77d34d8924444b5778d2e451caac6
(Note that the actual operation it's benching isn't really what the highlighting loop looks like, although I guess it's not entirely different either)
Anyway, now that all that rambling and background si out of the way, you can probably guess what I'm asking for, but just to be concrete:
I would for to not need to perform this kind of optimization on the input. It seems like perhaps the regex crate could perform something like it itself, or maybe something else could be done.
P.S. Psst, I uh, also wouldn't mind some pointers for a workaround, if you've got any (unless it's likely to be an easy fix, anyway). Or more generally, advice for how to match against a long list of words efficiently:
\b
with some effort, for example).regex
?fst
) crate directly, and then figure out word boundaries after-the-fact usingcryingguessingmy wit and ingenuity?The text was updated successfully, but these errors were encountered: