-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rollup: bump MSRV to Rust 1.65, bug fixes, memory usage reductions, API improvements, more word boundary assertions and more #1098
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The name was quite vague, so add a little specificity.
Breaking lines in the middle of backticks appears to be bad juju for some Markdown renderers.
This fixes a bug that can occur when: 1. The regex has a Unicode word boundary. 2. The haystack contains some non-ASCII Unicode scalar value. 3. An inner or suffix literal optimization is in play. Specifically, this provokes a case where a match is detected in one of the meta engine's ad hoc DFA search routines, but before the match reaches its correct endpoint, a quit state is entered. (Because DFAs can't deal with Unicode word boundaries on non-ASCII haystacks.) The correct thing to do is to return a quit error and let the higher level logic divert to a different engine, but it was returning the match that it had found up until that point instead. The match returned is not technically incorrect in the sense that a match does indeed exist, but the offsets it reports may be shorter than what the true match actually is. So... if a quit state is entered, return an error regardless of whether a match has been found. Fixes #1046
This was previously using the raw representation of a `LookSet`, which is fine, but would have errantly overwritten bits unrelated to look-around assertions if they were set in a `LookSet`. This can't happen today because we don't have more than 10 assertions. And the one-pass DFA constructor specifically errors if more assertions exist and are in the pattern. But still, it seems like good form to mask out only the bits we care about.
This puts every Ast value behind a box to conserve space. It makes things like Vec<Ast> quite a bit smaller than what they would be otherwise, which is especially beneficial for the representation of concatenations and alternations. This doesn't quite solve the memory usage problems though, since an AstKind is still quite big (over 200 bytes). The next step will be boxing each of the variants of an AstKind which should hopefully resolve the issue. Ref #1090
This does reduce memory, but not as much as it is reduced if we don't box the Ast.
BurntSushi
force-pushed
the
ag/work
branch
4 times, most recently
from
October 8, 2023 14:22
87aa51c
to
eab70f5
Compare
BurntSushi
force-pushed
the
ag/work
branch
2 times, most recently
from
October 8, 2023 19:26
296b1b6
to
52140fa
Compare
BurntSushi
changed the title
rollup: bug fixes, memory usage reductions, API improvements, more word boundary assertions and more
rollup: bump MSRV to Rust 1.65, bug fixes, memory usage reductions, API improvements, more word boundary assertions and more
Oct 8, 2023
The AstKind experiment proved unfruitful. I think the issue here is that the savings on Vec<Ast> didn't prove to be enough to offset the extra heap allocation that resulted from the indirection. This seems to be a sweet spot. It would be nice to get Ast down below 16 bytes, but it's not clear how to do that (without much larger changes that I don't feel inclined to pursue). Fixes #1090
Basically, we never should have guaranteed that a particular HIR would (or wouldn't) be used if the 'u' flag was present (or absent). Such a guarantee generally results in too little flexibility, particularly when it comes to HIR's smart constructors. We could probably uphold that guarantee, but it's somewhat gnarly to do and would require rejiggering some of the HIR types. For example, we would probably need a literal that is an enum of `&str` or `&[u8]` that correctly preserves the Unicode flag. This in turn comes with a bigger complexity cost in various rewriting rules. In general, it's much simpler to require the caller to be prepared for any kind of HIR regardless of what the flags are. I feel somewhat justified in this position due to the fact that part of the point of the HIR is to erase all of the regex flags so that callers no longer need to worry about them. That is, the erasure is the point that provides a simplification for everyone downstream. Closes #1088
It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031
That should cover all of them. Closes #1053
This reduces or eliminates allocation when combining Unicode classes and should make some things faster. It's unlikely for these optimizations to matter much in practice, but they are likely to help in niche or pathological cases where there are a lot of ops in a class. Closes #1051
This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469
This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469
In this commit, all of the regex engines now support the new special word boundary assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}. Of course, when they are Unicode-aware, the DFAs will quit upon seeing a non-ASCII character, just like for the \b and \B assertions. For now, we don't add support to the one-pass DFA, since it would either make it use more memory or reduce the number of capture groups it supports. I think these assertions will be rare enough that it isn't worth adding support yet. This is a breaking change because it adds new variants to the `Look` enum.
This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469
It is almost completely wrong now. Instead of rewriting it---which would be a huge endeavor---we just point folks toward my blog on regex internals. Closes #1058
Some doc tests make 64-bit assumptions and fail on 32-bit. I'd be open to perhaps refactoring the tests somehow to make them work on both, but I literally have no easy way to run doc tests in a 32-bit environment. Without being able to actually run them myself, I don't feel comfortable doing anything other than squashing the tests in that case. Closes #1041
These panics I do not believe can occur from an actual pattern, since the parser will either never produce such things or will return an error. But still, the Ast->Hir translator shouldn't panic in such cases. Actually, the non-sensical Ast values are actually somewhat sensible, and they don't map to invalid regexes. These panics were likely the result of the regex crate not supporting empty patterns or "fail" patterns particularly well in the fast. But now that we do, we can just let the Asts through and generate the Hir you'd expect. Fixes #1047
This commit fixes a subtle *performance* bug in the start state computation. The issue here is rather tricky, but it boils down to the fact that the way the look-behind assertions are computed in the start state is not quite precisely equivalent to how they're computed during normal state generation. Namely, in normal state generation, we only compute look-behind assertions if the NFA actually has one (or one similar to it) in its graph somewhere. If it doesn't, then there's no point in saving whether the assertion is satisfied or not. Logically speaking, this doesn't matter too much, because if the look-around assertions don't match up with how they're computed in the start state, a new state will simply be created. Not a huge deal, but wasteful. The real problem is that the new state will no longer be considered a start state. It will just be like any other normal state. We rely on being able to detect start states at search time to know when to trigger the prefilter. So if we re-generate start states as non-start states, then we may end up not triggering the prefilter. That's bad. rebar actually caught this bug via the `imported/sherlock/line-boundary-sherlock-holmes` benchmark, which recorded a 20x slowdown due to the prefilter not running. Owch! This specifically was caused by the start states unconditionally attaching half-starting word boundary assertions whenever they were satisfied, where as normal state generation only does this when there is actually a half-starting word boundary assertion in the NFA. So this led to re-generating start states needlessly. Interestingly, the start state computation was unconditionally attaching all different types of look-behind assertions, and thus in theory, this problem already existed under different circumstances. My hypothesis is that it wasn't "as bad" because it was mostly limited to line terminators. But the half-starting word boundary assertion is much more broadly applicable. We remedy this not only for the half-starting word boundary assertion, but for all others as well. I also did manual mutation testing in this start state computation and found a few branches not covered by tests. We add those tests here. Thanks rebar!
This MSRV bump is mostly motivated by "good sense," and in particular, Rust 1.65 means we can use 'let ... else'. We don't actually start peppering the code with 'let ... else' just yet, but we fix a few outstanding small issues and update our Rust version everywhere. Also, Rust 1.65 is about a year old at time of writing. Let's keep the trains moving.
It's not feasible for us to check such things when deserializing a DFA, so we just have no real choice but to remove the assert and let the search proceed with incorrect results. I had previously wrote these as real asserts and then swapped them to debug_asserts, but of course, the fuzzer still trips over them. Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60652
It's possible for DFA deserialization to result in an otherwise valid DFA, but one that records accelerated DFA states without any actual accelerator. We remedy that by checking for it at deserialization time. Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60739 Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=61255 fixup
Otherwise we risk timeouts. Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=61484
This rejiggers some code so that we can more reliably check whether start state IDs are valid or not. Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=62726
I couldn't get this to reproduce. Maybe some of my recent changes to regex-syntax fixed this? Not sure. I'm not a huge fan of this fuzzer in general because it isn't really testing a rock solid guarantee that we provide. And the positions are tough to deal with. Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=62382
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.