Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rollup: bump MSRV to Rust 1.65, bug fixes, memory usage reductions, API improvements, more word boundary assertions and more #1098

Merged
merged 33 commits into from
Oct 9, 2023

Conversation

BurntSushi
Copy link
Member

No description provided.

The name was quite vague, so add a little specificity.
Breaking lines in the middle of backticks appears to be bad juju for
some Markdown renderers.
This fixes a bug that can occur when:

1. The regex has a Unicode word boundary.
2. The haystack contains some non-ASCII Unicode scalar value.
3. An inner or suffix literal optimization is in play.

Specifically, this provokes a case where a match is detected in one of
the meta engine's ad hoc DFA search routines, but before the match
reaches its correct endpoint, a quit state is entered. (Because DFAs
can't deal with Unicode word boundaries on non-ASCII haystacks.) The
correct thing to do is to return a quit error and let the higher level
logic divert to a different engine, but it was returning the match that
it had found up until that point instead. The match returned is not
technically incorrect in the sense that a match does indeed exist, but
the offsets it reports may be shorter than what the true match actually
is.

So... if a quit state is entered, return an error regardless of whether
a match has been found.

Fixes #1046
This was previously using the raw representation of a `LookSet`, which
is fine, but would have errantly overwritten bits unrelated to
look-around assertions if they were set in a `LookSet`.

This can't happen today because we don't have more than 10 assertions.
And the one-pass DFA constructor specifically errors if more assertions
exist and are in the pattern. But still, it seems like good form to mask
out only the bits we care about.
This puts every Ast value behind a box to conserve space. It makes
things like Vec<Ast> quite a bit smaller than what they would be
otherwise, which is especially beneficial for the representation of
concatenations and alternations.

This doesn't quite solve the memory usage problems though, since an
AstKind is still quite big (over 200 bytes). The next step will be
boxing each of the variants of an AstKind which should hopefully resolve
the issue.

Ref #1090
This does reduce memory, but not as much as it is reduced if we don't
box the Ast.
@BurntSushi BurntSushi force-pushed the ag/work branch 4 times, most recently from 87aa51c to eab70f5 Compare October 8, 2023 14:22
@BurntSushi BurntSushi force-pushed the ag/work branch 2 times, most recently from 296b1b6 to 52140fa Compare October 8, 2023 19:26
@BurntSushi BurntSushi changed the title rollup: bug fixes, memory usage reductions, API improvements, more word boundary assertions and more rollup: bump MSRV to Rust 1.65, bug fixes, memory usage reductions, API improvements, more word boundary assertions and more Oct 8, 2023
BurntSushi and others added 14 commits October 9, 2023 15:10
The AstKind experiment proved unfruitful. I think the issue here is that
the savings on Vec<Ast> didn't prove to be enough to offset the extra
heap allocation that resulted from the indirection.

This seems to be a sweet spot. It would be nice to get Ast down below 16
bytes, but it's not clear how to do that (without much larger changes
that I don't feel inclined to pursue).

Fixes #1090
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position due to the fact that part of the point of the
HIR is to erase all of the regex flags so that callers no longer need to
worry about them. That is, the erasure is the point that provides a
simplification for everyone downstream.

Closes #1088
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for all cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
That should cover all of them.

Closes #1053
This reduces or eliminates allocation when combining Unicode classes and
should make some things faster. It's unlikely for these optimizations to
matter much in practice, but they are likely to help in niche or
pathological cases where there are a lot of ops in a class.

Closes #1051
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
In this commit, all of the regex engines now support the new special
word boundary assertions: \b{start}, \b{end}, \b{start-half} and
\b{end-half}. Of course, when they are Unicode-aware, the DFAs will quit
upon seeing a non-ASCII character, just like for the \b and \B
assertions.

For now, we don't add support to the one-pass DFA, since it would either
make it use more memory or reduce the number of capture groups it
supports. I think these assertions will be rare enough that it isn't
worth adding support yet.

This is a breaking change because it adds new variants to the `Look`
enum.
This was substantially easier. Coupling, private abstractions and slow
code are so much easier to deal with.

Ref #469
It is almost completely wrong now. Instead of rewriting it---which would
be a huge endeavor---we just point folks toward my blog on regex
internals.

Closes #1058
BurntSushi and others added 12 commits October 9, 2023 15:48
Some doc tests make 64-bit assumptions and fail on 32-bit. I'd be open
to perhaps refactoring the tests somehow to make them work on both, but
I literally have no easy way to run doc tests in a 32-bit environment.
Without being able to actually run them myself, I don't feel comfortable
doing anything other than squashing the tests in that case.

Closes #1041
These panics I do not believe can occur from an actual pattern, since
the parser will either never produce such things or will return an
error. But still, the Ast->Hir translator shouldn't panic in such cases.

Actually, the non-sensical Ast values are actually somewhat sensible,
and they don't map to invalid regexes. These panics were likely the
result of the regex crate not supporting empty patterns or "fail"
patterns particularly well in the fast. But now that we do, we can just
let the Asts through and generate the Hir you'd expect.

Fixes #1047
This commit fixes a subtle *performance* bug in the start state
computation. The issue here is rather tricky, but it boils down to the
fact that the way the look-behind assertions are computed in the start
state is not quite precisely equivalent to how they're computed during
normal state generation. Namely, in normal state generation, we only
compute look-behind assertions if the NFA actually has one (or one
similar to it) in its graph somewhere. If it doesn't, then there's no
point in saving whether the assertion is satisfied or not.

Logically speaking, this doesn't matter too much, because if the
look-around assertions don't match up with how they're computed in the
start state, a new state will simply be created. Not a huge deal, but
wasteful. The real problem is that the new state will no longer be
considered a start state. It will just be like any other normal state.
We rely on being able to detect start states at search time to know when
to trigger the prefilter. So if we re-generate start states as non-start
states, then we may end up not triggering the prefilter. That's bad.

rebar actually caught this bug via the
`imported/sherlock/line-boundary-sherlock-holmes` benchmark, which
recorded a 20x slowdown due to the prefilter not running. Owch!

This specifically was caused by the start states unconditionally
attaching half-starting word boundary assertions whenever they were
satisfied, where as normal state generation only does this when there is
actually a half-starting word boundary assertion in the NFA. So this led
to re-generating start states needlessly.

Interestingly, the start state computation was unconditionally attaching
all different types of look-behind assertions, and thus in theory, this
problem already existed under different circumstances. My hypothesis is
that it wasn't "as bad" because it was mostly limited to line
terminators. But the half-starting word boundary assertion is much more
broadly applicable.

We remedy this not only for the half-starting word boundary assertion,
but for all others as well. I also did manual mutation testing in this
start state computation and found a few branches not covered by tests.
We add those tests here.

Thanks rebar!
This MSRV bump is mostly motivated by "good sense," and in particular,
Rust 1.65 means we can use 'let ... else'. We don't actually start
peppering the code with 'let ... else' just yet, but we fix a few
outstanding small issues and update our Rust version everywhere.

Also, Rust 1.65 is about a year old at time of writing. Let's keep the
trains moving.
It's not feasible for us to check such things when deserializing a DFA,
so we just have no real choice but to remove the assert and let the
search proceed with incorrect results. I had previously wrote these as
real asserts and then swapped them to debug_asserts, but of course, the
fuzzer still trips over them.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60652
It's possible for DFA deserialization to result in an otherwise valid
DFA, but one that records accelerated DFA states without any actual
accelerator. We remedy that by checking for it at deserialization time.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60739
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=61255

fixup
This rejiggers some code so that we can more reliably check whether
start state IDs are valid or not.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=62726
I couldn't get this to reproduce. Maybe some of my recent changes to
regex-syntax fixed this? Not sure.

I'm not a huge fan of this fuzzer in general because it isn't really
testing a rock solid guarantee that we provide. And the positions are
tough to deal with.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=62382
@BurntSushi BurntSushi merged commit 2c44e2a into master Oct 9, 2023
@BurntSushi BurntSushi deleted the ag/work branch October 9, 2023 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants