-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Major refactoring and performance improvements. #91
Conversation
r? @huonw (rust_highfive has picked a reviewer for you, use r? to override) |
Before:
After:
|
My thoughts, focusing on the main library (i.e. not the
Sorry I don't have any constructive criticism to offer. Well done! |
Wow! This is super impressive, nice work @BurntSushi! I think that you're definitely the most familiar with the design of this library, so you're probably the best person to review it as well :) From a high-level perspective, all I'd have to offer are:
I'm fine merging this whenever you're ready, although I'd just wait to get confirmation that it works ok on Windows (I can help with any automation issues, but I think a force-push of this branch will trigger the status checks). |
Overview of changes: * Instruction set has been redesigned to be smaller, mostly by collapsing empty-width matches into one instruction type. In addition to moving instruction-matching out of the matching engine, this makes matching engine code much simpler. * Rewrote input handling to use an inline representation of `Option<char>` and clearer position handling with the `Input` trait. * Added a new bounded backtracking matching engine that is invoked for small regexes/inputs. It's about twice as fast as the full NFA matching engine. * Implemented caching for both the NFA and backtracking engines. This avoids costly allocations on subsequent uses of the regex. * Overhauled prefix handling at both discovery and matching. Namely, sets of prefix literals can now be extracted from regexes. Depending on what the prefixes look like, an Aho-Corasick DFA is built from them. (This adds a dependency on the `aho-corasick` crate.) * When appropriate, use `memchr` to jump around in the input when there is a single common byte prefix. (This adds a dependency on the `memchr` crate.) * Bring the `regex!` macro up to date. Unfortunately, it still implements the full NFA matching engine and doesn't yet have access to the new prefix DFA handling. Thus, its performance has gotten *worse* than the dynamic implementation in most cases. The docs have been updated to reflect this change. Surprisingly, all of this required exactly one new application of `unsafe`, which is isolated in the `memchr` crate. (Aho-Corasick has no `unsafe` either!) There should be *no* breaking changes in this commit. The only public facing change is the addition of a method to the `Replacer` trait, but it comes with a default implementation so that existing implementors won't break. (Its purpose is to serve as a hint as to whether or not replacement strings need to be expanded. This is crucial to speeding up simple replacements.) Closes #21.
32b86e9
to
c86c025
Compare
@alexcrichton Fantastic. I know virtually nothing about Windows, but I was hoping this meant it had If tests pass, then I think I'm ready to merge. But I'm tired and it is not wise to merge such large changes while tired, so I will wait for tomorrow. :-) |
@WillEngler Any comments are good! I'm glad you like RE Aho-Corasick: It's possible that this will be obsoleted by a really good DFA implementation. :-) |
@alexcrichton Looks like the appveyor stuff failed. I'm not sure how to debug it (and it doesn't look related to |
Oh I think it may have just needed a rebase, but looks like it's all green now! |
Major refactoring and performance improvements.
Merged and published on crates.io in |
congrats! |
Overview of changes:
collapsing empty-width matches into one instruction type.
In addition to moving instruction-matching out of the matching
engine, this makes matching engine code much simpler.
Option<char>
and clearer position handling with theInput
trait.small regexes/inputs. It's about twice as fast as the full NFA
matching engine.
This avoids costly allocations on subsequent uses of the regex.
Namely, sets of prefix literals can now be extracted from regexes.
Depending on what the prefixes look like, an Aho-Corasick DFA
is built from them.
(This adds a dependency on the
aho-corasick
crate.)memchr
to jump around in the input whenthere is a single common byte prefix.
(This adds a dependency on the
memchr
crate.)regex!
macro up to date. Unfortunately, it stillimplements the full NFA matching engine and doesn't yet have
access to the new prefix DFA handling. Thus, its performance
has gotten worse than the dynamic implementation in most
cases. The docs have been updated to reflect this change.
Surprisingly, all of this required exactly one new application of
unsafe
, which is isolated in thememchr
crate. (Aho-Corasick has nounsafe
either!)There should be no breaking changes in this commit. The only public
facing change is the addition of a method to the
Replacer
trait, butit comes with a default implementation so that existing implementors
won't break. (Its purpose is to serve as a hint as to whether or not
replacement strings need to be expanded. This is crucial to speeding
up simple replacements.)
Closes #21.