Search via regex-automata #211
```rust
/// NOTE: based on DFA::find_at
fn find(&self, text: RopeSlice, dfa: &impl DFA) -> Option<usize> {
    // TODO: check this inside main search
    // if dfa.is_anchored() && start > 0 {
```
Ideally I should keep these checks, but then `start` & `end` would need to be passed into `find`/`rfind` instead of doing the slicing inside `prev`/`next`.

Are optimizations within the scope of this PR, or do you want to defer them to a later PR?
```rust
    let mut state = dfa.start_state();
    let mut last_match = if dfa.is_dead_state(state) {
        return None;
    } else if dfa.is_match_state(state) {
        Some(0)
    } else {
        None
    };

    let mut chunk_byte_offset = 0;

    for chunk in text.chunks() {
        for (i, &b) in chunk.as_bytes().iter().enumerate() {
            state = unsafe { dfa.next_state_unchecked(state, b) };
            if dfa.is_match_or_dead_state(state) {
                if dfa.is_dead_state(state) {
                    return last_match;
                }
                last_match = Some(chunk_byte_offset + i + 1);
            }
        }
        chunk_byte_offset += chunk.len();
    }

    last_match
```
Not sure how good this part is. Did you take it from https://github.com/alacritty/alacritty/blob/3e867a056018c507d79396cb5c5b4b8309c609c2/alacritty_terminal/src/term/search.rs#L184-L282?
It looks about right to me, although you should definitely write tests for it.

Also, I'd suggest using `dfa.next_state` and avoiding the `unsafe` unless you can witness a measurable perf improvement. The only thing `next_state_unchecked` does is elide bounds checks.
@pickfire This is almost a direct copy of `DFA::find_at`/`rfind_at`; I've mainly just added another loop that feeds chunks into the DFA.
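The chunk-feeding loop described above can be illustrated without `regex-automata` at all. The sketch below is a minimal, stdlib-only stand-in: `next_state` is a hand-rolled transition function hard-coded for the pattern `ab` (playing the role of the `dfa: &impl DFA` parameter), and `find_chunks` mirrors `find`, with a `&[&str]` slice standing in for the rope's `chunks()` iterator. Both names are invented for this sketch.

```rust
// Hand-rolled DFA for the pattern "ab":
// state 0 = start, state 1 = just saw 'a', state 2 = just matched "ab".
fn next_state(state: usize, b: u8) -> usize {
    match (state, b) {
        (_, b'a') => 1, // an 'a' always starts (or restarts) a potential match
        (1, b'b') => 2, // a 'b' right after an 'a' completes a match
        _ => 0,         // anything else resets to the start state
    }
}

// Drive the DFA across a sequence of chunks, returning the end offset of the
// last match seen, in global byte terms (chunk offset + position + 1).
fn find_chunks(chunks: &[&str]) -> Option<usize> {
    let mut state = 0;
    let mut last_match = None;
    let mut chunk_byte_offset = 0;

    for chunk in chunks {
        for (i, &b) in chunk.as_bytes().iter().enumerate() {
            state = next_state(state, b);
            if state == 2 {
                last_match = Some(chunk_byte_offset + i + 1);
            }
        }
        chunk_byte_offset += chunk.len();
    }
    last_match
}
```

Note how a match that spans a chunk boundary (`"xxa"` followed by `"byy"`) is found without ever materializing a contiguous string — which is the whole point of driving the DFA byte by byte over rope chunks: `find_chunks(&["xxa", "byy"])` returns `Some(4)`.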
```rust
    return;
}

match Searcher::new(input) {
```
This seemed to block the input: even when I search for "hello world" (in debug mode), every character takes a few seconds and I have to wait. Maybe we should spawn it as a blocking task?

Is it possible to leverage something like `regex` to get more optimizations for now, but come back later and optimize it ourselves?
It seemed fast enough when I ran it in release mode searching for stuff within sqlite.c, but in debug mode, when I just search for "hello world", every character after "hello" blocks for a few seconds, even though it can never match (since not even "hello" is there). Maybe we could pass history into the searcher, so that if there are no results and the next query is similar (if that's possible to detect), we stop searching altogether? The regex part I don't quite understand myself. I think the best person to review this is probably @BurntSushi himself. Let's summon him and hope he can save us.
I got a panic in https://github.com/clibs/sqlite/blob/master/sqlite3.c when searching for "backtrace".
```rust
let snapshot = doc.selection(view_id).clone();

let prompt = Prompt::new(
    "search:".to_string(),
```
```diff
-    "search:".to_string(),
+    "search: ".to_string(),
```
`kak` doesn't have any space between the prompt and the text either. We can change that, but it should be done in a follow-up so we change all the prompts.
I think we can just check if the string

I'm not sure when I'll get a chance to look at this PR, but have y'all seen this section of the docs? Are those tradeoffs acceptable to you? https://docs.rs/regex-automata/0.1.10/regex_automata/#differences-with-the-regex-crate

> DFA compilation, especially when Unicode mode is enabled, can be very, very slow.

Yes, but we're looking into possible optimizations.

Sorry, if I'm going to be helpful, I think I need more context. Are there any specific questions y'all have?

Archseer would probably be better to ask, though he's asleep right now.

I prototyped some of the optimizations I suggested; it's a lot better now, I think, mainly by ignoring known-invalid queries (`*` on alphanumeric-only input). Admittedly, I don't think the
```rust
impl Searcher {
    pub fn new(pattern: &str) -> Result<Searcher, RegexError> {
        // Check case info for smart case
        let has_uppercase = pattern.chars().any(|c| c.is_uppercase());
```
N.B. This is how ripgrep does smart case detection: https://github.com/BurntSushi/ripgrep/blob/9f924ee187d4c62aa6ebe4903d0cfc6507a5adb5/crates/regex/src/ast.rs
I spotted that; we should probably copy that behavior over. The current detection matches alacritty's, so it might be worth pointing that out there as well.
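For reference, the naive detection under discussion can be sketched as a single helper. `apply_smart_case` is a name invented for this sketch, and the final assertion below demonstrates exactly the false positive that ripgrep's AST-based detection avoids: an uppercase letter inside an escape sequence is not literal text.

```rust
// Naive smart-case, as in the diff above: any uppercase character anywhere in
// the pattern makes the search case-sensitive; otherwise prepend the standard
// `(?i)` case-insensitive regex flag.
fn apply_smart_case(pattern: &str) -> String {
    if pattern.chars().any(|c| c.is_uppercase()) {
        pattern.to_string()
    } else {
        format!("(?i){}", pattern)
    }
}
```

The weakness: a pattern like `\S` contains an uppercase `S` but is not a literal, so this character-level check wrongly disables case-insensitivity for it. Walking the parsed AST and inspecting only literal nodes, as ripgrep does, fixes that.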
```rust
/// Returns the end offset of the longest match. If no match exists, then None is returned.
/// NOTE: based on DFA::find_at
fn find(&self, text: RopeSlice, dfa: &impl DFA) -> Option<usize> {
```
I believe this code is running afoul of this warning here: https://docs.rs/regex-automata/0.1.10/regex_automata/dense/enum.DenseDFA.html#the-dfa-trait

I would suggest unwrapping the `DenseDFA` enum into a `PremultipliedByteClass`, and then just using that everywhere.

(In `regex-automata` 0.2, this perf footgun will disappear, since a dense DFA will no longer be an enum. It will just be a `PremultipliedByteClass`.)
Good point, I should have paid more attention to the docs.

What else are you changing in `0.2`?
It's a total rewrite. This is the high level plan: rust-lang/regex#656

For DFAs in particular, they are getting support for everything except for Unicode word boundaries. (So `(?-u:\b)`, `^` and `$` will work.) And they will also have multi-pattern support. But otherwise, all DFAs will be pre-multiplied with byte classes enabled, and state IDs will always be `u32`. (So there are far fewer type parameters.)

And lots of compilation perf improvements. But not enough to make building large Unicode patterns bearable.
OK, so the reason why I pointed out the docs' trade-offs is that building a DFA up-front can be a serious performance footgun. You can't escape it by "implementing optimizations" on top of it. The key here is that building a DFA can take exponential time and use an exorbitant amount of memory. But maybe this is OK, I'm not sure; it depends on what you're using it for.

In rust-lang/regex#425, you mentioned

IIRC, Hyperscan has long compile times too, but probably not as bad as full DFAs. Hard to say; I've never used Hyperscan in anger. I just know it's probably the most sophisticated open source regex engine. If you can accept it in your build chain and are okay with being x86-only, then it's worth a shot? I think the real answer here is that you need to find a way to benchmark stuff. e.g., If a user types

It's a bit unfortunate that

Well, it's a ton of work. All of the SIMD stuff I do is x86-only too; there are just fallbacks for everything. With Hyperscan, AIUI, SIMD is more deeply threaded through the code.
@BurntSushi Thanks for taking a look! I've been testing it primarily with simple patterns, and it seemed to be instant in release mode, but I admit I didn't test it with more complex inputs.

This is how the user interacts with search: we show an interactive query, then on

The ideas to optimize this:

@BurntSushi Is it possible to have something like an incrementally constructed DFA or regex? Since in our case the user will continuously type or delete characters, like going from

I haven't yet decided how this is going to work at the API level. I don't even know if literal optimizations should be in the domain of something like

With that said, you don't need to examine the pattern yourself. You can extract prefixes using the
Not to my knowledge, although that isn't my area of specialty. It's possible there are some techniques for it, but I don't know them.
In my effort to port the various internal regex engines over to

One thing that would be useful to me, personally, would be a write-up on how other text editors (emacs, vim) handle their regex searches. Do they roll their own regex engine? Do they make sacrifices like putting a limit on the maximum length of a match? Do they give up search speed in exchange for more flexibility? And so on.

I just took a look at the source code of the regex engine for neovim, and these sorts of higher-level questions are really hard to answer. After reading regex source for an hour, I don't think I made any headway into answering any of these questions, other than the fact that, yes, they do roll their own regex engine.

Kakoune has a custom implementation as well; it's probably a lot more compact than the neovim one: https://github.com/mawww/kakoune/blob/master/src/regex_impl.cc

I'm tempted to merge
Yeah, the kakoune tip was helpful. It looks like both it and vim use NFA simulations, which means (I think) they give up search speed in favor of more flexibility and (much) faster compile times. For kakoune in particular, it looks like the only interface it requires is an iterator that can move forwards or backwards over the input. (And this isn't quite like a Rust

But this is a long ways off. But I'll definitely be keeping your use case in mind!
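To make the trade-off concrete: an NFA simulation keeps a *set* of live states and advances all of them one input byte at a time, so compilation is linear in the pattern size while matching pays a per-byte cost proportional to the live-state set. A stdlib-only sketch, with the NFA for the pattern `a(b|c)d` hand-compiled into a transition function (all names here are invented for illustration):

```rust
use std::collections::HashSet;

// Hand-compiled NFA for the pattern `a(b|c)d`:
// state 0 --a--> 1, state 1 --b|c--> 2, state 2 --d--> 3 (accepting).
fn nfa_step(state: usize, b: u8) -> Vec<usize> {
    match (state, b) {
        (0, b'a') => vec![1],
        (1, b'b') | (1, b'c') => vec![2],
        (2, b'd') => vec![3],
        _ => vec![],
    }
}

// Unanchored state-set simulation: at every byte, advance all live states and
// also allow a fresh match to begin. Returns the end offset of the first match.
fn nfa_find(haystack: &str) -> Option<usize> {
    let mut states: HashSet<usize> = HashSet::new();
    for (i, &b) in haystack.as_bytes().iter().enumerate() {
        states.insert(0); // a new match attempt may start at every position
        let mut next: HashSet<usize> = HashSet::new();
        for &s in &states {
            next.extend(nfa_step(s, b));
        }
        if next.contains(&3) {
            return Some(i + 1);
        }
        states = next;
    }
    None
}
```

There is no up-front determinization, so there is nothing that can blow up at compile time; the cost is that each input byte touches every live state instead of following a single precomputed transition, which is the speed-for-flexibility trade mentioned above.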
No, I mean: compiling

If we went down the self-implemented route until
Oh... That's... weird. I can't seem to reproduce that:
Ah I see... But if I compile it in reverse, it does indeed take forever. And indeed, compiling
The issue is the combination of the bounded repeat and an overlapping character preceding it. e.g., If
Such is life in the world of fully compiled DFAs. :-/
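The blowup described above — a bounded repeat whose preceding character overlaps the repeated class — can be demonstrated directly with a toy subset construction. The sketch below (stdlib-only, all names invented for illustration) builds the DFA for the classic pattern `(a|b)*a(a|b){k}` over the alphabet `{a, b}` and counts the reachable DFA states: the DFA must remember which of the last `k+1` characters were `a`, so the count is `2^(k+1)`.

```rust
use std::collections::{BTreeSet, HashSet, VecDeque};

// Count the DFA states produced by subset construction on the NFA for
// `(a|b)* a (a|b){k}`. NFA states: 0 self-loops on both letters and also
// moves to 1 on 'a'; states 1..=k advance on either letter; k+1 accepts.
fn dfa_state_count(k: usize) -> usize {
    let step = |set: &BTreeSet<usize>, c: u8| -> BTreeSet<usize> {
        let mut out = BTreeSet::new();
        for &s in set {
            if s == 0 {
                out.insert(0); // the `(a|b)*` self-loop
                if c == b'a' {
                    out.insert(1); // start the bounded repeat
                }
            } else if s <= k {
                out.insert(s + 1); // advance through the repeat
            }
        }
        out
    };

    // Breadth-first search over subsets of NFA states, i.e. over DFA states.
    let mut start = BTreeSet::new();
    start.insert(0);
    let mut seen: HashSet<BTreeSet<usize>> = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start.clone());
    queue.push_back(start);
    while let Some(set) = queue.pop_front() {
        for &c in &[b'a', b'b'] {
            let next = step(&set, c);
            if seen.insert(next.clone()) {
                queue.push_back(next);
            }
        }
    }
    seen.len()
}
```

Each increment of `k` doubles the DFA, which is why an innocuous-looking bounded repeat can make full DFA compilation take, as noted above, effectively forever.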
Going to close this draft since it has gone stale; we're relying on