Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search via regex-automata #211

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions helix-core/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ tree-sitter = "0.19"
once_cell = "1.8"
arc-swap = "1"
regex = "1"
regex-automata = "0.1"

serde = { version = "1.0", features = ["derive"] }
toml = "0.5"
Expand Down
180 changes: 180 additions & 0 deletions helix-core/src/search.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,3 +43,183 @@ pub fn find_nth_prev(text: RopeSlice, ch: char, mut pos: usize, n: usize) -> Opt

Some(pos)
}

use regex_automata::{dense, DenseDFA, Error as RegexError, DFA};
use std::ops::Range;

#[derive(Debug)]
pub struct Searcher {
/// Locate end of match searching right.
right_fdfa: DenseDFA<Vec<usize>, usize>,
/// Locate start of match searching right.
right_rdfa: DenseDFA<Vec<usize>, usize>,

/// Locate start of match searching left.
left_fdfa: DenseDFA<Vec<usize>, usize>,
/// Locate end of match searching left.
left_rdfa: DenseDFA<Vec<usize>, usize>,
}

impl Searcher {
pub fn new(pattern: &str) -> Result<Searcher, RegexError> {
// Check case info for smart case
let has_uppercase = pattern.chars().any(|c| c.is_uppercase());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted that, should probably copy that behavior over. The current detection matches alacritty, might be worth pointing out there as well.


// Create Regex DFAs for all search directions.
let mut builder = dense::Builder::new();
let builder = builder.case_insensitive(!has_uppercase).unicode(false);

let left_fdfa = builder.clone().reverse(true).build(pattern)?;
let left_rdfa = builder
.clone()
.anchored(true)
.longest_match(true)
.build(pattern)?;

let right_fdfa = builder.clone().build(pattern)?;
let right_rdfa = builder
.anchored(true)
.longest_match(true)
.reverse(true)
.build(pattern)?;

Ok(Searcher {
right_fdfa,
right_rdfa,
left_fdfa,
left_rdfa,
})
}
pub fn search_prev(&self, text: RopeSlice, offset: usize) -> Option<Range<usize>> {
let text = text.slice(..offset);
let start = self.rfind(text, &self.left_fdfa)?;
let end = self.find(text.slice(start..), &self.left_rdfa)?;

Some(start..start + end)
}

pub fn search_next(&self, text: RopeSlice, offset: usize) -> Option<Range<usize>> {
let text = text.slice(offset..);
let end = self.find(text, &self.right_fdfa)?;
let start = self.rfind(text.slice(..end), &self.right_rdfa)?;

Some(offset + start..offset + end)
}

/// Returns the end offset of the longest match. If no match exists, then None is returned.
/// NOTE: based on DFA::find_at
fn find(&self, text: RopeSlice, dfa: &impl DFA) -> Option<usize> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this code is running afoul of this warning here: https://docs.rs/regex-automata/0.1.10/regex_automata/dense/enum.DenseDFA.html#the-dfa-trait

I would suggest unwrapping the DenseDFA enum into a PremultipliedByteClass, and then just that everywhere.

(In regex-automata 0.2, this perf footgun will disappear since a dense DFA will no longer be an enum. It will just be a PremultipliedByteClass.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I should have paid more attention to the docs.

What else are you changing in 0.2?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a total rewrite. This is the high level plan: rust-lang/regex#656

For DFAs in particular, they are getting support for everything except for Unicode word boundary. (So (?-u:\b), ^ and $ will work.) And they will also have multi-pattern support. But otherwise, all DFAs will be pre-multiplied with byte classes enabled, and state IDs will always be u32. (So there are much fewer type parameters.)

And lots of compilation perf improvements. But not enough to make building large Unicode patterns bearable.

// TODO: check this inside main search
// if dfa.is_anchored() && start > 0 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I should keep these checks, but then start & end need to be passed into find/rfind instead of doing the slicing inside prev/next

// return None;
// }

let mut state = dfa.start_state();
let mut last_match = if dfa.is_dead_state(state) {
return None;
} else if dfa.is_match_state(state) {
Some(0)
} else {
None
};

let mut chunk_byte_offset = 0;

for chunk in text.chunks() {
for (i, &b) in chunk.as_bytes().iter().enumerate() {
state = unsafe { dfa.next_state_unchecked(state, b) };
if dfa.is_match_or_dead_state(state) {
if dfa.is_dead_state(state) {
return last_match;
}
last_match = Some(chunk_byte_offset + i + 1);
}
}
chunk_byte_offset += chunk.len();
}

last_match
Comment on lines +117 to +141
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks about right to me, although you should definitely write tests for it.

Also, I'd suggest using dfa.next_state and avoid the unsafe unless you can witness a measurable perf improvement. The only thing next_state_unchecked does is elide bound checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pickfire This is almost a direct copy of DFA::find_at/rfind_at, I've mainly just added another loop that feeds chunks into the dfa.

}

/// Returns the start offset of the longest match in reverse, by searching from the end of the
/// input towards the start of the input. If no match exists, then None is returned. In other
/// words, this has the same match semantics as find, but in reverse.
///
/// NOTE: based on DFA::rfind_at
fn rfind(&self, text: RopeSlice, dfa: &impl DFA) -> Option<usize> {
// if dfa.is_anchored() && start < bytes.len() {
// return None;
// }

let mut state = dfa.start_state();
let mut last_match = if dfa.is_dead_state(state) {
return None;
} else if dfa.is_match_state(state) {
Some(text.len_bytes())
} else {
None
};

// This is basically chunks().rev()
let (mut chunks, mut chunk_byte_offset, _, _) = text.chunks_at_byte(text.len_bytes());

while let Some(chunk) = chunks.prev() {
for (i, &b) in chunk.as_bytes().iter().rev().enumerate() {
state = unsafe { dfa.next_state_unchecked(state, b) };
if dfa.is_match_or_dead_state(state) {
if dfa.is_dead_state(state) {
return last_match;
}
last_match = Some(chunk_byte_offset - i - 1);
}
}
chunk_byte_offset -= chunk.len();
}
last_match
}
}

#[cfg(test)]
mod test {
use super::*;

#[test]
fn test_search_next() {
use crate::Rope;
let text = Rope::from("hello world!");

let searcher = Searcher::new(r"\w+").unwrap();

let result = searcher.search_next(text.slice(..), 0).unwrap();
let fragment = text.slice(result.start..result.end);
assert_eq!("hello", fragment);

let result = searcher.search_next(text.slice(..), result.end).unwrap();
let fragment = text.slice(result.start..result.end);
assert_eq!("world", fragment);

let result = searcher.search_next(text.slice(..), result.end);
assert!(result.is_none());
}

#[test]
fn test_search_prev() {
use crate::Rope;
let text = Rope::from("hello world!");

let searcher = Searcher::new(r"\w+").unwrap();

let result = searcher
.search_prev(text.slice(..), text.len_bytes())
.unwrap();
let fragment = text.slice(result.start..result.end);
assert_eq!("world", fragment);

let result = searcher.search_prev(text.slice(..), result.start).unwrap();
let fragment = text.slice(result.start..result.end);
assert_eq!("hello", fragment);

let result = searcher.search_prev(text.slice(..), result.start);
assert!(result.is_none());
}
}
143 changes: 93 additions & 50 deletions helix-term/src/commands.rs
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,9 @@ impl Command {
split_selection_on_newline, "Split selection on newlines",
search, "Search for regex pattern",
search_next, "Select next search match",
search_prev, "Select previous search match",
extend_search_next, "Add next search match to selection",
extend_search_prev, "Add previous search match to selection",
search_selection, "Use current selection as search pattern",
extend_line, "Select current line, if already selected, extend to next line",
extend_to_line_bounds, "Extend selection to line bounds (line-wise selection)",
Expand Down Expand Up @@ -1018,25 +1020,91 @@ fn split_selection_on_newline(cx: &mut Context) {
doc.set_selection(view.id, selection);
}

fn search_impl(doc: &mut Document, view: &mut View, contents: &str, regex: &Regex, extend: bool) {
// -> we always search from after the cursor.head
// TODO: be able to use selection as search query (*/alt *)

use helix_core::search::Searcher;

pub fn search(cx: &mut Context) {
let (view, doc) = current!(cx.editor);

// TODO: could probably share with select_on_matches?

let view_id = view.id;
let snapshot = doc.selection(view_id).clone();

let prompt = Prompt::new(
"search:".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"search:".to_string(),
"search: ".to_string(),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kak doesn't have any space between the prompt and the text either. We can change that but it should be done in a follow-up so we change all the prompts.

Some('\\'),
|_input: &str| Vec::new(), // this is fine because Vec::new() doesn't allocate
move |cx: &mut compositor::Context, input: &str, event: PromptEvent| {
match event {
PromptEvent::Abort => {
let (view, doc) = current!(cx.editor);
doc.set_selection(view.id, snapshot.clone());
}
PromptEvent::Validate => {
// TODO: push_jump to store selection just before jump
}
PromptEvent::Update => {
// skip empty input, TODO: trigger default
if input.is_empty() {
return;
}

match Searcher::new(input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to be blocking the input, even when I search (debug) "hello world" every character takes like few seconds and I need to wait. Maybe we should spawn it as a blocking task?

Ok(searcher) => {
let (view, doc) = current!(cx.editor);
// revert state to what it was before the last update
// TODO: also revert text
doc.set_selection(view.id, snapshot.clone());

cx.editor.search = Some(searcher);
_search(cx.editor, Direction::Forward, false);

// TODO: only store on enter (accept), not update
// cx.editor.registers.write('\\', vec![input.to_string()]);
}
Err(_err) => (), // TODO: mark command line as error
}
}
}
},
);

cx.push_layer(Box::new(prompt));
}

pub fn _search(editor: &mut Editor, direction: Direction, extend: bool) {
let (view, doc) = current!(editor);
let text = doc.text().clone(); // need to clone or we run into borrowing issues, but it's a cheap clone
let cursor = doc.selection(view.id).primary().cursor(text.slice(..));
let start = text.char_to_byte(cursor);

let mat = if let Some(searcher) = &editor.search {
// use find_at to find the next match after the cursor, loop around the end
// Careful, `Regex` uses `bytes` as offsets, not character indices!
match direction {
Direction::Backward => searcher
.search_prev(text.slice(..), start)
.or_else(|| searcher.search_prev(text.slice(..), text.len_bytes())),
Direction::Forward => searcher
.search_next(text.slice(..), start)
.or_else(|| searcher.search_next(text.slice(..), 0)),
}
} else {
None
};

// refetch to avoid borrowing problems
let (view, doc) = current!(editor);
let text = doc.text().slice(..);
let selection = doc.selection(view.id);

// Get the right side of the primary block cursor.
let start = text.char_to_byte(graphemes::next_grapheme_boundary(
text,
selection.primary().cursor(text),
));

// use find_at to find the next match after the cursor, loop around the end
// Careful, `Regex` uses `bytes` as offsets, not character indices!
let mat = regex
.find_at(contents, start)
.or_else(|| regex.find(contents));
// TODO: message on wraparound
if let Some(mat) = mat {
let start = text.byte_to_char(mat.start());
let end = text.byte_to_char(mat.end());
let start = text.byte_to_char(mat.start);
let end = text.byte_to_char(mat.end);

if end == 0 {
// skip empty matches that don't make sense
Expand All @@ -1052,48 +1120,22 @@ fn search_impl(doc: &mut Document, view: &mut View, contents: &str, regex: &Rege
doc.set_selection(view.id, selection);
align_view(doc, view, Align::Center);
};
view.ensure_cursor_in_view(doc);
}

// TODO: use one function for search vs extend
fn search(cx: &mut Context) {
let (_, doc) = current!(cx.editor);

// TODO: could probably share with select_on_matches?

// HAXX: sadly we can't avoid allocating a single string for the whole buffer since we can't
// feed chunks into the regex yet
let contents = doc.text().slice(..).to_string();

let prompt = ui::regex_prompt(
cx,
"search:".to_string(),
move |view, doc, registers, regex| {
search_impl(doc, view, &contents, &regex, false);
// TODO: only store on enter (accept), not update
registers.write('\\', vec![regex.as_str().to_string()]);
},
);

cx.push_layer(Box::new(prompt));
pub fn search_next(cx: &mut Context) {
_search(cx.editor, Direction::Forward, false);
}

fn search_next_impl(cx: &mut Context, extend: bool) {
let (view, doc) = current!(cx.editor);
let registers = &mut cx.editor.registers;
if let Some(query) = registers.read('\\') {
let query = query.first().unwrap();
let contents = doc.text().slice(..).to_string();
let regex = Regex::new(query).unwrap();
search_impl(doc, view, &contents, &regex, extend);
}
pub fn extend_search_next(cx: &mut Context) {
_search(cx.editor, Direction::Forward, true);
}

fn search_next(cx: &mut Context) {
search_next_impl(cx, false);
pub fn search_prev(cx: &mut Context) {
_search(cx.editor, Direction::Backward, false);
}

fn extend_search_next(cx: &mut Context) {
search_next_impl(cx, true);
pub fn extend_search_prev(cx: &mut Context) {
_search(cx.editor, Direction::Backward, true);
}

fn search_selection(cx: &mut Context) {
Expand All @@ -1102,7 +1144,8 @@ fn search_selection(cx: &mut Context) {
let query = doc.selection(view.id).primary().fragment(contents);
let regex = regex::escape(&query);
cx.editor.registers.write('\\', vec![regex]);
search_next(cx);
let msg = format!("register '{}' set to '{}'", '\\', query);
cx.editor.set_status(msg);
}

fn extend_line(cx: &mut Context) {
Expand Down
Loading