Enable fancy regex #1586
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice, there were some complaints about compiling without onig while the docs say it's optional.
impl Pattern for &Regex {
    fn find_matches(
        &self,
        inside: &str,
    ) -> Result<Vec<(Offsets, bool)>, Box<dyn Error + Send + Sync + 'static>> {
        if inside.is_empty() {
            return Ok(vec![((0, 0), false)]);
        }

        let mut prev = 0;
        let mut splits = Vec::with_capacity(inside.len());
        for match_ in self.find_iter(inside) {
            let match_ = match_?;
            let start = match_.start();
            let end = match_.end();
            if prev != start {
                splits.push(((prev, start), false));
            }
            splits.push(((start, end), true));
            prev = end;
        }
        if prev != inside.len() {
            splits.push(((prev, inside.len()), false))
        }
        Ok(splits)
    }
}
Is this copied from the onig implementation?
Indeed.
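For readers following the thread, the splitting logic in the `find_matches` diff above can be sketched without any regex engine. This is only an illustration, not the code from the PR: the function name `splits_from_matches` is hypothetical, and the matches are passed in as precomputed byte offsets (assumed sorted and non-overlapping) instead of coming from `find_iter`.

```rust
/// Byte offsets into the input, as a half-open range `(start, end)`.
type Offsets = (usize, usize);

/// Hypothetical helper mirroring the loop in the diff: given the byte
/// offsets of every match (assumed sorted and non-overlapping), return
/// all segments of `inside`, each flagged with whether it is a match.
fn splits_from_matches<I>(inside: &str, matches: I) -> Vec<(Offsets, bool)>
where
    I: IntoIterator<Item = Offsets>,
{
    // Same special case as the diff: empty input yields one empty,
    // non-matching segment.
    if inside.is_empty() {
        return vec![((0, 0), false)];
    }

    let mut prev = 0;
    let mut splits = Vec::new();
    for (start, end) in matches {
        // Text between the previous match and this one is a non-match.
        if prev != start {
            splits.push(((prev, start), false));
        }
        splits.push(((start, end), true));
        prev = end;
    }
    // Trailing text after the last match is a non-match.
    if prev != inside.len() {
        splits.push(((prev, inside.len()), false));
    }
    splits
}
```

For example, calling `splits_from_matches("a, b", vec![(1, 3)])` treats `", "` as the only match and returns three segments: the text before it, the match itself, and the text after it. Because the loop only needs offsets, the same logic applies whichever engine produced the matches.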
`tokenizers` enables the use of 2 regex engines. The main supported one is `onig`. The other one, `fancy_regex`, is unstable and is used by `unstable_wasm` (`onig`, being C, cannot be compiled as is on wasm targets).
This PR fixes the code so that the Python bindings leverage `tokenizers` regexes instead of using `onig` directly, which allows swapping one engine for the other.
Unfortunately `FancyRegex` is slower.