Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an implimentation of Tuned Boyer-Moore. #419

Merged
merged 1 commit into from
Dec 11, 2017

Conversation

ethanpailes
Copy link
Contributor

While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: #408
See: ripgrep-617

@@ -5,7 +5,7 @@ version = "0.1.0"
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
repository = "https://github.com/rust-lang/regex"
documentation = "http://doc.rust-lang.org/regex/regex_syntax/index.html"
documentation = "http://doc.rust-lang.org/regex/regex/index.html"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a random nit I had when reading the cargo file. It seems odd for the benchmark suite to point to the syntax crate for documentation. I'm not at all attached to this change.

@@ -514,7 +514,7 @@ another matching engine with fixed memory requirements.
extern crate aho_corasick;
extern crate memchr;
extern crate thread_local;
#[cfg(test)] extern crate quickcheck;
#[macro_use] #[cfg(test)] extern crate quickcheck;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required to use the quickcheck! macro. If there is a reason to avoid pulling quickcheck macros I can refactor to use the quickcheck function.

@@ -392,7 +402,7 @@ impl SingleByteSet {
///
/// TODO(burntsushi): Add some amount of shifting to this.
#[derive(Clone, Debug)]
pub struct SingleSearch {
pub struct MemchrSearch {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this guy because I thought the old name would be confusing in the presense of multiple search algorithms focused on finding a single literal.

@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch 2 times, most recently from e7c07d8 to 89d6df6 Compare November 29, 2017 03:33
@ethanpailes
Copy link
Contributor Author

@BurntSushi, do you have time to give this a quick "seems worthwhile" or "not right now"? It doesn't have to be a full review.

@BurntSushi
Copy link
Member

@ethanpailes Thanks for the ping! Looking over this, I think this actually looks pretty dang good. I don't have any specific comments so I think we should just bring it in. Thank you so much for doing this. :-) @bors r+

@bors
Copy link
Contributor

bors commented Dec 8, 2017

📌 Commit 89d6df6 has been approved by BurntSushi

@bors
Copy link
Contributor

bors commented Dec 8, 2017

⌛ Testing commit 89d6df6 with merge f4efb78...

bors added a commit that referenced this pull request Dec 8, 2017
Add an implimentation of Tuned Boyer-Moore.

While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: #408
See: [ripgrep-617](BurntSushi/ripgrep#617)
@bors
Copy link
Contributor

bors commented Dec 8, 2017

💔 Test failed - status-appveyor

@ethanpailes ethanpailes changed the title Add an implimentation of Tuned Boyer-Moore. [WIP] Add an implimentation of Tuned Boyer-Moore. Dec 8, 2017
@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch from 89d6df6 to b6e5353 Compare December 8, 2017 17:57
@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch 4 times, most recently from 362e6b4 to 37f6976 Compare December 8, 2017 23:31
@ethanpailes ethanpailes changed the title [WIP] Add an implimentation of Tuned Boyer-Moore. Add an implimentation of Tuned Boyer-Moore. Dec 8, 2017
@ethanpailes ethanpailes changed the title Add an implimentation of Tuned Boyer-Moore. [WIP] Add an implimentation of Tuned Boyer-Moore. Dec 8, 2017
@ethanpailes ethanpailes changed the title [WIP] Add an implimentation of Tuned Boyer-Moore. Add an implimentation of Tuned Boyer-Moore. Dec 8, 2017
@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch from 37f6976 to 286cccf Compare December 9, 2017 02:02
@ethanpailes
Copy link
Contributor Author

@BurntSushi, assuming tests pass, I'm now happy with this. The turnover length for patterns of uncommon chars seems to be around 7 or 8.

I want to take another crack at finding an absolute length turnover where TBM always wins even if the pattern chars are not that common, but that can go in another PR.

@BurntSushi
Copy link
Member

@ethanpailes Looks like quickcheck caught a bug?

While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
@ethanpailes
Copy link
Contributor Author

So it does, I'll take another look.

@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch from 286cccf to 597029c Compare December 9, 2017 18:08
@ethanpailes
Copy link
Contributor Author

I managed to blow away all my bug fixes when I updated the heuristic. I'm so good at git. Sorry about that.

@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch 2 times, most recently from 468ab20 to 2a3cc31 Compare December 9, 2017 21:11
@ethanpailes ethanpailes force-pushed the boyer-moore-integration branch from 2a3cc31 to d5be839 Compare December 9, 2017 21:47
@ethanpailes
Copy link
Contributor Author

@BurntSushi, I've had quickcheck running in a loop locally for a few hours, so it should be good to go now.

@BurntSushi
Copy link
Member

Thanks! Awesome work! @bors r+

@bors
Copy link
Contributor

bors commented Dec 11, 2017

📌 Commit d5be839 has been approved by BurntSushi

@bors
Copy link
Contributor

bors commented Dec 11, 2017

⌛ Testing commit d5be839 with merge 83c0b2f...

bors added a commit that referenced this pull request Dec 11, 2017
Add an implimentation of Tuned Boyer-Moore.

While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: #408
See: [ripgrep-617](BurntSushi/ripgrep#617)
@bors
Copy link
Contributor

bors commented Dec 11, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: BurntSushi
Pushing 83c0b2f to master...

@bors bors merged commit d5be839 into rust-lang:master Dec 11, 2017
@ethanpailes ethanpailes deleted the boyer-moore-integration branch December 11, 2017 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants