-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for C++ std::regex
to benchmarks
#459
Conversation
* bench/Cargo.toml: add `re-stdcpp` feature * bench/build.rs: add `cstdcpp` library to bench build * bench/compile: add `re-stdcpp` feature to bench compile script * bench/run: add `re-stdcpp` feature to bench run script * bench/src/bench.rs: use `ffi::stdcpp::Regex`, define its `text!` macro, and `Text` type * bench/src/ffi/mod.rs: declare `stdcpp` module * bench/src/ffi/stdcpp.cpp: implement C API using C++ `std::regex` * bench/src/ffi/stdcpp.rs: Rust `Regex` API implementation using C++ `std::regex` C API wrapper * bench/src/main.rs: add stdcpp to bench main * bench/src/misc.rs: - do not run `no_exponential` benchmark for `re-stdcpp` feature because `libstdc++` `std::regex` implementation currently seems to have exponential behavior here - do not run `match_class_unicode` benchmark for `re-stdcpp` feature because `std::regex` ECMAScript grammar does not support unicode character classes * bench/src/sherlock.rs: - do not run `name_sherlock_nocase`, `name_holmes_nocase`, `name_sherlock_holmes_nocase`, `name_alt3_nocase`, `name_alt4_nocase`, `name_alt5_nocase`, `the_nocase`, `everything_greedy_nl`, and `line_boundary_sherlock_holmes` benchmarks for `re-stdcpp` feature because `std::regex` ECMAScript grammar does not support inline modifier syntax - do not run `letters`, `letters_upper`, and `letters_lower` benchmarks for `re-stdcpp` feature because `std::regex` ECMAScript grammar does not support unicode character classes - use a different regex for `everything_greedy` benchmark because `std::regex` '.' does not match '\r' - `words` benchmark for `std::regex` matches RE2 test result, so use that test for `re-stdcpp` feature as well - do not run `holmes_coword_watson` benchmark for `re-stdcpp` feature because `libstdc++` `std::regex` implementation currently seems to have exponential behavior here
This is amazing! Thanks so much for doing this. I am also surprised at how slow
|
I'm not a C++ expert, but I can't spot any obvious deficiencies in your bindings either. e.g., I don't see any extra allocations going on. |
Yeah, I was pretty surprised by the results as well. The implementation here follows the RE2 one fairly closely, obviously adjusted to use the std::regex API, so I think at least those two implementations should be comparable. On the other hand, I wasn't sure how much overhead there was between the Rust native implementation and the FFI implementations, especially considering the use of unsafe blocks for the FFI implementations. I would also like to try using LLVM/Clang when building to test the |
The only overhead is probably a function call. One possibility would be to implement the iterator inside the glue code, which might give you inlining. However, this will likely only help in benchmarks that have a ton of matches. (Most don't.) Benchmarking against Clang probably won't show too much of a difference, especially with these speeds. Benchmarking Boost is a good idea though! :-) |
* bench/Cargo.toml: add `libcxx` feature * bench/build.rs: link against `libc++` when `libcxx` feature is specified with `re-stdcpp` feature. The `cc::Build::cpp_set_stdlib("c++")` method is not used because GCC does not support the `-stdlib` flag. * bench/run: add `stdcpp-libcxx` option for benchmarking `libc++` `std::regex` implementation. This requires the user to specify the compiler `CXX` and compiler flags `CXXFLAGS` necessary for compiling with `libc++`[1]. [1] https://libcxx.llvm.org/docs/UsingLibcxx.html
I've added support for linking against the
|
I've added support for benchmarking against Boost.Regex. The benchmark implementation reuses the Rust and C/C++ FFIs from
|
Small update: I've enabled some additional tests for
|
This is very similar to the `std::regex` benchmark implementation since Boost.Regex and `std::regex` have very similar APIs and regex grammar support. As such, it uses the `stdcpp` Rust and C FFIs to reduce code duplication. * bench/Cargo.toml: add `re-boost` feature * bench/build.rs: add `cboost` library to bench build. This uses a compiler preprocessor definition to indicate whether or not to use Boost when compiling the `stdcpp` FFI. * bench/compile: add `re-boost` feature to bench compile script * bench/run: add `re-boost` feature to bench run script * bench/src/bench.rs: use `ffi::stdcpp::Regex`, define its `text!` macro, and `Text` type for feature `re-boost` * bench/src/ffi/mod.rs: declare `stdcpp` module for `re-boost` feature * bench/src/ffi/stdcpp.cpp: implement C API using C++ `boost::regex`. The Boost.Regex API is very similar to the `std::regex` API and therefore only uses a different namespace. * bench/src/main.rs: add boost to bench main * bench/src/misc.rs: - do not run `match_class_unicode` benchmark for `re-boost` feature because `boost::regex` ECMAScript grammar does not support unicode character classes * bench/src/sherlock.rs: - do not run `letters`, `letters_upper`, and `letters_lower` benchmarks for `re-boost` feature because `boost::regex` ECMAScript grammar does not support unicode character classes - use a different regex for `everything_greedy` benchmark because `boost::regex` '.' does not match '\r' - `words` benchmark for `boost::regex` matches RE2 test result, so use that test for `re-boost` feature as well. Also fixes conditional compilation issue for `re-stdcpp`. - do not run `holmes_coword_watson` benchmark for `re-boost` feature because Boost.Regex implementation currently seems to have exponential behavior here
Fixed some conditional compilation issues which were not enabling some tests as I had thought. On a side note, I do not particularly like the syntax for conditional compilation in Rust. It is not obvious how it works when compared to C/C++ preprocessor directives. In any case, I have some comparisons (82 tests) between
|
A comment about libstdc++ (and libc++) |
@mkrupcale I finally had a chance to test this out and it all works! Thanks so much for doing this. :-) |
@BurntSushi no problem. Is there a reason why this pull request was closed, however? Seeing as there are benchmarks against other C and C++ regex implementations in this codebase, I'm not sure what the problem is with merging this additional set of C++ regex benchmarks. While I was doing this for my own interests, I don't see why it couldn't be useful to others and for future comparisons. |
@mkrupcale Oh! This was merged in 4e3a107, which should be linked to this PR. I did the merge manually in order to clean up the commit messages. Sadly, the Github UI doesn't really distinguish between "manually merged PR" and "closed PR without merging." |
Ah, that's my mistake. I didn't realize that was a reference to the merge commit. Apologies. |
Yeah, lately I've been trying to keep a cleaner history for all of my projects by doing the following:
Your PR however was lovingly split into 3 sensible commits. So the only way to merge that while also touching up the messages is manually. It's not clear if I will give up on this task yet or not, but so far it has been working somewhat well. :-) |
Detailed changes
re-stdcpp
featurecstdcpp
library to bench buildre-stdcpp
feature to bench compile scriptre-stdcpp
feature to bench run scriptffi::stdcpp::Regex
, define itstext!
macro, andText
typestdcpp
modulestd::regex
Regex
API implementation using C++std::regex
C API wrapperno_exponential
benchmark forre-stdcpp
feature becauselibstdc++
std::regex
implementation currently seems to have exponential behavior herematch_class_unicode
benchmark forre-stdcpp
feature becausestd::regex
ECMAScript grammar does not support unicode character classesname_sherlock_nocase
,name_holmes_nocase
,name_sherlock_holmes_nocase
,name_alt3_nocase
,name_alt4_nocase
,name_alt5_nocase
,the_nocase
,everything_greedy_nl
, andline_boundary_sherlock_holmes
benchmarks forre-stdcpp
feature becausestd::regex
ECMAScript grammar does not support inline modifier syntaxletters
,letters_upper
, andletters_lower
benchmarks forre-stdcpp
feature becausestd::regex
ECMAScript grammar does not support unicode character classeseverything_greedy
benchmark becausestd::regex
'.' does not match '\r'words
benchmark forstd::regex
matches RE2 test result, so use that test forre-stdcpp
feature as wellholmes_coword_watson
benchmark forre-stdcpp
feature becauselibstdc++
std::regex
implementation currently seems to have exponential behavior hereResults
Unfortunately, the
libstdc++
std::regex
implementation (version 7.3.1) on my machine did not do too well, often being tens or hundreds of times slower than PCRE2 or RE2. Some example results forlibstdc++
are shown here: