Commit

Auto merge of #436 - rust-lang:ag/misc-fixes, r=BurntSushi
remove regex plugin + rollup + chores

This PR:

* Removes the regex compiler plugin. It's been broken for quite some time and nobody has seemed to notice. It's time for it to go. See commit cc7b00c for details.
* Sets up a Cargo workspace for this repo.
* Updates deps in various places. This includes updating simd to `0.2.1`, which fixes a build failure on Rust nightly.
* Names the frequency-analysis-based memchr search "freqy packed."
* Rolls up the other open PRs #401, #410 and #433.
bors committed Dec 30, 2017
2 parents 83c0b2f + 4152e18 commit f3425da
Showing 34 changed files with 364 additions and 2,179 deletions.
3 changes: 3 additions & 0 deletions .travis.yml
@@ -26,3 +26,6 @@ env:
notifications:
email:
on_success: never
branches:
only:
- master
17 changes: 13 additions & 4 deletions Cargo.toml
@@ -5,7 +5,7 @@ authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
readme = "README.md"
repository = "https://github.com/rust-lang/regex"
documentation = "https://doc.rust-lang.org/regex"
documentation = "https://docs.rs/regex"
homepage = "https://github.com/rust-lang/regex"
description = """
An implementation of regular expressions for Rust. This implementation uses
@@ -17,6 +17,9 @@ categories = ["text-processing"]
travis-ci = { repository = "rust-lang/regex" }
appveyor = { repository = "rust-lang-libs/regex" }

[workspace]
members = ["bench", "regex-capi", "regex-debug", "regex-syntax"]

[dependencies]
# For very fast prefix literal matching.
aho-corasick = "0.6.0"
@@ -27,17 +30,17 @@ thread_local = "0.3.2"
# For parsing regular expressions.
regex-syntax = { path = "regex-syntax", version = "0.4.1" }
# For accelerating text search.
simd = { version = "0.1.1", optional = true }
simd = { version = "0.2.1", optional = true }
# For compiling UTF-8 decoding into automata.
utf8-ranges = "1.0.0"

[dev-dependencies]
# For examples.
lazy_static = "1"
# For property based tests.
quickcheck = { version = "0.5", default-features = false }
quickcheck = { version = "0.6", default-features = false }
# For generating random test data.
rand = "0.3.15"
rand = "0.4"

[features]
# Enable to use the unstable pattern traits defined in std.
@@ -94,5 +97,11 @@ name = "backtrack-utf8bytes"
path = "tests/test_backtrack_bytes.rs"
name = "backtrack-bytes"

[profile.release]
debug = true

[profile.bench]
debug = true

[profile.test]
debug = true
87 changes: 40 additions & 47 deletions HACKING.md
@@ -185,37 +185,36 @@ A regular expression program is essentially a sequence of opcodes produced by
the compiler plus various facts about the regular expression (such as whether
it is anchored, its capture names, etc.).

### The regex! macro (or why `regex::internal` exists)

The `regex!` macro is defined in the `regex_macros` crate as a compiler plugin,
which is maintained in this repository. The `regex!` macro compiles a regular
expression at compile time into specialized Rust code.

The `regex!` macro was written when this library was first conceived and
unfortunately hasn't changed much since then. In particular, it encodes the
entire Pike VM into stack allocated space (no heap allocation is done). When
`regex!` was first written, this provided a substantial speed boost over
so-called "dynamic" regexes compiled at runtime, and in particular had much
lower overhead per match. This was because the only matching engine at the
time was the Pike VM. The addition of other matching engines has inverted
the relationship; the `regex!` macro is almost never faster than the dynamic
variant. (In fact, it is typically substantially slower.)

In order to build the `regex!` macro this way, it must have access to some
internals of the regex library, which is in a distinct crate. (Compiler plugins
must be part of a distinct crate.) Namely, it must be able to compile a regular
expression and access its opcodes. The necessary internals are exported as part
of the top-level `internal` module in the regex library, but is hidden from
public documentation. In order to present a uniform API between programs build
by the `regex!` macro and their dynamic analoges, the `Regex` type is an enum
whose variants are hidden from public documentation.

In the future, the `regex!` macro should probably work more like Ragel, but
it's not clear how hard this is. In particular, the `regex!` macro should be
able to support all the features of dynamic regexes, which may be hard to do
with a Ragel-style implementation approach. (Which somewhat suggests that the
`regex!` macro may also need to grow conditional execution logic like the
dynamic variants, which seems rather grotesque.)
### The regex! macro

The `regex!` macro no longer exists. It was developed in a bygone era as a
compiler plugin during the infancy of the regex crate. Back then, the only
matching engine in the crate was the Pike VM. The `regex!` macro was, itself,
also a Pike VM. The only advantages it offered over the dynamic Pike VM that
was built at runtime were the following:

1. Syntax checking was done at compile time. Your Rust program wouldn't
compile if your regex didn't compile.
2. Reduction of overhead that was proportional to the size of the regex.
For the most part, this overhead consisted of heap allocation, which
was nearly eliminated in the compiler plugin.

The main takeaway here is that the compiler plugin was a marginally faster
version of a slow regex engine. As the regex crate evolved, it grew other regex
engines (DFA, bounded backtracker) and sophisticated literal optimizations.
The regex macro didn't keep pace, and it therefore became (dramatically) slower
than the dynamic engines. The only reason left to use it was for the compile
time guarantee that your regex is correct. Fortunately, Clippy (the Rust lint
tool) has a lint that checks your regular expression validity, which mostly
replaces that use case.

Additionally, the regex compiler plugin stopped receiving maintenance. Nobody
complained. At that point, it seemed prudent to just remove it.

Will a compiler plugin be brought back? The future is murky, but there is
definitely an opportunity there to build something that is faster than the
dynamic engines in some cases. But it will be challenging! As of now, there
are no plans to work on this.


## Testing
@@ -236,7 +235,6 @@ the AT&T test suite) and code generate tests for each matching engine. The
approach we use in this library is to create a Cargo.toml entry point for each
matching engine we want to test. The entry points are:

* `tests/test_plugin.rs` - tests the `regex!` macro
* `tests/test_default.rs` - tests `Regex::new`
* `tests/test_default_bytes.rs` - tests `bytes::Regex::new`
* `tests/test_nfa.rs` - tests `Regex::new`, forced to use the NFA
@@ -261,18 +259,14 @@ entry points, it can take a while to compile everything. To reduce compile
times slightly, try using `cargo test --test default`, which will only use the
`tests/test_default.rs` entry point.

N.B. To run tests for the `regex!` macro, use:

cargo test --manifest-path regex_macros/Cargo.toml


## Benchmarking

The benchmarking in this crate is made up of many micro-benchmarks. Currently,
there are two primary sets of benchmarks: the benchmarks that were adopted
at this library's inception (in `benches/src/misc.rs`) and a newer set of
at this library's inception (in `bench/src/misc.rs`) and a newer set of
benchmarks meant to test various optimizations. Specifically, the latter set
contain some analysis and are in `benches/src/sherlock.rs`. Also, the latter
contain some analysis and are in `bench/src/sherlock.rs`. Also, the latter
set are all executed on the same lengthy input whereas the former benchmarks
are executed on strings of varying length.

@@ -284,7 +278,6 @@ separately from the main regex crate.
Benchmarking follows a similarly wonky setup as tests. There are multiple entry
points:

* `bench_rust_plugin.rs` - benchmarks the `regex!` macro
* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
@@ -299,36 +292,36 @@ library benchmarks (especially RE2).
If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

$ ./run-bench rust
$ ./bench/run rust

If you want to compare your results with older benchmarks, then try:

$ ./run-bench rust | tee old
$ ./bench/run rust | tee old
$ ... make it faster
$ ./run-bench rust | tee new
$ cargo-benchcmp old new --improvements
$ ./bench/run rust | tee new
$ cargo benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

The `run-bench` utility can run benchmarks for PCRE and Oniguruma too. See
`./run-bench --help`.
The `./bench/run` utility can run benchmarks for PCRE and Oniguruma too. See
`./bench/run --help`.

## Dev Docs

When digging your teeth into the codebase for the first time, the
crate documentation can be a great resource. By default `rustdoc`
will strip out all documentation of private crate members in an
effort to help consumers of the crate focus on the *interface*
without having to concern themselves with the *implimentation*.
without having to concern themselves with the *implementation*.
Normally this is a great thing, but if you want to start hacking
on regex internals it is not what you want. Many of the private members
of this crate are well documented with rustdoc style comments, and
it would be a shame to miss out on the opportunity that presents.
You can generate the private docs with:

```
> rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
$ rustdoc --crate-name docs src/lib.rs -o target/doc -L target/debug/deps --no-defaults --passes collapse-docs --passes unindent-comments
```

Then just point your browser at `target/doc/regex/index.html`.
2 changes: 1 addition & 1 deletion PERFORMANCE.md
@@ -2,7 +2,7 @@ Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes some familiarity with the public API of this crate, which
can be found here: http://doc.rust-lang.org/regex/regex/index.html
can be found here: https://docs.rs/regex

## Theory vs. Practice

38 changes: 3 additions & 35 deletions README.md
@@ -14,13 +14,13 @@ by [RE2](https://github.com/google/re2).

### Documentation

[Module documentation with examples](https://doc.rust-lang.org/regex).
[Module documentation with examples](https://docs.rs/regex).
The module documentation also includes a comprehensive description of the syntax
supported.

Documentation with examples for the various matching functions and iterators
can be found on the
[`Regex` type](https://doc.rust-lang.org/regex/regex/struct.Regex.html).
[`Regex` type](https://docs.rs/regex/*/regex/struct.Regex.html).

### Usage

@@ -188,37 +188,6 @@ assert!(!matches.matched(5));
assert!(matches.matched(6));
```

### Usage: `regex!` compiler plugin

**WARNING**: The `regex!` compiler plugin is orders of magnitude slower than
the normal `Regex::new(...)` usage. You should not use the compiler plugin
unless you have a very special reason for doing so. The performance difference
may be the temporary, but the path forward at this point isn't clear.

The `regex!` compiler plugin will compile your regexes at compile time. **This
only works with a nightly compiler.**

Here is a small example:

```rust
#![feature(plugin)]

#![plugin(regex_macros)]
extern crate regex;

fn main() {
let re = regex!(r"(\d{4})-(\d{2})-(\d{2})");
let caps = re.captures("2010-03-14").unwrap();

assert_eq!("2010", caps[1]);
assert_eq!("03", caps[2]);
assert_eq!("14", caps[3]);
}
```

Notice that we never `unwrap` the result of `regex!`. This is because your
*program* won't compile if the regex doesn't compile. (Try `regex!("(")`.)


### Usage: a regular expression parser

@@ -228,8 +197,7 @@ execution. This may be useful if you're implementing your own regex engine or
otherwise need to do analysis on the syntax of a regular expression. It is
otherwise not recommended for general use.

[Documentation for `regex-syntax` with
examples](https://doc.rust-lang.org/regex/regex_syntax/index.html).
[Documentation for `regex-syntax` with examples](https://docs.rs/regex-syntax).

# License

5 changes: 3 additions & 2 deletions appveyor.yml
@@ -10,8 +10,9 @@ install:
- SET PATH=%PATH%;C:\MinGW\bin
- rustc -V
- cargo -V

build: false

test_script:
- cargo test --verbose --jobs 4
branches:
only:
- master
30 changes: 10 additions & 20 deletions bench/Cargo.toml
@@ -5,26 +5,27 @@ version = "0.1.0"
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
repository = "https://github.com/rust-lang/regex"
documentation = "http://doc.rust-lang.org/regex/regex/index.html"
documentation = "https://docs.rs/regex"
homepage = "https://github.com/rust-lang/regex"
description = "Regex benchmarks for Rust's and other engines."
build = "build.rs"
workspace = ".."

[dependencies]
docopt = "0.6"
lazy_static = "0.1"
docopt = "0.8"
lazy_static = "1"
libc = "0.2"
onig = { version = "1.2", optional = true }
onig = { version = "3", optional = true }
libpcre-sys = { version = "0.2", optional = true }
memmap = "0.2"
regex = { version = "0.2.0", path = "..", features = ["simd-accel"] }
regex_macros = { version = "0.2.0", path = "../regex_macros", optional = true }
regex-syntax = { version = "0.4.0", path = "../regex-syntax" }
rustc-serialize = "0.3"
serde = "1"
serde_derive = "1"

[build-dependencies]
gcc = "0.3"
pkg-config = "0.3"
cc = "1"
pkg-config = "0.3.9"

[[bin]]
name = "regex-run-one"
@@ -40,29 +41,18 @@ bench = false
# Doing anything else will probably result in weird "duplicate definition"
# compiler errors.
#
# Tip: use the run-bench script in the root of this repository to run
# benchmarks.
# Tip: use the `bench/run` script (in this directory) to run benchmarks.
[features]
re-pcre1 = ["libpcre-sys"]
re-pcre2 = []
re-onig = ["onig"]
re-re2 = []
re-rust = []
re-rust-bytes = []
re-rust-plugin = ["regex_macros"]
re-tcl = []

[[bench]]
name = "bench"
path = "src/bench.rs"
test = false
bench = true

[profile.release]
debug = true

[profile.bench]
debug = true

[profile.test]
debug = true