Auto merge of #310 - BurntSushi:rfc, r=BurntSushi
regex 0.2

0.2.0
=====
This is a new major release of the regex crate, and is an implementation of the
[regex 1.0 RFC](https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md).
We are releasing a `0.2` first, and if there are no major problems, we will
release a `1.0` shortly. For `0.2`, the minimum *supported* Rust version is
1.12.

There are a number of **breaking changes** in `0.2`. They are split into two
types. The first type corresponds to breaking changes in regular expression
syntax. The second type corresponds to breaking changes in the API.

Breaking changes for regex syntax:

* POSIX character classes now require double bracketing. Previously, the regex
  `[:upper:]` would parse as the `upper` POSIX character class. Now it parses
  as the character class containing the characters `:upper:`. The fix to this
  change is to use `[[:upper:]]` instead. Note that variants like
  `[[:upper:][:blank:]]` continue to work.
* The character `[` must always be escaped inside a character class.
* The characters `&`, `-` and `~` must be escaped if any one of them is
  repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all
  equivalent while `[&&]` is illegal. (The motivation for this and the prior
  change is to provide a backwards compatible path for adding character class
  set notation.)
* A `bytes::Regex` now has Unicode mode enabled by default (like the main
  `Regex` type). This means regexes compiled with `bytes::Regex::new` that
  don't have the Unicode flag set should add `(?-u)` to recover the original
  behavior.

Breaking changes for the regex API:

* `find` and `find_iter` now **return `Match` values instead of
  `(usize, usize)`.** `Match` values have `start` and `end` methods, which
  return the match offsets. `Match` values also have an `as_str` method,
  which returns the text of the match itself.
* The `Captures` type now only provides a single iterator over all capturing
  matches, which should replace uses of `iter` and `iter_pos`. Uses of
  `iter_named` should use the `capture_names` method on `Regex`.
* The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant
  is returned when no replacements are made.
* The `Replacer` trait has been completely overhauled. This should only
  impact clients that implement this trait explicitly. Standard uses of
  the `replace` methods should continue to work unchanged.
* The `quote` free function has been renamed to `escape`.
* The `Regex::with_size_limit` method has been removed. It is replaced by
  `RegexBuilder::size_limit`.
* The `RegexBuilder` type has switched from owned `self` method receivers to
  `&mut self` method receivers. Most uses will continue to work unchanged, but
  some code may require naming an intermediate variable to hold the builder.
* The free `is_match` function has been removed. It is replaced by compiling
  a `Regex` and calling its `is_match` method.
* The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied
  on these impls, the fix is to define a wrapper type around `Regex`, impl
  `Deref` on it and provide the necessary impls.
* The `is_empty` method on `Captures` has been removed. This always returns
  `false`, so its use is superfluous.
* The `Syntax` variant of the `Error` type now contains a string instead of
  a `regex_syntax::Error`. If you were examining syntax errors more closely,
  you'll need to explicitly use the `regex_syntax` crate to re-parse the regex.
* The `InvalidSet` variant of the `Error` type has been removed since it is
  no longer used.
* Most of the iterator types have been renamed to match conventions. If you
  were using these iterator types explicitly, please consult the documentation
  for their new names. For example, `RegexSplits` has been renamed to `Split`.
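
To illustrate the `Cow` return type mentioned in the list above, here is a minimal std-only sketch. `replace_literal` is a hypothetical helper, not part of the regex crate, and it does plain substring replacement rather than regex matching. It only shows the general pattern of returning `Cow::Borrowed` when nothing was replaced and `Cow::Owned` otherwise.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not the regex crate's `replace`): replaces every
/// occurrence of the literal `from` with `to`, allocating only on change.
pub fn replace_literal<'t>(text: &'t str, from: &str, to: &str) -> Cow<'t, str> {
    if from.is_empty() || !text.contains(from) {
        // Nothing to replace: hand the input back without allocating.
        return Cow::Borrowed(text);
    }
    // At least one occurrence exists, so build a new owned string.
    Cow::Owned(text.replace(from, to))
}

/// Callers that need an owned `String` can convert explicitly.
pub fn replace_literal_owned(text: &str, from: &str, to: &str) -> String {
    replace_literal(text, from, to).into_owned()
}
```

Code that only reads the result can dereference the `Cow` as a `&str` in both cases, so many call sites need no change. Code that stores the result in a `String` can call `into_owned()`, which copies only when the value is still borrowed.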

A number of bugs have been fixed:

* [BUG #151](#151):
  The `Replacer` trait has been changed to permit the caller to control
  allocation.
* [BUG #165](#165):
  Remove the free `is_match` function.
* [BUG #166](#166):
  Expose more knobs (available in `0.1`) and remove `with_size_limit`.
* [BUG #168](#168):
  Iterators produced by `Captures` now have the correct lifetime parameters.
* [BUG #175](#175):
  Fix a corner case in the parsing of POSIX character classes.
* [BUG #178](#178):
  Drop the `PartialEq` and `Eq` impls on `Regex`.
* [BUG #179](#179):
  Remove `is_empty` from `Captures` since it always returns false.
* [BUG #276](#276):
  Position of named capture can now be retrieved from a `Captures`.
* [BUG #296](#296):
  Remove winapi/kernel32-sys dependency on UNIX.
* [BUG #307](#307):
  Fix error on emscripten.
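
The note above about `RegexBuilder` moving to `&mut self` receivers can be sketched without the crate. `PatternBuilder` and `Config` below are hypothetical stand-ins, not the regex crate's types, and their fields and defaults are assumptions chosen for illustration. The sketch shows why a single chained expression keeps working while step-by-step configuration needs a named variable.

```rust
/// Hypothetical result type produced by the builder.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub pattern: String,
    pub size_limit: usize,
    pub case_insensitive: bool,
}

/// Hypothetical builder whose setters take `&mut self`.
#[derive(Debug)]
pub struct PatternBuilder {
    pattern: String,
    size_limit: usize,
    case_insensitive: bool,
}

impl PatternBuilder {
    pub fn new(pattern: &str) -> PatternBuilder {
        PatternBuilder { pattern: pattern.to_string(), size_limit: 0, case_insensitive: false }
    }

    /// Setters mutate in place and return `&mut Self` so calls can chain.
    pub fn size_limit(&mut self, limit: usize) -> &mut PatternBuilder {
        self.size_limit = limit;
        self
    }

    pub fn case_insensitive(&mut self, yes: bool) -> &mut PatternBuilder {
        self.case_insensitive = yes;
        self
    }

    /// `build` borrows the builder and returns an owned value.
    pub fn build(&self) -> Config {
        Config {
            pattern: self.pattern.clone(),
            size_limit: self.size_limit,
            case_insensitive: self.case_insensitive,
        }
    }
}

/// A single chained expression works: the temporary builder lives until
/// the end of the statement, and `build` returns an owned `Config`.
pub fn one_expression() -> Config {
    let config = PatternBuilder::new("a+").size_limit(1 << 20).build();
    config
}

/// Conditional configuration needs a named builder, because the setters
/// return a reference rather than the builder itself.
pub fn conditional(ignore_case: bool) -> Config {
    let mut builder = PatternBuilder::new("a+");
    builder.size_limit(1 << 20);
    if ignore_case {
        builder.case_insensitive(true);
    }
    builder.build()
}
```

With owned `self` receivers, a partially configured chain could be stored directly in a variable. With `&mut self` receivers, the chain yields a `&mut PatternBuilder` pointing at a temporary, so the builder has to be bound to a name first when configuration spans several statements.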
bors committed Dec 31, 2016
2 parents e2f0850 + ac3ab6d commit 52fdae7
Showing 52 changed files with 2,018 additions and 1,608 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,6 +1,6 @@
language: rust
rust:
- 1.3.0
- 1.12.0
- stable
- beta
- nightly
143 changes: 119 additions & 24 deletions CHANGELOG.md
@@ -1,6 +1,101 @@
0.2.0
=====
This is a new major release of the regex crate, and is an implementation of the
[regex 1.0 RFC](https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md).
We are releasing a `0.2` first, and if there are no major problems, we will
release a `1.0` shortly. For `0.2`, the minimum *supported* Rust version is
1.12.

There are a number of **breaking changes** in `0.2`. They are split into two
types. The first type corresponds to breaking changes in regular expression
syntax. The second type corresponds to breaking changes in the API.

Breaking changes for regex syntax:

* POSIX character classes now require double bracketing. Previously, the regex
`[:upper:]` would parse as the `upper` POSIX character class. Now it parses
as the character class containing the characters `:upper:`. The fix to this
change is to use `[[:upper:]]` instead. Note that variants like
`[[:upper:][:blank:]]` continue to work.
* The character `[` must always be escaped inside a character class.
* The characters `&`, `-` and `~` must be escaped if any one of them is
repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all
equivalent while `[&&]` is illegal. (The motivation for this and the prior
change is to provide a backwards compatible path for adding character class
set notation.)
* A `bytes::Regex` now has Unicode mode enabled by default (like the main
`Regex` type). This means regexes compiled with `bytes::Regex::new` that
don't have the Unicode flag set should add `(?-u)` to recover the original
behavior.

Breaking changes for the regex API:

* `find` and `find_iter` now **return `Match` values instead of
`(usize, usize)`.** `Match` values have `start` and `end` methods, which
return the match offsets. `Match` values also have an `as_str` method,
which returns the text of the match itself.
* The `Captures` type now only provides a single iterator over all capturing
matches, which should replace uses of `iter` and `iter_pos`. Uses of
`iter_named` should use the `capture_names` method on `Regex`.
* The `at` method on the `Captures` type has been renamed to `get`, and it
now returns a `Match`. Similarly, the `name` method on `Captures` now returns
a `Match`.
* The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant
is returned when no replacements are made.
* The `Replacer` trait has been completely overhauled. This should only
impact clients that implement this trait explicitly. Standard uses of
the `replace` methods should continue to work unchanged. If you implement
the `Replacer` trait, please consult the new documentation.
* The `quote` free function has been renamed to `escape`.
* The `Regex::with_size_limit` method has been removed. It is replaced by
`RegexBuilder::size_limit`.
* The `RegexBuilder` type has switched from owned `self` method receivers to
`&mut self` method receivers. Most uses will continue to work unchanged, but
some code may require naming an intermediate variable to hold the builder.
* The free `is_match` function has been removed. It is replaced by compiling
a `Regex` and calling its `is_match` method.
* The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied
on these impls, the fix is to define a wrapper type around `Regex`, impl
`Deref` on it and provide the necessary impls.
* The `is_empty` method on `Captures` has been removed. This always returns
`false`, so its use is superfluous.
* The `Syntax` variant of the `Error` type now contains a string instead of
a `regex_syntax::Error`. If you were examining syntax errors more closely,
you'll need to explicitly use the `regex_syntax` crate to re-parse the regex.
* The `InvalidSet` variant of the `Error` type has been removed since it is
no longer used.
* Most of the iterator types have been renamed to match conventions. If you
were using these iterator types explicitly, please consult the documentation
for their new names. For example, `RegexSplits` has been renamed to `Split`.

A number of bugs have been fixed:

* [BUG #151](https://github.com/rust-lang/regex/issues/151):
The `Replacer` trait has been changed to permit the caller to control
allocation.
* [BUG #165](https://github.com/rust-lang/regex/issues/165):
Remove the free `is_match` function.
* [BUG #166](https://github.com/rust-lang/regex/issues/166):
Expose more knobs (available in `0.1`) and remove `with_size_limit`.
* [BUG #168](https://github.com/rust-lang/regex/issues/168):
Iterators produced by `Captures` now have the correct lifetime parameters.
* [BUG #175](https://github.com/rust-lang/regex/issues/175):
Fix a corner case in the parsing of POSIX character classes.
* [BUG #178](https://github.com/rust-lang/regex/issues/178):
Drop the `PartialEq` and `Eq` impls on `Regex`.
* [BUG #179](https://github.com/rust-lang/regex/issues/179):
Remove `is_empty` from `Captures` since it always returns false.
* [BUG #276](https://github.com/rust-lang/regex/issues/276):
Position of named capture can now be retrieved from a `Captures`.
* [BUG #296](https://github.com/rust-lang/regex/issues/296):
Remove winapi/kernel32-sys dependency on UNIX.
* [BUG #307](https://github.com/rust-lang/regex/issues/307):
Fix error on emscripten.


0.1.80
======
* [PR #292](https://github.com/rust-lang-nursery/regex/pull/292):
* [PR #292](https://github.com/rust-lang/regex/pull/292):
Fixes bug #291, which was introduced by PR #290.

0.1.79
@@ -9,13 +104,13 @@

0.1.78
======
* [PR #290](https://github.com/rust-lang-nursery/regex/pull/290):
* [PR #290](https://github.com/rust-lang/regex/pull/290):
Fixes bug #289, which caused some regexes with a certain combination
of literals to match incorrectly.

0.1.77
======
* [PR #281](https://github.com/rust-lang-nursery/regex/pull/281):
* [PR #281](https://github.com/rust-lang/regex/pull/281):
Fixes bug #280 by disabling all literal optimizations when a pattern
is partially anchored.

@@ -25,9 +120,9 @@

0.1.75
======
* [PR #275](https://github.com/rust-lang-nursery/regex/pull/275):
* [PR #275](https://github.com/rust-lang/regex/pull/275):
Improves match verification performance in the Teddy SIMD searcher.
* [PR #278](https://github.com/rust-lang-nursery/regex/pull/278):
* [PR #278](https://github.com/rust-lang/regex/pull/278):
Replaces slow substring loop in the Teddy SIMD searcher with Aho-Corasick.
* Implemented DoubleEndedIterator on regex set match iterators.

@@ -36,7 +131,7 @@
* Release regex-syntax 0.3.5 with a minor bug fix.
* Fix bug #272.
* Fix bug #277.
* [PR #270](https://github.com/rust-lang-nursery/regex/pull/270):
* [PR #270](https://github.com/rust-lang/regex/pull/270):
Fixes bugs #264, #268 and an unreported bug where the DFA cache size could be
drastically underestimated in some cases (leading to unexpectedly high memory
usage).
@@ -48,55 +143,55 @@

0.1.72
======
* [PR #262](https://github.com/rust-lang-nursery/regex/pull/262):
* [PR #262](https://github.com/rust-lang/regex/pull/262):
Fixes a number of small bugs caught by fuzz testing (AFL).

0.1.71
======
* [PR #236](https://github.com/rust-lang-nursery/regex/pull/236):
* [PR #236](https://github.com/rust-lang/regex/pull/236):
Fix a bug in how suffix literals were extracted, which could lead
to invalid match behavior in some cases.

0.1.70
======
* [PR #231](https://github.com/rust-lang-nursery/regex/pull/231):
* [PR #231](https://github.com/rust-lang/regex/pull/231):
Add SIMD accelerated multiple pattern search.
* [PR #228](https://github.com/rust-lang-nursery/regex/pull/228):
* [PR #228](https://github.com/rust-lang/regex/pull/228):
Reintroduce the reverse suffix literal optimization.
* [PR #226](https://github.com/rust-lang-nursery/regex/pull/226):
* [PR #226](https://github.com/rust-lang/regex/pull/226):
Implements NFA state compression in the lazy DFA.
* [PR #223](https://github.com/rust-lang-nursery/regex/pull/223):
* [PR #223](https://github.com/rust-lang/regex/pull/223):
A fully anchored RegexSet can now short-circuit.

0.1.69
======
* [PR #216](https://github.com/rust-lang-nursery/regex/pull/216):
* [PR #216](https://github.com/rust-lang/regex/pull/216):
Tweak the threshold for running backtracking.
* [PR #217](https://github.com/rust-lang-nursery/regex/pull/217):
* [PR #217](https://github.com/rust-lang/regex/pull/217):
Add upper limit (from the DFA) to capture search (for the NFA).
* [PR #218](https://github.com/rust-lang-nursery/regex/pull/218):
* [PR #218](https://github.com/rust-lang/regex/pull/218):
Add rure, a C API.

0.1.68
======
* [PR #210](https://github.com/rust-lang-nursery/regex/pull/210):
* [PR #210](https://github.com/rust-lang/regex/pull/210):
Fixed a performance bug in `bytes::Regex::replace` where `extend` was used
instead of `extend_from_slice`.
* [PR #211](https://github.com/rust-lang-nursery/regex/pull/211):
* [PR #211](https://github.com/rust-lang/regex/pull/211):
Fixed a bug in the handling of word boundaries in the DFA.
* [PR #213](https://github.com/rust-lang-nursery/regex/pull/213):
* [PR #213](https://github.com/rust-lang/regex/pull/213):
Added RE2 and Tcl to the benchmark harness. Also added a CLI utility for
running regexes using any of the following regex engines: PCRE1, PCRE2,
Oniguruma, RE2, Tcl and of course Rust's own regexes.

0.1.67
======
* [PR #201](https://github.com/rust-lang-nursery/regex/pull/201):
* [PR #201](https://github.com/rust-lang/regex/pull/201):
Fix undefined behavior in the `regex!` compiler plugin macro.
* [PR #205](https://github.com/rust-lang-nursery/regex/pull/205):
* [PR #205](https://github.com/rust-lang/regex/pull/205):
More improvements to DFA performance. Competitive with RE2. See PR for
benchmarks.
* [PR #209](https://github.com/rust-lang-nursery/regex/pull/209):
* [PR #209](https://github.com/rust-lang/regex/pull/209):
Release 0.1.66 was semver incompatible since it required a newer version
of Rust than previous releases. This PR fixes that. (And `0.1.66` was
yanked.)
@@ -110,11 +205,11 @@
complexity. It was replaced with a more limited optimization where, given any
regex of the form `re$`, it will be matched in reverse from the end of the
haystack.
* [PR #202](https://github.com/rust-lang-nursery/regex/pull/202):
* [PR #202](https://github.com/rust-lang/regex/pull/202):
The inner loop of the DFA was heavily optimized to improve cache locality
and reduce the overall number of instructions run on each iteration. This
represents the first use of `unsafe` in `regex` (to elide bounds checks).
* [PR #200](https://github.com/rust-lang-nursery/regex/pull/200):
* [PR #200](https://github.com/rust-lang/regex/pull/200):
Use of the `mempool` crate (which used thread local storage) was replaced
with a faster version of a similar API in @Amanieu's `thread_local` crate.
It should reduce contention when using a regex from multiple threads
@@ -124,5 +219,5 @@
(Includes a comparison with PCRE1's JIT and Oniguruma.)
* A bug where word boundaries weren't being matched correctly in the DFA was
fixed. This only affected use of `bytes::Regex`.
* [#160](https://github.com/rust-lang-nursery/regex/issues/160):
* [#160](https://github.com/rust-lang/regex/issues/160):
`Captures` now has a `Debug` impl.
16 changes: 8 additions & 8 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "regex"
version = "0.1.80" #:version
version = "0.2.0" #:version
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
readme = "README.md"
@@ -16,23 +16,23 @@ finite automata and guarantees linear time matching on all inputs.
# For very fast prefix literal matching.
aho-corasick = "0.5.3"
# For skipping along search text quickly when a leading byte is known.
memchr = "0.1.9"
memchr = "1"
# For managing regex caches quickly across multiple threads.
thread_local = "0.2.4"
thread_local = "0.3.2"
# For parsing regular expressions.
regex-syntax = { path = "regex-syntax", version = "0.3.8" }
regex-syntax = { path = "regex-syntax", version = "0.4.0" }
# For accelerating text search.
simd = { version = "0.1.0", optional = true }
# For compiling UTF-8 decoding into automata.
utf8-ranges = "0.1.3"
utf8-ranges = "1"

[dev-dependencies]
# For examples.
lazy_static = "0.1"
lazy_static = "0.2.2"
# For property based tests.
quickcheck = "0.2"
quickcheck = "0.4.1"
# For generating random test data.
rand = "0.3"
rand = "0.3.15"

[features]
# Enable to use the unstable pattern traits defined in std.
43 changes: 21 additions & 22 deletions README.md
@@ -1,16 +1,15 @@
regex
=====

A Rust library for parsing, compiling, and executing regular expressions.
This particular implementation of regular expressions guarantees execution
in linear time with respect to the size of the regular expression and
search text by using finite automata. In particular, it makes use of both
NFAs and DFAs when matching. Much of the syntax and implementation is inspired
A Rust library for parsing, compiling, and executing regular expressions. Its
syntax is similar to Perl-style regular expressions, but lacks a few features
like look around and backreferences. In exchange, all searches execute in
linear time with respect to the size of the regular expression and search text.
Much of the syntax and implementation is inspired
by [RE2](https://github.com/google/re2).

[![Build Status](https://travis-ci.org/rust-lang-nursery/regex.svg?branch=master)](https://travis-ci.org/rust-lang-nursery/regex)
[![Build status](https://ci.appveyor.com/api/projects/status/22g48bo866qr4u77?svg=true)](https://ci.appveyor.com/project/alexcrichton/regex)
[![Coverage Status](https://coveralls.io/repos/github/rust-lang-nursery/regex/badge.svg?branch=master)](https://coveralls.io/github/rust-lang-nursery/regex?branch=master)
[![Build Status](https://travis-ci.org/rust-lang/regex.svg?branch=master)](https://travis-ci.org/rust-lang/regex)
[![Build status](https://ci.appveyor.com/api/projects/status/github/rust-lang/regex?svg=true)](https://ci.appveyor.com/project/rust-lang-libs/regex)
[![Coverage Status](https://coveralls.io/repos/github/rust-lang/regex/badge.svg?branch=master)](https://coveralls.io/github/rust-lang/regex?branch=master)
[![](http://meritbadge.herokuapp.com/regex)](https://crates.io/crates/regex)

### Documentation
@@ -29,7 +28,7 @@ Add this to your `Cargo.toml`:

```toml
[dependencies]
regex = "0.1"
regex = "0.2"
```

and this to your crate root:
@@ -56,9 +55,9 @@ fn main() {
").unwrap();
let caps = re.captures("2010-03-14").unwrap();

assert_eq!("2010", caps.name("year").unwrap());
assert_eq!("03", caps.name("month").unwrap());
assert_eq!("14", caps.name("day").unwrap());
assert_eq!("2010", caps["year"]);
assert_eq!("03", caps["month"]);
assert_eq!("14", caps["day"]);
}
```

@@ -82,9 +81,9 @@ fn main() {
// because the only way for the regex to match is if all of the
// capture groups match. This is not true in general though!
println!("year: {}, month: {}, day: {}",
caps.at(1).unwrap(),
caps.at(2).unwrap(),
caps.at(3).unwrap());
caps.get(1).unwrap().as_str(),
caps.get(2).unwrap().as_str(),
caps.get(3).unwrap().as_str());
}
}
```
@@ -137,8 +136,8 @@ means the main API can't be used for searching arbitrary bytes.
To match on arbitrary bytes, use the `regex::bytes::Regex` API. The API
is identical to the main API, except that it takes an `&[u8]` to search
on instead of an `&str`. By default, `.` will match any *byte* using
`regex::bytes::Regex`, while `.` will match any encoded Unicode *codepoint*
using the main API.
`regex::bytes::Regex`, while `.` will match any *UTF-8 encoded Unicode scalar
value* using the main API.

This example shows how to find all null-terminated strings in a slice of bytes:

@@ -152,7 +151,7 @@ let text = b"foo\x00bar\x00baz\x00";
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
re.captures_iter(text)
.map(|c| c.name("cstr").unwrap())
.map(|c| c.name("cstr").unwrap().as_bytes())
.collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
```
@@ -211,9 +210,9 @@ fn main() {
let re = regex!(r"(\d{4})-(\d{2})-(\d{2})");
let caps = re.captures("2010-03-14").unwrap();

assert_eq!("2010", caps.at(1).unwrap());
assert_eq!("03", caps.at(2).unwrap());
assert_eq!("14", caps.at(3).unwrap());
assert_eq!("2010", caps[1]);
assert_eq!("03", caps[2]);
assert_eq!("14", caps[3]);
}
```

6 changes: 3 additions & 3 deletions bench/Cargo.toml
@@ -17,9 +17,9 @@ libc = "0.2"
onig = { version = "0.4", optional = true }
libpcre-sys = { version = "0.2", optional = true }
memmap = "0.2"
regex = { version = "0.1", path = "..", features = ["simd-accel"] }
regex_macros = { version = "0.1", path = "../regex_macros", optional = true }
regex-syntax = { version = "0.3", path = "../regex-syntax" }
regex = { version = "0.2.0", path = "..", features = ["simd-accel"] }
regex_macros = { version = "0.2.0", path = "../regex_macros", optional = true }
regex-syntax = { version = "0.4.0", path = "../regex-syntax" }
rustc-serialize = "0.3"

[build-dependencies]