regex 0.2 #310

Merged
merged 27 commits into from
Dec 31, 2016
d44a9f9
Switch bytes::Regex to using Unicode mode by default.
BurntSushi May 2, 2016
ebd26e9
Update Replacer trait for Unicode regexes.
BurntSushi May 2, 2016
f98219b
Remove the is_empty method on Captures.
BurntSushi May 7, 2016
83fce85
Drop the PartialEq and Eq impls on Regex.
BurntSushi May 7, 2016
d12042b
Use correct lifetimes for SubCaptures and SubCapturesNamed types.
BurntSushi May 7, 2016
e1a94bb
Remove Regex::with_size_limit.
BurntSushi May 7, 2016
cfd887d
Remove free is_match function.
BurntSushi May 7, 2016
24f86b0
Rename RegexSplits to Splits.
BurntSushi May 7, 2016
a6722a3
Reorganize capture slot handling, but don't make any public API changes.
BurntSushi May 7, 2016
2632c2f
Rename many of the iterator types.
BurntSushi May 17, 2016
52165d6
Use `Cow` for replacements.
BurntSushi May 17, 2016
2805811
Update the Error type.
BurntSushi May 18, 2016
384e937
find/find_iter now return a Match instead of (usize, usize).
BurntSushi Aug 5, 2016
fab4069
Remove the submatch iterators.
BurntSushi Aug 5, 2016
1f7f5c9
Fix tests.
BurntSushi Aug 5, 2016
403b27a
Switch to more idiomatic builder definition.
BurntSushi Aug 21, 2016
3f1fde5
Rename iterator types to match `std` conventions.
BurntSushi Aug 21, 2016
8ee9262
Changed the name of quote to escape.
Nov 15, 2016
bc06024
Make ASCII classes consistent with other engines.
BurntSushi Dec 30, 2016
dd120a9
Require escaping of [, &, - and ~ in classes.
BurntSushi Dec 30, 2016
374f139
Add SubCaptureMatches iterator on Captures.
BurntSushi Dec 30, 2016
c4faddf
Remove custom extend_from_slice implementation.
BurntSushi Dec 30, 2016
66c6ddf
Fix performance bug with Match.
BurntSushi Dec 31, 2016
0c59d41
Add RegexSetBuilder.
BurntSushi Dec 31, 2016
63132b5
Documentation updates and clean ups.
BurntSushi Dec 31, 2016
f094d15
Update github links.
BurntSushi Dec 31, 2016
ac3ab6d
Bump versions everywhere and update CHANGELOG.
BurntSushi Dec 30, 2016
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,6 +1,6 @@
language: rust
rust:
- 1.3.0
- 1.12.0
- stable
- beta
- nightly
143 changes: 119 additions & 24 deletions CHANGELOG.md
@@ -1,6 +1,101 @@
0.2.0
=====
This is a new major release of the regex crate, and is an implementation of the
[regex 1.0 RFC](https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md).
We are releasing a `0.2` first, and if there are no major problems, we will
release a `1.0` shortly. For `0.2`, the minimum *supported* Rust version is
1.12.

There are a number of **breaking changes** in `0.2`. They are split into two
types. The first type corresponds to breaking changes in regular expression
syntax. The second type corresponds to breaking changes in the API.

Breaking changes for regex syntax:

* POSIX character classes now require double bracketing. Previously, the regex
`[:upper:]` would parse as the `upper` POSIX character class. Now it parses
as the character class containing the characters `:upper:`. The fix to this
change is to use `[[:upper:]]` instead. Note that variants like
`[[:upper:][:blank:]]` continue to work.
* The character `[` must always be escaped inside a character class.
* The characters `&`, `-` and `~` must be escaped if any one of them is
repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all
equivalent while `[&&]` is illegal. (The motivation for this and the prior
change is to provide a backwards compatible path for adding character class
set notation.)
* A `bytes::Regex` now has Unicode mode enabled by default (like the main
`Regex` type). This means regexes compiled with `bytes::Regex::new` that
don't have the Unicode flag set should add `(?-u)` to recover the original
behavior.

Breaking changes for the regex API:

* `find` and `find_iter` now **return `Match` values instead of
`(usize, usize)`.** `Match` values have `start` and `end` methods, which
return the match offsets. `Match` values also have an `as_str` method,
which returns the text of the match itself.
* The `Captures` type now only provides a single iterator over all capturing
matches, which should replace uses of `iter` and `iter_pos`. Uses of
`iter_named` should use the `capture_names` method on `Regex`.
* The `at` method on the `Captures` type has been renamed to `get`, and it
now returns a `Match`. Similarly, the `name` method on `Captures` now returns
a `Match`.
* The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant
is returned when no replacements are made.
* The `Replacer` trait has been completely overhauled. This should only
impact clients that implement this trait explicitly. Standard uses of
the `replace` methods should continue to work unchanged. If you implement
the `Replacer` trait, please consult the new documentation.
* The `quote` free function has been renamed to `escape`.
* The `Regex::with_size_limit` method has been removed. It is replaced by
`RegexBuilder::size_limit`.
* The `RegexBuilder` type has switched from owned `self` method receivers to
`&mut self` method receivers. Most uses will continue to work unchanged, but
some code may require naming an intermediate variable to hold the builder.
* The free `is_match` function has been removed. It is replaced by compiling
a `Regex` and calling its `is_match` method.
* The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied
on these impls, the fix is to define a wrapper type around `Regex`, impl
`Deref` on it and provide the necessary impls.
* The `is_empty` method on `Captures` has been removed. It always returned
`false`, so its use was superfluous.
* The `Syntax` variant of the `Error` type now contains a string instead of
a `regex_syntax::Error`. If you were examining syntax errors more closely,
you'll need to explicitly use the `regex_syntax` crate to re-parse the regex.
* The `InvalidSet` variant of the `Error` type has been removed since it is
no longer used.
* Most of the iterator types have been renamed to match conventions. If you
were using these iterator types explicitly, please consult the documentation
for their new names. For example, `RegexSplits` has been renamed to `Split`.
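
The `&mut self` receiver change is easiest to see with a toy builder. The sketch below uses hypothetical `Config` and `ConfigBuilder` types (stand-ins written for this example, not the crate's own definitions, and the default limit is arbitrary) to show why a one-statement chain keeps working while code that holds on to the builder needs a named variable.

```rust
// Hypothetical stand-ins; these are NOT the regex crate's types.
#[derive(Clone, Debug, PartialEq)]
struct Config {
    pattern: String,
    size_limit: usize,
    case_insensitive: bool,
}

struct ConfigBuilder {
    config: Config,
}

impl ConfigBuilder {
    fn new(pattern: &str) -> ConfigBuilder {
        ConfigBuilder {
            config: Config {
                pattern: pattern.to_string(),
                size_limit: 10 * (1 << 20), // arbitrary default for the sketch
                case_insensitive: false,
            },
        }
    }

    // Setters borrow the builder mutably and hand the borrow back for chaining.
    fn size_limit(&mut self, limit: usize) -> &mut ConfigBuilder {
        self.config.size_limit = limit;
        self
    }

    fn case_insensitive(&mut self, yes: bool) -> &mut ConfigBuilder {
        self.config.case_insensitive = yes;
        self
    }

    // Building only needs shared access, so the builder can be reused.
    fn build(&self) -> Config {
        self.config.clone()
    }
}

fn main() {
    // A full chain in one statement works: the temporary builder lives
    // until the end of the statement, and `build` returns an owned value.
    let one_shot = ConfigBuilder::new("foo").size_limit(1024).build();
    assert_eq!(one_shot.size_limit, 1024);
    assert_eq!(one_shot.pattern, "foo");

    // Storing the result of a setter chain and using it later would not
    // compile, because it borrows a temporary that is already dropped:
    //     let b = ConfigBuilder::new("foo").size_limit(1024);
    //     b.build();
    // Naming the builder first avoids the problem.
    let mut builder = ConfigBuilder::new("foo");
    builder.case_insensitive(true);
    let configured = builder.build();
    assert!(configured.case_insensitive);
}
```

With the real crate the same shape would presumably apply to `RegexBuilder`, assuming its setters return a mutable reference for chaining; check the crate documentation before relying on that.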

A number of bugs have been fixed:

* [BUG #151](https://github.com/rust-lang/regex/issues/151):
The `Replacer` trait has been changed to permit the caller to control
allocation.
* [BUG #165](https://github.com/rust-lang/regex/issues/165):
Remove the free `is_match` function.
* [BUG #166](https://github.com/rust-lang/regex/issues/166):
Expose more knobs (available in `0.1`) and remove `with_size_limit`.
* [BUG #168](https://github.com/rust-lang/regex/issues/168):
Iterators produced by `Captures` now have the correct lifetime parameters.
* [BUG #175](https://github.com/rust-lang/regex/issues/175):
Fix a corner case in the parsing of POSIX character classes.
* [BUG #178](https://github.com/rust-lang/regex/issues/178):
Drop the `PartialEq` and `Eq` impls on `Regex`.
* [BUG #179](https://github.com/rust-lang/regex/issues/179):
Remove `is_empty` from `Captures` since it always returns false.
* [BUG #276](https://github.com/rust-lang/regex/issues/276):
Position of named capture can now be retrieved from a `Captures`.
* [BUG #296](https://github.com/rust-lang/regex/issues/296):
Remove winapi/kernel32-sys dependency on UNIX.
* [BUG #307](https://github.com/rust-lang/regex/issues/307):
Fix error on emscripten.
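
For the dropped `PartialEq` and `Eq` impls (BUG #178), the wrapper-type workaround described in the API notes might be sketched as follows. To keep the example free of external dependencies, `Regex` here is a local stand-in that models only the pattern string; `EqRegex` is a hypothetical name, and comparing by pattern text is an assumption about what equality should mean for your code.

```rust
use std::ops::Deref;

// Local stand-in for `regex::Regex` so the sketch compiles on its own.
struct Regex {
    pattern: String,
}

impl Regex {
    fn new(pattern: &str) -> Regex {
        Regex { pattern: pattern.to_string() }
    }

    fn as_str(&self) -> &str {
        &self.pattern
    }
}

// Hypothetical wrapper that supplies the equality impls itself.
struct EqRegex(Regex);

impl Deref for EqRegex {
    type Target = Regex;

    fn deref(&self) -> &Regex {
        &self.0
    }
}

impl PartialEq for EqRegex {
    // Assumption: two regexes are "equal" when their pattern text is equal.
    fn eq(&self, other: &EqRegex) -> bool {
        self.as_str() == other.as_str()
    }
}

impl Eq for EqRegex {}

fn main() {
    let a = EqRegex(Regex::new(r"\d+"));
    let b = EqRegex(Regex::new(r"\d+"));
    let c = EqRegex(Regex::new(r"\w+"));

    assert!(a == b);
    assert!(a != c);
    // `Deref` lets callers use the inner type's methods directly.
    assert_eq!(a.as_str(), r"\d+");
}
```

Swapping the stand-in for the real `Regex` type should leave the wrapper unchanged, since only `as_str` is used.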


0.1.80
======
* [PR #292](https://github.com/rust-lang-nursery/regex/pull/292):
* [PR #292](https://github.com/rust-lang/regex/pull/292):
Fixes bug #291, which was introduced by PR #290.

0.1.79
@@ -9,13 +104,13 @@

0.1.78
======
* [PR #290](https://github.com/rust-lang-nursery/regex/pull/290):
* [PR #290](https://github.com/rust-lang/regex/pull/290):
Fixes bug #289, which caused some regexes with a certain combination
of literals to match incorrectly.

0.1.77
======
* [PR #281](https://github.com/rust-lang-nursery/regex/pull/281):
* [PR #281](https://github.com/rust-lang/regex/pull/281):
Fixes bug #280 by disabling all literal optimizations when a pattern
is partially anchored.

@@ -25,9 +120,9 @@

0.1.75
======
* [PR #275](https://github.com/rust-lang-nursery/regex/pull/275):
* [PR #275](https://github.com/rust-lang/regex/pull/275):
Improves match verification performance in the Teddy SIMD searcher.
* [PR #278](https://github.com/rust-lang-nursery/regex/pull/278):
* [PR #278](https://github.com/rust-lang/regex/pull/278):
Replaces slow substring loop in the Teddy SIMD searcher with Aho-Corasick.
* Implemented DoubleEndedIterator on regex set match iterators.

@@ -36,7 +131,7 @@
* Release regex-syntax 0.3.5 with a minor bug fix.
* Fix bug #272.
* Fix bug #277.
* [PR #270](https://github.com/rust-lang-nursery/regex/pull/270):
* [PR #270](https://github.com/rust-lang/regex/pull/270):
Fixes bugs #264, #268 and an unreported bug where the DFA cache size could be
drastically underestimated in some cases (leading to unexpectedly high memory
usage).
@@ -48,55 +143,55 @@

0.1.72
======
* [PR #262](https://github.com/rust-lang-nursery/regex/pull/262):
* [PR #262](https://github.com/rust-lang/regex/pull/262):
Fixes a number of small bugs caught by fuzz testing (AFL).

0.1.71
======
* [PR #236](https://github.com/rust-lang-nursery/regex/pull/236):
* [PR #236](https://github.com/rust-lang/regex/pull/236):
Fix a bug in how suffix literals were extracted, which could lead
to invalid match behavior in some cases.

0.1.70
======
* [PR #231](https://github.com/rust-lang-nursery/regex/pull/231):
* [PR #231](https://github.com/rust-lang/regex/pull/231):
Add SIMD accelerated multiple pattern search.
* [PR #228](https://github.com/rust-lang-nursery/regex/pull/228):
* [PR #228](https://github.com/rust-lang/regex/pull/228):
Reintroduce the reverse suffix literal optimization.
* [PR #226](https://github.com/rust-lang-nursery/regex/pull/226):
* [PR #226](https://github.com/rust-lang/regex/pull/226):
Implements NFA state compression in the lazy DFA.
* [PR #223](https://github.com/rust-lang-nursery/regex/pull/223):
* [PR #223](https://github.com/rust-lang/regex/pull/223):
A fully anchored RegexSet can now short-circuit.

0.1.69
======
* [PR #216](https://github.com/rust-lang-nursery/regex/pull/216):
* [PR #216](https://github.com/rust-lang/regex/pull/216):
Tweak the threshold for running backtracking.
* [PR #217](https://github.com/rust-lang-nursery/regex/pull/217):
* [PR #217](https://github.com/rust-lang/regex/pull/217):
Add upper limit (from the DFA) to capture search (for the NFA).
* [PR #218](https://github.com/rust-lang-nursery/regex/pull/218):
* [PR #218](https://github.com/rust-lang/regex/pull/218):
Add rure, a C API.

0.1.68
======
* [PR #210](https://github.com/rust-lang-nursery/regex/pull/210):
* [PR #210](https://github.com/rust-lang/regex/pull/210):
Fixed a performance bug in `bytes::Regex::replace` where `extend` was used
instead of `extend_from_slice`.
* [PR #211](https://github.com/rust-lang-nursery/regex/pull/211):
* [PR #211](https://github.com/rust-lang/regex/pull/211):
Fixed a bug in the handling of word boundaries in the DFA.
* [PR #213](https://github.com/rust-lang-nursery/regex/pull/213):
* [PR #213](https://github.com/rust-lang/regex/pull/213):
Added RE2 and Tcl to the benchmark harness. Also added a CLI utility for
running regexes using any of the following regex engines: PCRE1, PCRE2,
Oniguruma, RE2, Tcl and of course Rust's own regexes.

0.1.67
======
* [PR #201](https://github.com/rust-lang-nursery/regex/pull/201):
* [PR #201](https://github.com/rust-lang/regex/pull/201):
Fix undefined behavior in the `regex!` compiler plugin macro.
* [PR #205](https://github.com/rust-lang-nursery/regex/pull/205):
* [PR #205](https://github.com/rust-lang/regex/pull/205):
More improvements to DFA performance. Competitive with RE2. See PR for
benchmarks.
* [PR #209](https://github.com/rust-lang-nursery/regex/pull/209):
* [PR #209](https://github.com/rust-lang/regex/pull/209):
Release 0.1.66 was semver incompatible since it required a newer version
of Rust than previous releases. This PR fixes that. (And `0.1.66` was
yanked.)
@@ -110,11 +205,11 @@
complexity. It was replaced with a more limited optimization where, given any
regex of the form `re$`, it will be matched in reverse from the end of the
haystack.
* [PR #202](https://github.com/rust-lang-nursery/regex/pull/202):
* [PR #202](https://github.com/rust-lang/regex/pull/202):
The inner loop of the DFA was heavily optimized to improve cache locality
and reduce the overall number of instructions run on each iteration. This
represents the first use of `unsafe` in `regex` (to elide bounds checks).
* [PR #200](https://github.com/rust-lang-nursery/regex/pull/200):
* [PR #200](https://github.com/rust-lang/regex/pull/200):
Use of the `mempool` crate (which used thread local storage) was replaced
with a faster version of a similar API in @Amanieu's `thread_local` crate.
It should reduce contention when using a regex from multiple threads
@@ -124,5 +219,5 @@
(Includes a comparison with PCRE1's JIT and Oniguruma.)
* A bug where word boundaries weren't being matched correctly in the DFA was
fixed. This only affected use of `bytes::Regex`.
* [#160](https://github.com/rust-lang-nursery/regex/issues/160):
* [#160](https://github.com/rust-lang/regex/issues/160):
`Captures` now has a `Debug` impl.
16 changes: 8 additions & 8 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "regex"
version = "0.1.80" #:version
version = "0.2.0" #:version
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
readme = "README.md"
@@ -16,23 +16,23 @@ finite automata and guarantees linear time matching on all inputs.
# For very fast prefix literal matching.
aho-corasick = "0.5.3"
# For skipping along search text quickly when a leading byte is known.
memchr = "0.1.9"
memchr = "1"
# For managing regex caches quickly across multiple threads.
thread_local = "0.2.4"
thread_local = "0.3.2"
# For parsing regular expressions.
regex-syntax = { path = "regex-syntax", version = "0.3.8" }
regex-syntax = { path = "regex-syntax", version = "0.4.0" }
# For accelerating text search.
simd = { version = "0.1.0", optional = true }
# For compiling UTF-8 decoding into automata.
utf8-ranges = "0.1.3"
utf8-ranges = "1"

[dev-dependencies]
# For examples.
lazy_static = "0.1"
lazy_static = "0.2.2"
# For property based tests.
quickcheck = "0.2"
quickcheck = "0.4.1"
# For generating random test data.
rand = "0.3"
rand = "0.3.15"

[features]
# Enable to use the unstable pattern traits defined in std.
43 changes: 21 additions & 22 deletions README.md
@@ -1,16 +1,15 @@
regex
=====

A Rust library for parsing, compiling, and executing regular expressions.
This particular implementation of regular expressions guarantees execution
in linear time with respect to the size of the regular expression and
search text by using finite automata. In particular, it makes use of both
NFAs and DFAs when matching. Much of the syntax and implementation is inspired
A Rust library for parsing, compiling, and executing regular expressions. Its
syntax is similar to Perl-style regular expressions, but lacks a few features
like look around and backreferences. In exchange, all searches execute in
linear time with respect to the size of the regular expression and search text.
Much of the syntax and implementation is inspired
by [RE2](https://github.com/google/re2).

[![Build Status](https://travis-ci.org/rust-lang-nursery/regex.svg?branch=master)](https://travis-ci.org/rust-lang-nursery/regex)
[![Build status](https://ci.appveyor.com/api/projects/status/22g48bo866qr4u77?svg=true)](https://ci.appveyor.com/project/alexcrichton/regex)
[![Coverage Status](https://coveralls.io/repos/github/rust-lang-nursery/regex/badge.svg?branch=master)](https://coveralls.io/github/rust-lang-nursery/regex?branch=master)
[![Build Status](https://travis-ci.org/rust-lang/regex.svg?branch=master)](https://travis-ci.org/rust-lang/regex)
[![Build status](https://ci.appveyor.com/api/projects/status/github/rust-lang/regex?svg=true)](https://ci.appveyor.com/project/rust-lang-libs/regex)
[![Coverage Status](https://coveralls.io/repos/github/rust-lang/regex/badge.svg?branch=master)](https://coveralls.io/github/rust-lang/regex?branch=master)
[![](http://meritbadge.herokuapp.com/regex)](https://crates.io/crates/regex)

### Documentation
@@ -29,7 +28,7 @@ Add this to your `Cargo.toml`:

```toml
[dependencies]
regex = "0.1"
regex = "0.2"
```

and this to your crate root:
@@ -56,9 +55,9 @@ fn main() {
").unwrap();
let caps = re.captures("2010-03-14").unwrap();

assert_eq!("2010", caps.name("year").unwrap());
assert_eq!("03", caps.name("month").unwrap());
assert_eq!("14", caps.name("day").unwrap());
assert_eq!("2010", caps["year"]);
assert_eq!("03", caps["month"]);
assert_eq!("14", caps["day"]);
}
```

@@ -82,9 +81,9 @@ fn main() {
// because the only way for the regex to match is if all of the
// capture groups match. This is not true in general though!
println!("year: {}, month: {}, day: {}",
caps.at(1).unwrap(),
caps.at(2).unwrap(),
caps.at(3).unwrap());
caps.get(1).unwrap().as_str(),
caps.get(2).unwrap().as_str(),
caps.get(3).unwrap().as_str());
}
}
```
@@ -137,8 +136,8 @@ means the main API can't be used for searching arbitrary bytes.
To match on arbitrary bytes, use the `regex::bytes::Regex` API. The API
is identical to the main API, except that it takes an `&[u8]` to search
on instead of an `&str`. By default, `.` will match any *byte* using
`regex::bytes::Regex`, while `.` will match any encoded Unicode *codepoint*
using the main API.
`regex::bytes::Regex`, while `.` will match any *UTF-8 encoded Unicode scalar
value* using the main API.
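
The difference between a *byte* and a *UTF-8 encoded Unicode scalar value* can be seen without the regex crate at all. The sketch below uses only the standard library; `byte_units` and `scalar_units` are hypothetical helper names introduced for illustration.

```rust
// Hypothetical helpers: view the same text as bytes or as scalar values.
fn byte_units(text: &str) -> Vec<u8> {
    text.bytes().collect()
}

fn scalar_units(text: &str) -> Vec<char> {
    text.chars().collect()
}

fn main() {
    // U+00E9 (e with acute accent) is one scalar value but two UTF-8 bytes.
    let text = "a\u{e9}";

    assert_eq!(byte_units(text), vec![0x61u8, 0xC3, 0xA9]);
    assert_eq!(scalar_units(text), vec!['a', '\u{e9}']);

    // A byte-oriented matcher would see three units here, while a
    // scalar-value-oriented matcher would see two.
    assert_eq!(byte_units(text).len(), 3);
    assert_eq!(scalar_units(text).len(), 2);
}
```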

This example shows how to find all null-terminated strings in a slice of bytes:

@@ -152,7 +151,7 @@ let text = b"foo\x00bar\x00baz\x00";
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
re.captures_iter(text)
.map(|c| c.name("cstr").unwrap())
.map(|c| c.name("cstr").unwrap().as_bytes())
.collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
```
@@ -211,9 +210,9 @@ fn main() {
let re = regex!(r"(\d{4})-(\d{2})-(\d{2})");
let caps = re.captures("2010-03-14").unwrap();

assert_eq!("2010", caps.at(1).unwrap());
assert_eq!("03", caps.at(2).unwrap());
assert_eq!("14", caps.at(3).unwrap());
assert_eq!("2010", caps[1]);
assert_eq!("03", caps[2]);
assert_eq!("14", caps[3]);
}
```

6 changes: 3 additions & 3 deletions bench/Cargo.toml
@@ -17,9 +17,9 @@ libc = "0.2"
onig = { version = "0.4", optional = true }
libpcre-sys = { version = "0.2", optional = true }
memmap = "0.2"
regex = { version = "0.1", path = "..", features = ["simd-accel"] }
regex_macros = { version = "0.1", path = "../regex_macros", optional = true }
regex-syntax = { version = "0.3", path = "../regex-syntax" }
regex = { version = "0.2.0", path = "..", features = ["simd-accel"] }
regex_macros = { version = "0.2.0", path = "../regex_macros", optional = true }
regex-syntax = { version = "0.4.0", path = "../regex-syntax" }
rustc-serialize = "0.3"

[build-dependencies]