implement matching on byte strings #85

Closed · vi opened this issue May 19, 2015 · 91 comments

Comments

@vi

vi commented May 19, 2015

Currently, as far as I can see, regexes here are only for Unicode text. But they can be used to parse binary files as well (or to parse a mixture of binary and text).

For example, how can one implement strings --all using regexes in Rust?

@BurntSushi
Member

This is an interesting question. When building regex, I never even considered the possibility of running a regex on something that wasn't Unicode. If you're matching against, say, &[u8], then how are Unicode structures inside the regex interpreted? For example, \pN or even \w (which is Unicode aware).

For example, how can one implement strings --all using regexes in Rust?

Would a lossy conversion to a Unicode string work? Seems like it should.

@vi
Author

vi commented May 19, 2015

In Python, when matching over a byte array, the regex itself is also a byte array (there is a nice b"123" syntax for it).

When matching over a byte buffer, I may want to capture some binary blob delimited by text, as in BEGIN_BINARY_BLOB\n(.*)\nEND_BINARY_BLOB.
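With the regex::bytes API that this issue eventually produced, a minimal sketch of that capture might look like this (the delimiters are from the example above; the flags are assumptions about the intent):

```rust
use regex::bytes::Regex;

fn main() {
    // (?s) lets `.` cross newlines inside the blob; (?-u) makes `.` match
    // raw bytes rather than UTF-8 encodings of codepoints.
    let re = Regex::new(r"(?s-u)BEGIN_BINARY_BLOB\n(.*)\nEND_BINARY_BLOB").unwrap();
    let data: &[u8] = b"header\nBEGIN_BINARY_BLOB\n\x00\xff\x13\nEND_BINARY_BLOB\ntrailer";
    let caps = re.captures(data).unwrap();
    assert_eq!(&caps[1], b"\x00\xff\x13");
}
```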

@BurntSushi
Member

I don't follow. Sorry. I'm not talking about the encoding of the regex, I'm talking about the definition of its Unicode aware character classes. For example see http://doc.rust-lang.org/regex/regex/index.html#perl-character-classes-(unicode-friendly) and http://doc.rust-lang.org/regex/regex/index.html#matching-one-character.

@BurntSushi
Member

I think the best solution to this---if I understand correctly---is to do lossy Unicode conversion before handing it off to a regex.

If that is for some reason not acceptable, then another way to go about this would be to provide an alternate implementation of Input that restricts characters to single bytes. The problem is: how are non-ASCII single bytes handled? I think the Char type will have to become polymorphic.

It seems doable, but I'm skeptical of whether it's worth it. I'll leave this open for now.

@BurntSushi changed the title from "Shall there also be byte regexes like in Python?" to "implement matching on byte strings" on Jun 19, 2015
@vi
Author

vi commented Jun 23, 2015

Lossy Unicode conversion can break binary data that we may want to extract using the regular expression.

One alternative is to convert from some 8-bit encoding like cp1251, run the regex, then convert the result back to cp1251. This way the binary data should be preserved, but it's hacky and inefficient.

The main idea is that regular expressions are not only for handling text. Without a special, easy syntax for byte buffers like Python's b"123\x00\x00\xFF", they would obviously be less convenient.

@BurntSushi
Member

Rust has byte string literals.

This does seem doable. Inputs to the various matching functions would need to become polymorphic (i.e., accept either a &str or a &[u8]), and input handling within the engines would need to be polymorphic too.
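A hypothetical sketch of that kind of polymorphism (not the crate's actual design) could be a small trait implemented for both subject types:

```rust
// Hypothetical sketch only: a subject trait that both &str and &[u8]
// could satisfy, so one matching function accepts either input type.
trait Subject {
    fn subject_bytes(&self) -> &[u8];
}

impl Subject for str {
    fn subject_bytes(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl Subject for [u8] {
    fn subject_bytes(&self) -> &[u8] {
        self
    }
}

// A toy "engine" written against the trait; a real one would run the
// compiled regex program over the bytes instead.
fn contains_nul<S: Subject + ?Sized>(subject: &S) -> bool {
    subject.subject_bytes().contains(&0)
}

fn main() {
    assert!(!contains_nul("plain text"));
    assert!(contains_nul(b"text\x00with a NUL".as_slice()));
}
```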

@BurntSushi
Member

I've been thinking about this lately, because I'm currently working on converting the matching engines to be byte based instead of character based. This means that, e.g., . matches Unicode scalar values, which means it would be effectively useless on anything other than UTF-8 encoded data. For example, say I wanted to search for .*?myname.*?. If the regex engine claims to be able to run on binary data, then those dots will foil your intention because they won't consume arbitrary bytes---only valid UTF-8 encodings of scalar values. I contend that this is actually a pretty valid use case, because it means a regex can operate on any data that contains ASCII, which is reasonably common in the wild in my experience. (e.g., latin-1.)

"OK, so add a flag that makes . match bytes instead of characters." Well, that's problematic, because the entire existing API is focused around strings, which must be UTF-8 encoded. If a . can successfully match any byte, then the byte boundaries returned are no longer guaranteed to be on UTF-8 boundaries. Whoops.

So this feature requires some serious thought at the API level.
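With the bytes API that this eventually became, the boundary problem is easy to see: a byte-wise `.` reports offsets that need not be valid &str boundaries (a small sketch):

```rust
use regex::bytes::Regex;

fn main() {
    // A byte-wise dot: with (?s-u) it matches any single byte at all.
    let any_byte = Regex::new(r"(?s-u).").unwrap();
    let text = "héllo"; // 'é' is encoded as the two bytes 0xC3 0xA9
    let starts: Vec<usize> = any_byte
        .find_iter(text.as_bytes())
        .map(|m| m.start())
        .collect();
    assert_eq!(starts, vec![0, 1, 2, 3, 4, 5]);
    // Offset 2 falls inside the encoding of 'é', so it is not a valid
    // &str boundary, which is exactly the problem described above.
    assert!(!text.is_char_boundary(2));
}
```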

@vi
Author

vi commented Oct 19, 2015

Missing byte regexes can invite hacks.

Imagine a sequence: somebody wants to parse apart some piece of data. At first he just uses strings and regexes, and everything works.

But later he adds features/refactors and realizes that this should be a byte buffer, not a string. So the regexes should now sometimes extract a string, but other times extract a byte buffer.

Case 1: byte regexes supported. The user proceeds with the refactoring, using byte regexes. Nicer, more secure.

Case 2: only string regexes supported. The user aborts the refactoring, leaving everything as a string (with some escaping or just latin-1 everywhere). Uglier, less secure.

@BurntSushi
Member

@vi I don't think I'm questioning the utility of byte based regexes. The question isn't, "Why should we want them?" The question is, "How do we do it?" My other comments have described some of the issues.

@vi
Author

vi commented Oct 19, 2015

Wouldn't a parallel set of structs/traits do? Like Java's InputStreamWhatever and ReaderWhatever.

@BurntSushi
Member

@vi That would double the size of today's API.

@flying-sheep
Contributor

i think it would be useful for fast operations on (mainly) ASCII buffers, e.g. ASCII-based protocols and gene sequence data (which is often stored as an ASCII sequence of GTACGTACGATCGATCG...)

byte regexes wouldn’t know anything about unicode and operate bytewise, e.g. . would match any byte (except maybe \n which would be the canonical newline byte like in python bytestrings)

@BurntSushi
Member

@flying-sheep I personally can't recall ever using regexes on sequence data during my tenure as a computational biologist. Regardless, that use case is supported today, since all valid ASCII encodings are valid UTF-8 encodings.

byte regexes wouldn’t know anything about unicode and operate bytewise, e.g. . would match any byte (except maybe \n which would be the canonical newline byte like in python bytestrings)

That is what I suggested in a comment above. The naive solution (add a new "byte based" Regex type with all the methods on Regex today, except on &[u8] instead of &str) doubles the size of the existing API.

@flying-sheep
Contributor

I personally can't recall ever using regexes on sequence data during my tenure as a computational biologist

incidentally that’s what i’m doing literally right now for a very good reason 😉

if you’re interested: some single cell sequencing protocols have the barcodes responsible for single cell IDs stored in RNA-Seq reads, in ways that read something like “the barcode is the first 5bp, then come 3-6 Gs, followed by the read sequence and an optional poly-A tail”. and i really don’t know a better way to express that kind of flexibility than .{5}G{3,6}.+?A*
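a sketch of that pattern against the byte-oriented API this issue eventually produced (the anchors and capture groups are additions, and the read below is made up):

```rust
use regex::bytes::Regex;

fn main() {
    // barcode = first 5 bases, then 3-6 Gs, then the read, then an
    // optional poly-A tail; (?-u) keeps everything strictly byte-wise.
    let re = Regex::new(r"(?-u)^(.{5})G{3,6}(.+?)A*$").unwrap();
    let read: &[u8] = b"ACGTTGGGGCATTAGGCAAAA";
    let caps = re.captures(read).unwrap();
    assert_eq!(&caps[1], b"ACGTT");
    assert_eq!(&caps[2], b"CATTAGGC");
}
```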

since all valid ASCII encodings are valid UTF-8 encodings.

obviously, but i meant being able to ignore non-decodable bytes and indeed never going through a decoding step before applying regexes.

The naive solution (add a new "byte based" Regex type with all the methods on Regex today, except on &[u8] instead of &str) doubles the size of the existing API.

yup, sadly. but apart from facilitating that by using a code generator or macro magic i don’t see a better way. we all know strings aren’t as similar to byte sequences as most people like to think, but regexes make sense for both.

@BurntSushi
Member

obviously, but i meant being able to ignore non-decodable bytes and indeed never going through a decoding step before applying regexes.

Right, but does sequence data contain non-decodable bytes? (I would consider avoiding the decoding step for performance reasons to be a bit of a niche use case.)

yup, sadly. but apart from facilitating that by using a code generator or macro magic i don’t see a better way. we all know strings aren’t as similar to byte sequences as most people like to think, but regexes make sense for both.

I am just extremely hesitant to make any decisions whose result is "double the size of the API." That is an enormous cost IMO. I'm not concerned at all about the code duplication, as I'm sure there are plenty of ways to eliminate it. (The first step is making the matching engines byte based instead of codepoint based, which is required before any progress can be made on this issue.) I'm concerned about the size of the API for complexity's sake. Maybe we could put it in a separate sub-module or something.

@flying-sheep
Contributor

Right, but does sequence data contain non-decodable bytes?

no, that would be the other use case of ASCII protocols

(I would consider avoiding the decoding step for performance reasons to be a bit of a niche use case.)

things here have to be !!fast sometimes. not decoding seems like a useful step.

@BurntSushi
Member

I'm not saying it isn't a valid use case. I know I certainly need it sometimes. But use cases have to be weighed with their costs, which is all I'm trying to do.

Once again, to be clear, I really do believe supporting byte-based regexes is something we should pursue. But:

  1. The implementation to support it isn't there yet.
  2. The API needs some serious thought so that we don't blow our complexity budget.

@BurntSushi
Member

@flying-sheep Just wanted to clarify: addressing your use case of "I have ASCII text and I don't want to pay for Unicode" can be done without supporting byte based regexes in the API (with respect to CPU time). If the automata are byte based instead of codepoint based, then UTF-8 decoding is built into the automaton. If all you have is ASCII text, then the states that decode other codepoints are simply never reached. Their only cost would be memory usage (and taking up space in cache lines).

@vi
Author

vi commented Nov 3, 2015

Maybe byte regexes could be a separate crate, like binregex? Globally, the API would still be doubled, but projects not using byte regexes should not feel it.

@BurntSushi
Member

@vi That is unfortunately tricky to do. The actual implementation of byte based regexes and Unicode regexes should be the same. It's only the API that's different and maybe one or two steps in the compiler (e.g., . is any byte instead of any codepoint). Putting the byte based API into a separate crate means exposing enough of the internals of regex to make that possible. A sub-module I guess would be better.

@vi
Author

vi commented Nov 3, 2015

What about completely independent implementations? Obvious drawback: major duplication of code, increased resource usage when both byte and string regexes are in use...

@BurntSushi
Member

@vi I can't imagine why anyone would subject themselves to the pain of maintaining such a thing.

@vi
Author

vi commented Nov 3, 2015

Code-generating robot doesn't feel pain.

@birkenfeld

Since #146 is scheduled for 1.0, could this be considered for the same milestone?

@BurntSushi
Member

@birkenfeld There is unfortunately significant work between #146 and this issue. #146 makes it possible to run any of the matching engines on raw bytes, but this issue requires at least these additional things:

  1. Some way to signal that . matches any byte as opposed to any Unicode scalar value. In the library today, this is always a Unicode scalar value. This has to be handled with care, because as soon as this semantic is added, the boundaries returned are no longer guaranteed to correspond to a valid UTF-8 sequence boundary.
  2. Given (1), how does one express "match any Unicode scalar value" if . is already "match any byte"? It seems like it would be useful to have both. A flag? A distinct syntax for it? If it's a flag, then that flag must be explicitly rejected (or ignored) by the String based regexes.
  3. Craft a public API that supports byte based matching. Is it a separate crate? Is it a sub-module? The latter seems like significantly less work, since the former would require splitting the internals of regex out into its own crate. (I am somewhat strongly opposed to putting a byte-based regex into the top-level API of this crate, if only because it's a niche use case that will otherwise clutter up the docs.)

@birkenfeld

No pressure. Just thought I'd ask :)

@jneem
Contributor

jneem commented Feb 25, 2016

I would argue that it's perfectly fine not to support a niche feature if it makes your public API worse. Not that it's the end of the world, of course, but I would feel much better about it if there was at least one person who had a specific application for the feature...

@vi
Author

vi commented Feb 25, 2016

CompoundExpr feels like overengineering. Byte-default, but with (?u: ... ), maybe also with an additional method to retrieve &str instead of (or in addition to) &[u8] from groups inside (?u: ... ), feels reasonable.

For example, searching for (?u:(.*)) should extract valid UTF-8 snippets from mixed text/binary data.
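A rough sketch of how that could look with the bytes sub-module that eventually landed (the haystack here is made up):

```rust
use regex::bytes::Regex;

fn main() {
    // Byte-oriented prefix, then a Unicode-mode capture: the (?u:...)
    // group can only ever match valid UTF-8.
    let re = Regex::new(r"(?-u:\x7B\xA9.)(?u:(.*))").unwrap();
    let data: &[u8] = b"\x12\x7b\xa9\x85Sport title \xe2\x9c\x93 rest";
    let caps = re.captures(data).unwrap();
    // Because the group is Unicode-aware, this conversion cannot fail.
    let title = std::str::from_utf8(&caps[1]).unwrap();
    assert_eq!(title, "Sport title \u{2713} rest");
}
```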

@BurntSushi
Member

Yes, it'd be quite useful if, for example, you were writing a tool to search in arbitrary data whose encoding may not be known (e.g., files on a file system). Imagine being able to use, for example, \p{Lu} with grep. It won't always work of course, but there's a lot of UTF-8 out there. :-)

@birkenfeld

It won't always work of course, but there's a lot of UTF-8 out there. :-)

I agree.

@jneem
Contributor

jneem commented Feb 25, 2016

If you're reading files on a file system then each file is probably encoded consistently, even if encodings might differ between files. So then you could just use a UTF-8 regex (provided that its API were extended to allow searching in a &[u8]). The only reason I can think of for having a mixed regex is if you're searching in some file that has embedded UTF-8 but also some non-ASCII binary structure that you need to match. While such things could exist in principle, I've never needed to run a regex on one.

@BurntSushi
Member

HTML documents are a perfect example of common documents that aren't always encoded correctly. Everyone loves running regexes on HTML. :-)

The only reason I can think of for having a mixed regex is if you're searching in some file that has embedded UTF-8 but also some non-ASCII binary structure that you need to match.

It's not necessarily just for when you need to match non-ASCII binary structure, it's necessary for matching UTF-8 among possibly non-UTF-8 bytes. The whole problem with a UTF-8 regex today is that it will immediately stall at the very first invalid byte it sees. You can't utter a regex that will let you skip over it. For example, when one writes [^a] in a UTF-8 regex today, that doesn't match "any byte except a," it matches "any UTF-8 encoding of a codepoint that is not a."
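With the byte-based support that eventually landed, the distinction is observable directly (a small sketch):

```rust
use regex::bytes::Regex;

fn main() {
    let haystack: &[u8] = b"\xFFa"; // 0xFF can never appear in valid UTF-8
    // Unicode class: "any codepoint that is not 'a'". It cannot match the
    // stray 0xFF, and 'a' is excluded, so there is no match at all.
    let unicode = Regex::new(r"(?u)[^a]").unwrap();
    assert!(!unicode.is_match(haystack));
    // Byte class: "any byte that is not 'a'", so it matches the 0xFF.
    let bytewise = Regex::new(r"(?-u)[^a]").unwrap();
    assert_eq!(bytewise.find(haystack).unwrap().start(), 0);
}
```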

@jneem
Contributor

jneem commented Feb 25, 2016

It's not necessarily just for when you need to match non-ASCII binary structure, it's necessary for matching UTF-8 among possibly non-UTF-8 bytes.

My regex_dfa crate is capable of finding a UTF-8 encoded match within a text that is not fully UTF-8 (although not using the public API, which takes a &str). That part's an implementation detail of the regex engine, not of the regex grammar.

I still think that this mixed-mode regex is only useful if you actually want to find matches that span both UTF-8 and non-UTF-8 regions. I still can't imagine writing such a regex (even if I had a HTML file that I knew was a mishmash of different encodings). But since it seems that everyone but me thinks this is a useful feature, I'm going to bow out of the dispute now...

@vi
Author

vi commented Feb 25, 2016

Shall I try to think up some examples where I might need a mixed regex?

@BurntSushi
Member

@vi Examples are always useful. I think @jneem's main issue is that this feature complicates the AST, but I think any support of this at all (even without UTF-8 mixed matching) complicates the AST. It'd be nicer to encode the new invariants into the type system, but you end up losing mixed matching and the resulting AST isn't exactly nice either. :-/

@BurntSushi
Member

@jneem also suggested building this directly into the matching engines, which is possible, but you still need to be able to support matching arbitrary bytes in the AST, which means using u8 in a few places instead of char and making sure things like \xFF correspond to the byte \xFF instead of the codepoint \xFF (which is represented in UTF-8 with two bytes).

@BurntSushi
Member

I guess some of that stuff could be pushed off to the compiler if we didn't support mixed mode matching, but then you need to make sure to deal with the Perl classes (\s, \w and \d) correctly, which are translated to their character class forms in the parser. Blech.

@vi
Author

vi commented Feb 25, 2016

In any case, the public API should be designed in a way that adding mixed mode later is not an incompatible change, so the engine may be changed under the hood.

It could even be a documented but not yet implemented "teaser" API to gather votes for the feature request.

@vi
Author

vi commented Feb 25, 2016

Example 1

Hacky extraction of title from Matroska files

perl -ne 'if (/\x7B\xA9([\x80-\xFE]|[\x40-\x7F].)(.+)/u) { print "$2\n"; }' somefile.mkv

This example lacks mixed regexes, so it outputs binary garbage after the title. Switching on Unicode mode causes it to fail with "Malformed UTF-8 character".

let mkvsnippet = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\x53\x70\x6f\x72\x74\x4d\x80\x98\x54\x76\x68\x65";
let re = BinaryRegex::new(r"\x7B\xA9([\x80-\xFE]|[\x40-\xFF].)(?u:(.*))");
let caps = re.captures(mkvsnippet);
let encoded_title_length: Vec<u8> = caps.at(1);
let approximate_title: String = caps.at_str(2);

Example 2

Peeking at /proc/$SOMEPID/environ or /proc/$SOMEPID/cmdline and extracting some string from it. The files are binary (NUL-separated), yet can contain UTF-8 text. I don't remember an easy example where UTF-8-encoded data gets passed through the environment... Maybe CGI?

Example 3

A simple /usr/bin/strings analogue, but for extracting UTF-8 snippets from an arbitrary binary file.

Something like r"(?u:(\p{Assigned}{8,}))".
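A sketch of that strings analogue against the bytes API that eventually landed (the haystack is made up):

```rust
use regex::bytes::Regex;

fn main() {
    // Runs of 8+ assigned codepoints (hence valid UTF-8) embedded in
    // otherwise binary data, like a UTF-8-aware `strings`.
    let re = Regex::new(r"(?u)\p{Assigned}{8,}").unwrap();
    let blob: &[u8] = b"\x00\x01\x7f\xfehello, \xd0\xbc\xd0\xb8\xd1\x80!\xff\x00short\x00";
    for m in re.find_iter(blob) {
        // Matches are guaranteed to be valid UTF-8, since only the
        // Unicode-aware class could have produced them.
        println!("{}", std::str::from_utf8(m.as_bytes()).unwrap());
    }
    // Prints: "hello, мир!"
}
```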

@BurntSushi
Member

@vi Those are pretty cute, thanks. :-)

BurntSushi added a commit that referenced this issue Mar 10, 2016
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)

Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
take over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.

Closes #85.
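In recent releases of the crate, Unicode is enabled by default in the bytes API as well, but with explicit flags the split described above looks like this from the caller's side (a minimal sketch):

```rust
use regex::bytes;

fn main() {
    // In the bytes API, `.` can be told to match literally any byte...
    let any_byte = bytes::Regex::new(r"(?s-u).").unwrap();
    assert!(any_byte.is_match(b"\xFF"));
    // ...while the top-level &str API rejects the same pattern, because a
    // match could then split or cover invalid UTF-8.
    assert!(regex::Regex::new(r"(?s-u).").is_err());
}
```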
@Alphare

Alphare commented Dec 5, 2019

Very sorry to necropost, but I can't seem to find an issue or discussion about the regex pattern itself being a &[u8] instead of a &str.

For context, I am working on the Mercurial Rust implementation of .hgignore.

These regexes are user-defined, written in any arbitrary encoding (though ASCII-compatible for the special characters, otherwise I don't know how that could possibly work) to match against filenames and paths that follow the same rule.
We already have an HgPath type (and its owned variant HgPathBuf) to handle filenames and paths in an encoding- and platform-agnostic fashion, something that is not possible with OsStr, but I am struggling to see how I can use the bytes module of regex in that case, since Regex::new takes a &str, which must be valid UTF-8.

I hope I'm wrong or that the docs just need to be clearer, hehe.
Thanks in advance.

@BurntSushi
Member

@Alphare Thanks for the question. In the future, I'd prefer a new issue for questions like this. I don't mind bumping old PRs/issues per se, but your question is orthogonal to this one. That is, "matching on &[u8]" and "&[u8] patterns" are quite different.

There is a somewhat related discussion in #451.

For context, I am working on the Mercurial Rust implementation of .hgignore.

Nice! IIRC, Mercurial's implementation of .hgignore uses Python to execute regexes. There are several incompatibilities between this regex engine and Python's, so it may be quite tricky to preserve backward compatibility here. I don't know to what extent that's a concern, but I figured I'd mention it in case it's a show-stopper.

These regexes are user-defined, written in any arbitrary encoding (though ASCII-compatible for the special characters, otherwise I don't know how that could possibly work) to match against filenames and paths that follow the same rule.

The important aspect of matching on &[u8] is that regexes can match invalid UTF-8. Combine that with the fact that you can write (?-u:\xFF) to refer to the byte \xFF (rather than the codepoint U+00FF), and you should be able to accomplish what you want. All you really need to do is replace all invalid UTF-8 bytes in your pattern with their corresponding hex escape sequence. And I think that should do it.
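A rough sketch of that rewriting step (the helper name here is made up, and it assumes the invalid bytes sit outside character classes):

```rust
use std::fmt::Write;

/// Hypothetical helper: turn an arbitrary byte pattern into a `&str`
/// pattern for `regex::bytes::Regex` by rewriting every invalid UTF-8
/// byte as a `(?-u:\xHH)` escape.
fn escape_pattern_bytes(pattern: &[u8]) -> String {
    let mut out = String::new();
    let mut rest = pattern;
    loop {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                break;
            }
            Err(e) => {
                let (valid, invalid) = rest.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                // Escape the first offending byte and continue after it.
                write!(out, r"(?-u:\x{:02X})", invalid[0]).unwrap();
                rest = &invalid[1..];
            }
        }
    }
    out
}

fn main() {
    // 0xFF can never appear in valid UTF-8, so it gets rewritten.
    let pat = escape_pattern_bytes(b"title=\xFF(.*)");
    assert_eq!(pat, r"title=(?-u:\xFF)(.*)");
    let re = regex::bytes::Regex::new(&pat).unwrap();
    assert!(re.is_match(b"title=\xFFsome title"));
}
```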

@Alphare

Alphare commented Dec 5, 2019

@BurntSushi Thank you for the quick response. I will open a separate issue in the future, sorry about that!

Mercurial uses either RE2 or Python's re module, the latter being a fallback for backtracking expressions and other things that RE2 does not support, both of which do differ from regex, as you said. There is also the issue of regex not supporting empty alternation, maybe concerns over performance that have to be benchmarked, etc.

I think the safest bet for me and Mercurial as a project is to also use RE2 from Rust, and then reconsider using regex further down the road.

Again, I appreciate your help. Maybe I can ping ripgrep when I get a crate with good enough support for hgignore. ;)

@BurntSushi
Member

If you're already using RE2, then regex should be extremely close to that, since RE2 directly inspired this library. If you're worried about corner cases like empty alternations, then I would just do whatever it is you're planning on doing to handle look-around and backreferences (since RE2 does not support those either).

@Alphare

Alphare commented Dec 5, 2019

I would be very interested in a feature comparison between the two engines actually. For now Re2 will be "plug-and-play" and will come with fewer unanswered questions and that will be a lot easier to both write and to get through review.

@BurntSushi
Member

The main difference is that this library is faster, primarily due to its more advanced literal optimizations. As for actual features, the most significant difference is that this library has much better support for Unicode and enables such things by default. e.g., \w is Unicode-aware by default, whereas RE2 is not. But Unicode can be selectively disabled to get ASCII-only versions of classes like \w. This library also supports character class set operations (intersection, difference, symmetric difference) as well as nested character classes.

You can read more about Unicode support here: https://github.com/rust-lang/regex/blob/master/UNICODE.md
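A quick illustration of those points, assuming a recent release of the crate:

```rust
use regex::Regex;

fn main() {
    // \w is Unicode-aware by default...
    assert!(Regex::new(r"^\w+$").unwrap().is_match("привет"));
    // ...and can be narrowed back to ASCII per pattern.
    assert!(!Regex::new(r"(?-u)^\w+$").unwrap().is_match("привет"));
    // Class set operations: difference (digits with 4 removed)...
    let no_four = Regex::new(r"^[0-9--4]+$").unwrap();
    assert!(no_four.is_match("0123"));
    assert!(!no_four.is_match("0124"));
    // ...and intersection of two nested classes.
    let greek_letters = Regex::new(r"^[\p{Greek}&&\p{L}]+$").unwrap();
    assert!(greek_letters.is_match("αβγ"));
}
```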

Otherwise, the two libraries are pretty much identical in terms of semantics.

It's still not clear to me how you propose to address backtracking and backreferences...

For now Re2 will be "plug-and-play" and will come with fewer unanswered questions and that will be a lot easier to both write and to get through review.

Was this always your plan? Otherwise, it seems strange to reverse course based on my comments here.

@Alphare

Alphare commented Dec 6, 2019

The main difference is that this library is faster, primarily due to its more advanced literal optimizations.

Literal optimization as in a string literal in source code, or is it something completely different?

Otherwise, the two libraries are pretty much identical in terms of semantics.

I see. I wonder how easy it would be to "fuzz out" all the differences between the two, aside from the obvious ones you cited.

It's still not clear to me how you propose to address backtracking and backreferences...

For now, the plan is to fall back to Python (barring a potentially incredible slowdown) for unsupported patterns, while maybe giving the user a warning.

For now Re2 will be "plug-and-play" and will come with fewer unanswered questions and that will be a lot easier to both write and to get through review.

Was this always your plan? Otherwise, it seems strange to reverse course based on my comments here.

The plan among the people discussing this issue has so far been to use RE2 because it's already being used in the Python code and on huge repositories where edge cases would have been found.
It's also what the first Rust POC used: at that time, the lack of support for empty alternations and the escaping needed to work on bytes was additional work that was deemed not worth it, as some benchmarking on a big repo showed that RE2 was faster. I believe the data for those benchmarks was lost, and the methodology may not be too relevant anymore because of different patterns used, etc.

The only reason I opened this issue was to make sure that there could be a way of using regex after the initial version with RE2, and to size up how much work it would be, because I would prefer using a Rust-native engine to a C++ binding.

I hope that makes sense.

@BurntSushi
Member

Literal optimization as in a string literal in source code, or is it something completely different?

I don't think source code has anything to do with it. e.g., In the regex (foo|bar|quux)+, there exists an alternation of literals foo, bar and quux. In this regex library, it will do an analysis to detect those literals and accelerate the search, using SIMD where possible. RE2 doesn't do this, other than in the very simple case where every match starts with the same byte.

as some benchmarking on a big repo showed that RE2 was faster

Interesting. If y'all ever do benchmarking again, I'd be curious to know more details.

I hope that makes sense.

Yup, thanks!

@Alphare

Alphare commented Dec 6, 2019

I don't think source code has anything to do with it. e.g., In the regex (foo|bar|quux)+, there exists an alternation of literals foo, bar and quux. In this regex library, it will do an analysis to detect those literals and accelerate the search, using SIMD where possible. RE2 doesn't do this, other than in the very simple case where every match starts with the same byte.

Very cool indeed.

Interesting. If y'all ever do benchmarking again, I'd be curious to know more details.

I will let you know when we get there; it should be interesting. If the methodology is good enough, you could even use it in your docs.
