implement matching on byte strings #85
Comments
This is an interesting question. When building
Would a lossy conversion to a Unicode string work? Seems like it should. |
In Python, when matching over a byte array the regex itself is also a byte array (there is a nice `b"..."` literal syntax for that). When matching over a byte buffer, I may want to capture some binary blob delimited by text. |
I don't follow. Sorry. I'm not talking about the encoding of the regex, I'm talking about the definition of its Unicode aware character classes. For example see http://doc.rust-lang.org/regex/regex/index.html#perl-character-classes-(unicode-friendly) and http://doc.rust-lang.org/regex/regex/index.html#matching-one-character. |
I think the best solution to this---if I understand correctly---is to do lossy Unicode conversion before handing it off to a regex. If that is for some reason not acceptable, then another way to go about this would be to provide an alternate implementation that works on byte slices. It seems doable, but I'm skeptical of whether it's worth it. I'll leave this open for now. |
Lossy Unicode conversion can break binary data that we may want to extract using the regular expression. One alternative is to convert from some 8-bit encoding like cp1251, run the regex, then convert the result back to cp1251. This way the binary data should be preserved, but it's hacky and not efficient. The main idea is that regular expressions are not only for handling text, even without a special easy syntax for byte buffers like Python's `b"..."`. |
Rust has byte string literals. This does seem doable. Inputs to the various matching functions would need to become polymorphic (i.e., accept either a `&str` or a `&[u8]`). |
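For illustration only, here is a minimal sketch of what such polymorphic input could look like; the `Haystack` trait and the free `is_match` function below are hypothetical and not part of this crate:

```rust
// Hypothetical sketch: neither `Haystack` nor this `is_match` exist in the crate.
trait Haystack {
    fn haystack_bytes(&self) -> &[u8];
}

impl Haystack for str {
    fn haystack_bytes(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl Haystack for [u8] {
    fn haystack_bytes(&self) -> &[u8] {
        self
    }
}

// A byte-based matching engine would only ever look at `haystack_bytes()`.
fn is_match<H: Haystack + ?Sized>(haystack: &H) -> bool {
    !haystack.haystack_bytes().is_empty()
}

fn main() {
    assert!(is_match("text input"));
    assert!(is_match(&b"\xFFbinary input"[..]));
}
```

The API that was eventually merged went a different way: a parallel `bytes` sub-module rather than polymorphic inputs (see the commit message further down).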
I've been thinking about this lately, because I'm currently working on converting the matching engines to be byte based instead of character based. This means that, e.g., "OK, so add a flag that makes So this feature requires some serious thought at the API level. |
Missing byte regexes can invite hacks. Imagine a sequence: somebody wants to parse apart some piece of data. At first he just uses strings and regexes, and everything works. But later he adds features or refactors and realizes that this should be a byte buffer, not a string. So regexes should now sometimes extract a string, but other times extract a byte buffer. Case 1: byte regexes supported. The user proceeds with the refactoring, using byte regexes. Nicer, more secure. Case 2: only string regexes supported. The user aborts the refactoring, leaving everything as a string (with some escaping, or just latin-1 everywhere). Uglier, less secure. |
@vi I don't think I'm questioning the utility of byte based regexes. The question isn't, "Why should we want them?" The question is, "How do we do it?" My other comments have described some of the issues. |
Just a parallel set of structs/traits won't do? Like Java's InputStreamWhatever and ReaderWhatever. |
@vi That would double the size of today's API. |
i think it would be useful for fast operations on (mainly) ASCII buffers, e.g. ASCII-based protocols and gene sequence data (which is often stored as an ASCII sequence like GTACGTACGATCGATCG...). byte regexes wouldn’t know anything about unicode and would operate bytewise, e.g. |
@flying-sheep I personally can't recall ever using regexes on sequence data during my tenure as a computational biologist. Regardless, that use case is supported today, since all valid ASCII encodings are valid UTF-8 encodings.
That is what I suggested in a comment above. The naive solution (add a new "byte based" |
incidentally that’s what i’m doing literally right now for a very good reason 😉 if you’re interested: some single cell sequencing protocols have the barcodes responsible for single cell IDs stored in RNA-Seq reads, in ways that read something like “the barcode is the first 5bp, then come 3-6 Gs, followed by the read sequence and an optional poly-A tail”. and i really don’t know a better way to express that kind of flexibility than a regex (see the sketch below).
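For what it's worth, a pattern like that can be written directly over the raw read bytes with the byte-oriented API that was later added. This is a sketch only; the barcode length and the read used below are invented test data, not a real protocol:

```rust
use regex::bytes::Regex;

fn main() {
    // "barcode is the first 5bp, then come 3-6 Gs, then the read sequence,
    // then an optional poly-A tail" -- the input here is made up.
    let re = Regex::new(r"(?-u)^([ACGT]{5})G{3,6}([ACGT]*?)(A*)$").unwrap();
    let read = b"TTACGGGGGCGTACGTACGATCAAAAAA";
    let caps = re.captures(read).unwrap();
    assert_eq!(&caps[1], &b"TTACG"[..]); // the extracted barcode
}
```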
obviously, but i meant being able to ignore non-decodable bytes and indeed never going through a decoding step before applying regexes.
yup, sadly. but apart from facilitating that by using a code generator or macro magic i don’t see a better way. we all know strings aren’t as similar to byte sequences as most people like to think, but regexes make sense for both. |
Right, but does sequence data contain non-decodable bytes? (I would consider avoiding the decoding step for performance reasons to be a bit of a niche use case.)
I am just extremely hesitant to make any decisions whose result is "double the size of the API." That is an enormous cost IMO. I'm not concerned at all about the code duplication, as I'm sure there are plenty of ways to eliminate it. (The first step is making the matching engines byte based instead of codepoint based, which is required before any progress can be made on this issue.) I'm concerned about the size of the API for complexity's sake. Maybe we could put it in a separate sub-module or something. |
no, that would be the other use case of ASCII protocols
things here have to be !!fast sometimes. not decoding seems like a useful step. |
I'm not saying it isn't a valid use case. I know I certainly need it sometimes. But use cases have to be weighed with their costs, which is all I'm trying to do. Once again, to be clear, I really do believe supporting byte-based regexes is something we should pursue. But:
@flying-sheep Just wanted to clarify: addressing your use case of "I have ASCII text and I don't want to pay for Unicode" can be done without supporting byte based regexes in the API (with respect to CPU time). If the automata are byte based instead of codepoint based, then UTF-8 decoding is built into the automaton. If all you have is ASCII text, then the states that decode other codepoints are simply never reached. Their only cost would be memory usage (and take up space in the cache lines). |
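To make that concrete with the `regex::bytes` API that was eventually added (a sketch, assuming a reasonably recent version of the crate):

```rust
use regex::bytes::Regex;

fn main() {
    // A Unicode-aware \w compiled for byte matching: the UTF-8 encoding of
    // 'é' (0xC3 0xA9) is recognized directly, byte by byte, with no decoding
    // step performed on the haystack.
    let re = Regex::new(r"(?u)\w+").unwrap();
    let hay = "café".as_bytes();
    assert_eq!(re.find(hay).unwrap().as_bytes(), hay);

    // Pure ASCII input only ever reaches the ASCII portion of that automaton.
    assert!(re.is_match(b"plain ascii text"));
}
```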
Maybe byte regexes can be a separate crate, like |
@vi That is unfortunately tricky to do. The actual implementation of byte based regexes and Unicode regexes should be the same. It's only the API that's different and maybe one or two steps in the compiler (e.g., |
What about completely independent implementations? Obvious drawback: major duplication of code, increased resource usage when both byte and string regexes are in use... |
@vi I can't imagine why anyone would subject themselves to the pain of maintaining such a thing. |
Code-generating robot doesn't feel pain. |
Since #146 is scheduled for 1.0, could this be considered for the same milestone? |
@birkenfeld There is unfortunately significant work between #146 and this issue. #146 makes it possible to run any of the matching engines on raw bytes, but this issue requires at least these additional things:
No pressure. Just thought I'd ask :) |
I would argue that it's perfectly fine not to support a niche feature if it makes your public API worse. Not that it's the end of the world, of course, but I would feel much better about it if there was at least one person who had a specific application for the feature... |
For example searching for |
Yes, it'd be quite useful if, for example, you were writing a tool to search in arbitrary data whose encoding may not be known (e.g., files on a file system). Imagine being able to use, for example, |
I agree. |
If you're reading files on a file system then each file is probably encoded consistently, even if encodings might differ between files. So then you could just use a UTF-8 regex (provided that its API were extended to allow searching in a &[u8]). The only reason I can think of for having a mixed regex is if you're searching in some file that has embedded UTF-8 but also some non-ASCII binary structure that you need to match. While such things could exist in principle, I've never needed to run a regex on one. |
HTML documents are a perfect example of common documents that aren't always encoded correctly. Everyone loves running regexes on HTML. :-)
It's not necessarily just for when you need to match non-ASCII binary structure, it's necessary for matching UTF-8 among possibly non-UTF-8 bytes. The whole problem with a UTF-8 regex today is that it will immediately stall at the very first invalid byte it sees. You can't utter a regex that will let you skip over it. For example, when one writes |
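This is exactly what the byte-oriented API that eventually landed makes possible: Unicode-aware sub-expressions simply fail to match at invalid bytes, and the search moves past them. A small sketch, assuming a recent `regex` crate:

```rust
use regex::bytes::Regex;

fn main() {
    // Valid UTF-8 words surrounded by bytes that are not valid UTF-8.
    let haystack: &[u8] = b"\xFF\xFEhello\xFFworld\xFE";
    let re = Regex::new(r"(?u)\w+").unwrap();
    let words: Vec<&[u8]> = re.find_iter(haystack).map(|m| m.as_bytes()).collect();
    // The invalid bytes are skipped over instead of aborting the search.
    assert_eq!(words, vec![&b"hello"[..], &b"world"[..]]);
}
```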
I still think that this mixed-mode regex is only useful if you actually want to find matches that span both UTF-8 and non-UTF-8 regions. I still can't imagine writing such a regex (even if I had a HTML file that I knew was a mishmash of different encodings). But since it seems that everyone but me thinks this is a useful feature, I'm going to bow out of the dispute now... |
Shall I try to think up an example[s] where I might need a mixed regex? |
@vi Examples are always useful. I think @jneem's main issue is that this feature complicates the AST, but I think any support of this at all (even without UTF-8 mixed matching) complicates the AST. It'd be nicer to encode the new invariants into the type system, but you end up losing mixed matching and the resulting AST isn't exactly nice either. :-/ |
@jneem also suggested building this directly into the matching engines, which is possible, but you still need to be able to support matching arbitrary bytes in the AST, which means using |
I guess some of that stuff could be pushed off to the compiler if we didn't support mixed mode matching, but then you need to make sure to deal with the Perl classes (`\d`, `\s`, `\w`). |
In any case, the public API should be designed in a way that adding mixed mode later is not an incompatible change, so the engine can be changed under the hood. It can even be a documented but not implemented "teaser" API to gather votes for the feature request. |
Example 1: Hacky extraction of the title from Matroska files

This example lacks mixed regexes, so it outputs binary garbage after the title. Switching on Unicode mode causes it to fail.

```rust
let mkvsnippet = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\x53\x70\x6f\x72\x74\x4d\x80\x98\x54\x76\x68\x65";
let re = BinaryRegex::new(r"\x7B\xA9([\x80-\xFE]|[\x40-\xFF].)(?u:(.*))");
let caps = re.captures(mkvsnippet);
let approximate_title: String = caps.at_str(1);
let encoded_title_length: Vec<u8> = caps.at(0);
```

Example 2: Peeking at

Example 3: Simple

Something like |
@vi Those are pretty cute, thanks. :-) |
This commit enables support for compiling regular expressions that can match on arbitrary byte slices. In particular, we add a new sub-module called `bytes` that duplicates the API of the top-level module, except `&str` for subjects is replaced by `&[u8]`. Additionally, Unicode support in the regular expression is disabled by default but can be selectively re-enabled with the `u` flag. (Unicode support cannot be selectively disabled in the standard top-level API.) Most of the interesting changes occurred in the `regex-syntax` crate, where the AST now explicitly distinguishes between "ASCII compatible" expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too often. When the DFA gives up, either backtracking or the NFA algorithm take over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split up the tests depending on which regex engine we're using. For example, we occasionally need to be able to write tests specifically for `regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte class optimization for the DFA.

Closes #85.
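A brief usage sketch of the `bytes` sub-module described above. This is written against the present-day API (where captures are accessed with `get` and Unicode is enabled by default for `bytes::Regex`), so the details differ slightly from the defaults described in this commit:

```rust
use regex::bytes::Regex;

fn main() {
    // `(?-u)` permits matching arbitrary bytes; the `(?u:...)` group switches
    // Unicode-aware matching back on for part of the same pattern.
    let re = Regex::new(r"(?-u)\x7B\xA9.(?u:(\w+))").unwrap();
    let data: &[u8] = b"\x00\xFF\x7B\xA9\x02hi\xFF";
    let caps = re.captures(data).unwrap();
    assert_eq!(caps.get(1).map(|m| m.as_bytes()), Some(&b"hi"[..]));
}
```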
Very sorry to necropost, but I can't seem to find an issue or discussion about the regex pattern itself being a `&[u8]`. For context, I am working on the Mercurial Rust implementation, where regexes are user-defined, written in any arbitrary encoding (though ASCII-compatible for the special characters, otherwise I don't know how that could possibly work) to match against filenames and paths that follow the same rule. I hope to be wrong or that the doc just needs to be clearer, hehe. |
@Alphare Thanks for the question. In the future, I'd prefer a new issue for questions like this. There is a somewhat related discussion in
Nice! IIRC, Mercurial's implementation of
The important aspect of matching on |
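Since the pattern handed to `Regex::new` must itself be a `&str`, one way to feed in a pattern that arrives as raw bytes (as in the hgignore case) is to rewrite its non-ASCII bytes as `\xNN` escapes and compile with Unicode disabled. A sketch of that idea; the `bytes_pattern_to_str` helper below is hypothetical, not something the crate provides:

```rust
use regex::bytes::Regex;

/// Hypothetical helper: turn a byte-oriented pattern into a `&str` pattern by
/// rewriting non-ASCII bytes as `\xNN` escapes. ASCII bytes pass through
/// unchanged, so regex metacharacters in the original pattern keep their meaning.
fn bytes_pattern_to_str(pattern: &[u8]) -> String {
    let mut out = String::new();
    for &b in pattern {
        if b.is_ascii() {
            out.push(b as char);
        } else {
            out.push_str(&format!(r"\x{:02X}", b));
        }
    }
    out
}

fn main() {
    // A pattern written in Latin-1 (0xE9 is 'é' in Latin-1 and not valid UTF-8).
    let raw_pattern = b"caf\xE9.*";
    let pattern = format!("(?-u){}", bytes_pattern_to_str(raw_pattern));
    let re = Regex::new(&pattern).unwrap();
    assert!(re.is_match(b"caf\xE9 au lait"));
}
```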
@BurntSushi Thank you for the quick response. I will open a separate issue in the future, sorry about that! Mercurial uses either I think the safest bet for me and Mercurial as a project is to also use Again, I appreciate your help. Maybe I can ping |
If you're already using RE2, then |
I would be very interested in a feature comparison between the two engines actually. For now |
The main difference is that this library is faster, primarily due to its more advanced literal optimizations. As for actual features, the most significant difference is that this library has much better support for Unicode and enables such things by default. You can read more about Unicode support here: https://github.com/rust-lang/regex/blob/master/UNICODE.md Otherwise, the two libraries are pretty much identical in terms of semantics. It's still not clear to me how you propose to address backtracking and backreferences...
Was this always your plan? Otherwise, it seems strange to reverse course based on my comments here. |
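To make the Unicode difference mentioned above concrete (a sketch; RE2's `\w` is ASCII-only unless you opt into Unicode classes, whereas here it is Unicode-aware unless explicitly disabled):

```rust
use regex::Regex;

fn main() {
    // \w is Unicode-aware by default in this crate, so it matches 'ï'.
    let unicode = Regex::new(r"^\w+$").unwrap();
    assert!(unicode.is_match("naïve"));

    // With Unicode disabled, \w falls back to ASCII-only [0-9A-Za-z_],
    // which is closer to RE2's default behaviour.
    let ascii = Regex::new(r"(?-u)^\w+$").unwrap();
    assert!(!ascii.is_match("naïve"));
}
```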
Literal optimization as in a string literal in source code or is it something completely different?
I see, I wonder how easy it would be to "fuzz out" all differences between the two aside from the obvious ones you cited.
For now, the plan is to fall back to Python (barring a potentially incredible slowdown) for unsupported patterns, while maybe giving a warning to the user.
The plan among the people discussing this issue has so far been to use RE2 because it's already being used in the Python code and used on huge repositories where edge-cases would have been found. The only reason I opened this issue was to make sure that there could be a way of using this crate. I hope that makes sense. |
I don't think source code has anything to do with it. e.g., In the regex
Interesting. If y'all ever do benchmarking again, I'd be curious to know more details.
Yup, thanks! |
Very cool indeed.
I will let you know when we get there, it should be interesting. If the methodology is good enough, you could even use it in your docs. |
Currently, as far as I can see, regexes here are only for Unicode text. But they can be used to parse binary files as well (or to parse a mixture of binary and text).
For example, how can one implement `strings --all` using regexes in Rust?
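With the `bytes` API that this issue eventually produced, a minimal `strings`-like scan can be written roughly as below (a sketch; it assumes the GNU `strings` default of reporting runs of four or more printable characters):

```rust
use regex::bytes::Regex;

fn main() {
    // Runs of 4+ printable ASCII bytes (space through '~'), as `strings` reports.
    let re = Regex::new(r"(?-u)[\x20-\x7E]{4,}").unwrap();
    let data: &[u8] = b"\x00\x01ELF header\xff\xfe\x00/lib/ld-linux.so.2\x00\x7f";
    for m in re.find_iter(data) {
        // Each match is printable ASCII, so the lossy conversion is exact.
        println!("{}", String::from_utf8_lossy(m.as_bytes()));
    }
}
```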