-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regex 1.0 #1620
regex 1.0 #1620
Conversation
### Expansion concerns | ||
[expansion-concerns]: #expansion-concerns | ||
|
||
There are a few possible avenues for expansion, and we take measures to make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility here is "compatibility" flags, e.g. if there is a breaking change for whatever reason, people can opt-in to it with some sort of flag like (?X)
, ala (?i)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, good idea, added!
I wonder if They could return pub struct Found<'a> {
string: &'a str,
start: usize,
end: usize,
}
impl<'a> Found<'a> {
fn as_str(&self) -> &'a str { self.string }
fn as_indices(&self) -> (usize, usize) { (self.start, self.end) }
} (I'm not happy with the names.) |
`String`. | ||
* The `Error` type no longer has the `InvalidSet` variant, since the error is | ||
no longer possible. Its `Syntax` variant was also modified to wrap a `String` | ||
instead of a `regex_syntax::Error`. If you need access to specific parse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it could wrap a type defined in regex
that internally contains a regex_syntax::Error
(the type may just implement Error
, Display
etc. for now), so we can choose to expose more detailed info in future if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a universally better idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
@huonw I do like I think I can buy that Another alternative is adding a new method with a different return type, but I've tried hard to resist that urge. Another alternative is to just have |
Hmm, neither the byte or string based regex implementations would be suited for matching on It would be nice if the actual matching code, and all FSM related code could be generic over an element type. That way the regular expression syntax would be specific to the use (so it might only define regular expression syntax for unicode strings and byte sequences initially) but the rest of the algorithm is reusable, such that external crates could define their own "front ends" to the matching engine. It would also allow future expansions to other string-like types (such as defining a regex syntax for matching |
The failure I was imagining is that code with several strings (possibly several that get regex'd) could accidentally mix up which strings are being sliced: Of course, you're in closer touch with the API/the crate than me so you're likely to have a better sense for the trade-offs here. |
Nope. The FSM implementation is currently very much coupled to UTF-8.
I think this is a valuable thing, but is out of scope for this RFC. This RFC is not for breaking new ground and inventing a general purpose regular expression interface. I perceive this RFC to be for stabilizing well trodden ground. I'll be more clear: I personally don't want to design a regex interface. It's hard work, requires lots of experimentation, lots of use and at least two real implementations of said interface before I could believe it is good enough to stabilize. |
Your point is totally valid and your suggestion clearly has more safety. But that little bit of extra indirection is a touch annoying. I honestly don't think there is a clear right answer. Perhaps others can weigh in! |
What about |
I was on the (likely unfounded) assumption that the |
There is an implementation of |
@eddyb I don't follow. Could you explain a touch more please? |
I wasn't saying that we should block anything, just that I don't think we need to return anything more than I'm curious as to what you think about the "janky API", though. Do you think the |
I don't know. I'm not sure what "second class citizen" means in this context. It seems like it depends on how you're using it. I said "janky" at least partially because I misunderstood your suggestion. For example, if we removed the If you're not advocating removing |
@eddyb clarified on IRC that he was talking about changing pub fn find(&self, text: &str) -> Option<::std::ops::Range<usize>>; Which is kind of nice, but I don't think actually addresses @huonw's concern. :-/ |
http://doc.rust-lang.org/regex/regex/index.html#syntax | ||
|
||
To my knowledge, the evolution as proposed in this RFC has been followed since | ||
`regex` was created. The syntax has largely remain unchanged with few |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "remained"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed! Thanks!
Why the |
I'd suggest to really think about the trait based version, or any alternative that allow the use of arbitrary iterators. Regex matching on non |
This RFC need not define the new API, but it needs to consider the current API in the context of future expansion: will the current API make it difficult to add these features in future without breaking backwards compatibility? |
2. It was faster. | ||
|
||
Today, (1) can be simulated in practice with the use of a Clippy lint and (2) | ||
is no longer true. In fact, `regex!` is at least one order of magnitude faster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
faster
-> slower
?
@RustPowers Could you explain why it is much better? Cloning the pattern string is insignificant (and always will be). |
@Diggsey The current API is defined to match on UTF-8 encoded text (or "ASCII compatible" text when the Unicode flag is disabled in a |
@mawww I think it is definitely a valuable feature, and can be important, for sure. But ultimately, I think it is out of scope for this RFC. Notably, I do not think it is unreasonable to expect implementors of text editors to provide their own regex implementation that works under their particular constraints. I note that the Some things to consider: Matching text that is not contiguous in memory is an orthogonal feature to matching arbitrary iterators, so your request is a little bit murky. I suspect text editors care more about matching text that isn't contiguous in memory rather than being able to use an arbitrary iterator. Matching on an arbitrary iterator is a streaming interface and is challenging for a number of reasons, and generally requires some type of clever buffering if you want to be fast. (There exist specialized regex engines built around the idea of streaming and it permeates their implementation and interface, for good reason.) Matching on non-contiguous memory is probably quite a bit easier, but would be a significant departure from today's implementation, which makes many assumptions about the layout of text in memory for performance reasons. I could conceive of a future where the |
I don't want to split the regex API over multiple traits. I want a single cohesive interface because I think that is easiest to consume. I feel strongly about this. The current regex crate API has been out in the wild and in use for two years at this point, there's no hurry.
I don't necessarily agree. If mandatory arguments are in the constructor, then there's no way to misuse the API by "forgetting" to call a method before compilation.
Yes, this was brought up above and I agreeish.
Because
What are the semantics of
Yes, this is in the RFC. It is being removed.
This is out of scope IMO. If someone can write a meaningful benchmark that would benefit from those methods, then I think I'd be happy to oblige, but we don't need to resolve this for 1.0.
Because reading the original string of the regex that was compiled is occasionally useful and doesn't really cost us anything.
Do you mean reverse searching? Maybe one day, but it's out of scope for 1.0.
Yes, I'm going to go over the iterator names and make sure they line up with conventions in |
@RustPowers Thanks for the feedback! I purposefully didn't respond to a few of your concerns because they were about the implementation. If you'd like to talk about those, I'd be happy to, but please open an issue for each on the regex repo. |
Sorry everyone for the delay, but summer hit and it was hard for me to find my way back. I think I'm now finally caught up with everyone's feedback. Here are some changes:
The implementation has also been updated to reflect these changes. |
🔔 This RFC is now entering its final comment period 🔔 The @rust-lang/libs team is inclined to merge this RFC, given the recent updates @BurntSushi has made. @rust-lang/libs if you could check off your name below or write a comment below to the contrary: |
Is implementing |
Just an amusing comment, since I've had this discussion about other languages in the past. Most languages (which support functions on objects) have a regex api where the regex is an object and searching/matching is done with methods on those objects. C++ seems to stand out in that, the regex class has very few methods and searching/matching is done through free functions. The motivation is that regex's don't actually perform searching or matching, they are only an input into a searching/matching algorithm. I'm mildly curious if this was ever considered? (While it would likely be easy to change, given Rust's privacy model, it's probably too late in the process to have it be worth it.) |
@ahmedcharles Not really. The Rust ecosystem generally eschews free functions in favor of types with methods. I also disagree that regexes don't actually perform searching or matching. Much of the API is very closely coupled with regexes. (For example, only a subset of the |
Also, a |
I feel like allocator support and the synchronization problem are closely tied together---an API to explicitly preallocate and optionally share scratch space solves this nicely. I'm also wary of the bytes duplication / want to extend bytes to supports iterators or arbitrary sized types (streamed, fix-sized atoms). Most importantly, I think the recent acceptance of procedural macros 1.1 can put the wind back into the sails of regex-macros. I assume that the vast majority of use-cases use statically known regexes and thus are more appropriately handled with such an interface (I don't like the smell of lint + unwrap). Given all this, I think the time will soon be ripe for a big overhaul, in a way it hasn't been since regex was first created. If making a 1.0 doesn't impede progress on a 2.0 at all, sure, go ahead. Otherwise 👎 from me. |
Improving The Rust ecosystem hasn't solved the problem of allocator support yet. If it makes sense to build regex on top of that, and if that absolutely requires breaking changes, than a 2.0 release seems fine to me. |
@BurntSushi do you know where the maintenance policy for stable rust-lang crates is laid out? If 2.0 can released not because breaking changes are "absolutely required" but just because they're nice to have, that would be good to know. |
@Ericson2314 That would probably come under the purview of the rust-lang crates RFC. Specifically:
To specifically answer your question, my interpretation is that subsequent major version bumps are largely left up to the discretion of the maintainers, community, libs team and the willingness of someone to write an RFC for it. |
@BurntSushi Thanks for the link and quote. I either crates.io can support per-release versions, or prototypes can be disseminated via git and git dependencies (master vs stable branch for example). Does that sounds good to you as maintainer? [N.B. I'm reminded that bitflags should probably do this for experimenting with associated constants.] |
@Ericson2314 It doesn't sound unreasonable to me, but I can't really know what we'll do because I can't predict what exactly will provoke a new major version upgrade. |
That's reasonable. |
Just wanted to say: this is a beautiful and inspiring RFC. Basically every question I had while reading it was answered in the next paragraph. I'm 👍 on landing this! My biggest worry is about the bytes/utf8 split, but I pretty much buy your reasoning that the proposed API is the simplest, most ergonomic way to deal with it and that there aren't a lot of advantages for trying to be more clever. We've talked sometimes about "inherent trait impls", where you implement inherent methods and a trait at the same time. That might eventually provide a way to ensure a consistent API across the two variants while retaining the simplicity and ergonomics of the current setup. But that's something to explore later down the line. |
Ok, the decision of the libs team was to merge and accept, so I will do so! Thanks again for the RFC @BurntSushi! |
"rendered" link should now point here https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md |
@ciphergoth Thanks! Updated! |
rendered
Pretty much all of the RFC is implemented and can be tracked at this PR: rust-lang/regex#230
rustdoc
output of the corresponding API is also available: http://burntsushi.net/stuff/regex-rfc/regex/