-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: conditionals #263
Comments
I'll just say this up front: I agree that conditionals can be useful, and that in some circumstances, they may make writing/reading/maintaining regexes easier. (This is also true of many of the bells and whistles you find in other regex libraries, like full blown backreferences and look around.) However, one design constraint of this library is that matching must have time complexity no greater than linear with respect to the input. In other words, when searching With that said, this feels an awful lot like backreferences, albeit with a seemingly more limited power. It definitely isn't clear to me how said conditionals could be encoded in a state machine. In particular, a notable property of capturing groups in this library is that they can never impact whether a regex as a whole matches or not. (Certainly, grouping itself can, but the property of whether a group is capturing or not does not.) This feature would change that, such that even if it were possible to do this while maintaining linear search time, it would require substantial changes. With that said, I do have some suggestions that might help solve your specific problem in a different way:
|
Thanks for the detailed response. I was thinking that it might be difficult given the design constraint. How would you implement 1. for the given regex above? Run If at any point you see the opportunity to implement some version of this, it would help me a great deal! :) |
For (1), I might not have been clear. I wasn't suggesting using multiple regexes. Instead, I was suggesting to use string interpolation. In particular, you mentioned that this regex let pat = ".*?";
let re = Regex::new(&format!(r"\({pat}\)|{{{pat}}}", pat)).unwrap(); For (2), whether it's too wasteful or not depends on the problem you're trying to solve and the frequency of matches. For example, if the first regex has a very low false positive rate and the second regex doesn't need to scan the entire input, then the performance could be quite competitive.
Sadly, I don't think it's going to happen for reasons stated previously. |
Let's say that I have n starting and ending patterns (the example above had So what you meant by 1) is to chain the n different possibilities using How is that different in meaning and performance from 2), where instead of Also, what would be the consequence with regards to memory consumption?
|
For (1), I was merely suggesting that you use string interpolation to reduce redundancy in writing and maintaining the regular expression. The resulting regex would still be large. The performance trade offs between one large regex and multiple smaller regexes are not at all clear. One large regex could be just as fast as one small regex. It is not wise to make performance assumptions based on the size of the regex. The only thing you really know for sure is that the bigger the regex the more memory it will use, but you can't really make any conclusions about its match performance. Remember, this library is based on finite state machines, so the way alternations work is very different than how they work in a traditional backtracking engine.
I think the trade offs of conditionals aren't being appropriately account for here. As I said before, it's not clear that a linear time implementation exists. If they turn out to be equivalent to backreferences in expressive power, then it would turn into an NP-complete problem to solve them, which likely takes worst case exponential time. Whether the additional memory consumption is a problem or not isn't clear. If the regex uses 10KB of heap instead of 1KB, is that an issue? |
Fair enough, long regex it is then. There are several capture groups in the middle part and they would now be duplicated as well. I could either use an index offset to identify them or, for readability, use named groups. Are they supported by this crate or would you recommend a different approach? |
Named groups are supported: https://doc.rust-lang.org/regex/regex/index.html#grouping-and-flags |
Yes, but I couldn't find any documentation on what happens if the same name is used multiple times, as it would be here. Only one capture group will ever be used, i.e., it will not have multiple return values, but I don't think the compiler can deduce that. |
Ah, I see. Yeah, duplicate capture names aren't allowed. A potential alternative behavior is to permit multiple capturing groups with the same name, and fill in the "last" matched value. |
Yes, that would be perfectly fine for my use case. That behaviour could live behind a flag, so that you have to explicitly enable multiple named groups. |
Could you say more about the specific use case that benefits from having multiple capture groups with the same name? (It seems like a separate issue than this one.) |
We're still trying to do the same thing - imagine
|
@alubbe It honestly sounds to me like it would be worth it to build up a good abstraction around the problem you're trying to solve, and in particular, abstract away the details of actually writing the regex in the first place. It seems like you don't actually need duplicate capture names for that. If you are indeed concatenating a bunch of regexes with |
Fair enough, I will try working with an offset on the index to avoid named Would it be possible to configure the RegexSet (by setting a flag or
|
@alubbe The point of There are technically performance differences between "is match," "which matched first" and "which matched."
The match semantics are different. In a |
Awesome, thanks for clarifying!
|
I have some uses cases where it is very useful to have conditionals based on whether previous capture groups were successful, i.e.,
(?(1)some_regex|another_regex)
Take for example this subject
I want to match everything between
()
and{}
, whichever closure is relevant. Because the regex describing the content in the closure is very long and complex, I would like to re-use it and have specify different ending characters depending on the starting character, like so:Currently, I would have to break it up like this:
Which, in this short example, is better - because
.*?
is very short. But in real life, this is very long and very complex, and I have 5 different kind of starting and ending combinations that I would like to separate. So right now, I'd have to increase the length of the full regex five-fold to get the wanted behaviour. Having conditionals would be really helpful!The text was updated successfully, but these errors were encountered: