-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support nested character classes and intersection with &&
#346
Support nested character classes and intersection with &&
#346
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments in the code with remarks/questions. Please don't hold back with comments of any kind, I'm still quite new to Rust :).
I can add the user documentation once we have a better understanding of all the rules.
/// Calculate the intersection of two canonical character classes. | ||
/// | ||
/// The returned intersection is canonical. | ||
fn intersection(&self, other: &CharClass) -> CharClass { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this should be pub
or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say hold off for now. Thanks.
regex-syntax/src/lib.rs
Outdated
} else { | ||
// No more ranges to check, done. | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is the most idiomatic way to do this :). Maybe using old-fashioned indexing would work better in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
match iter.next() {
Some(v) => *item = v,
_ => break // no more ranges to check, done
}
?
} | ||
|
||
Expr::ClassBytes(byte_class) | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same as before, just moved.
regex-syntax/src/parser.rs
Outdated
fn class_nested_class_brackets_hyphen() { | ||
// This is really confusing, but `]` is allowed if first character within a class | ||
// It parses as a nested class with the `]` and `-` characters | ||
assert_eq!(p(r"[[]-]]"), Expr::Class(class(&[('-', '-'), (']', ']')]))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the decision to allow ]
unescaped as the first character in a class final? Thinking about this and how it interacts with &&
was one of the most confusing parts about this change.
I'd wish the rule for ]
was that it needs to be escaped inside a character class, no matter where it is. The same for -
. Is it too late to change this? It would simplify the parsing a bit, and would make it easier to explain how things work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if it is too late to break compatibility with existing regexes we can at least ban ]
and possibly -
as first characters of a nested character class. As that test demonstrates, it can get really confusing when nested character classes are involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. The downside of this is that it would break the useful rule of "everything that can be a top-level character class can also be a nested character class".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit that this is exasperated by the presence of nested character classes, but AFAIK, this is pretty standard. Changing this would, no doubt in my mind, require a semver version bump. I really really don't like breaking changes in the regex syntax, so I would rather live with the implementation complexity over the breaking changes and inconsistencies with other regex engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Just tested with Java's Pattern, even its implementation allows it. So yeah, makes sense to keep it. (It's just one of those things that seem like they were originally done "not because it's a good idea, but because we can", and now everyone has to support it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It's just one of those things that seem like they were originally done "not because it's a good idea, but because we can", and now everyone has to support it.)
I have no doubt that that is indeed the case. :-)
In fact, it's probably true for a lot of things in the regex syntax. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks excellent to me. I reviewed everything and left a couple comments suggesting minor changes and asking questions. Looks well tested, did you check the branch coverage like @BurntSushi mentioned should stay perfect?
regex-syntax/src/parser.rs
Outdated
'&' => { | ||
// intersection with `&&` | ||
self.bump(); | ||
self.bump(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this throw a syntax error (or just treat it as a literal &
in the character class, not sure what spec says) if the second character you bump
over isn't also an &
?
Edit: noticed later that you only exit parse_class_set
if you look ahead and see &&
. Maybe add a comment on the second bump saying this was verified to be &
in parse_class_set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, needs a comment, adding it.
|
||
Expr::ClassBytes(byte_class) | ||
})) | ||
Build::Expr(Expr::ClassBytes(class2)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are all these extra cases for? I don't see how they relate to the rest of the changes in this PR. Do they add support for more types of escapes in character classes? Which tests test these new branches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to see in the diff, but this is unchanged from before, just extracted into its own parse_class_escape
method.
regex-syntax/src/parser.rs
Outdated
fn class_nested_class_brackets_hyphen() { | ||
// This is really confusing, but `]` is allowed if first character within a class | ||
// It parses as a nested class with the `]` and `-` characters | ||
assert_eq!(p(r"[[]-]]"), Expr::Class(class(&[('-', '-'), (']', ']')]))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if it is too late to break compatibility with existing regexes we can at least ban ]
and possibly -
as first characters of a nested character class. As that test demonstrates, it can get really confusing when nested character classes are involved.
regex-syntax/src/parser.rs
Outdated
'\\' => match try!(self.parse_escape()) { | ||
Build::Expr(Expr::Class(class2)) => { | ||
// Nested set, e.g. `[c-d]` in `[a-b[c-d]]` | ||
let class2 = try!(self.parse_class_as_chars()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had chance to do a thorough review yet (although, from what I see, this looks really really good), but this is a potential problem. In particular, this turns a parser with predictable stack growth into a parser with unpredictable stack growth. This is bad because if a program accepts regexes as user input and they do [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[...
or some such, then this recursion will cause a stack overflow.
With that said, there are various parts of this crate that use recursion over the abstract syntax, but not before checking the nesting limit. It's not quite a guarantee (since who knows how big the stack really is), but it does afford the caller some control over what happens if a very bad regex is provided.
This basically means you would need to maintain an explicit stack in the parser state of nested character classes, and this is what the Build
enum is for. Right now, the Build
enum only cares about groups and alternates, but maybe it's easy to add a variant for nested classes?
I'm sorry I didn't bring this point up earlier. I should have seen it coming, but I completely forgot about it until the code was in front of me.
How willing are you to make this change? If not, I might be able to finish it (I just don't know when I will).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point. I remember wondering about that problem while implementing it, but then forgot later.
I'll look into using Build
, yeah. Maybe it could also be a separate stack. Have to wrap my head around it first though :).
c436bfd
to
93f1b69
Compare
Rewrote it to use a stack instead of recursion. Also added a test for deeply nested character classes that didn't pass before but now it does :). I decided to not use |
@robinst Awesome. I looked over it briefly but I'm too tired to fully understand it at the moment, so I haven't done anything close to a careful review. One thing I'm unsure about is if the important thing is just to not stack overflow, or to have limited memory usage. At the moment this should use heap proportional to at most the length of the regex as far as I can tell. But @BurntSushi said something about the nesting limit, and I'm not sure if checking it is warranted in this case, or only when using recursion. |
Hmm, yes, the primary purpose of the nesting limit is to mitigate the risk of stack overflow. However, in the nested character case, all nestings are resolved (I think) to a single flattened set of Unicode codepoints. Heap space may indeed be unbounded, but:
|
Yes, they are.
Yes, the Did you have a chance to look at the change yet? I was looking into a |
93f1b69
to
2f5c967
Compare
Added the quickcheck test as well now. |
2f5c967
to
2f872b4
Compare
OK, I've finally had a chance to review this in more depth and I'm pretty much speechless. This is phenomenal work. I looked hard, but I don't see anything to complain about! I'm not sure if #354 will cause conflicts, so you might wind up needing to rebase, but let's see what happens. (Bors seems to be stuck at the moment...) @bors r+ |
📌 Commit 2f872b4 has been approved by |
🔒 Merge conflict |
☔ The latest upstream changes (presumably #354) made this pull request unmergeable. Please resolve the merge conflicts. |
@robinst Yup, this will need a rebase! |
2f872b4
to
d3b5d32
Compare
This implements parts of UTS#18 RL1.3, namely: * Nested character classes, e.g.: `[a[b-c]]` * Intersections in classes, e.g.: `[\w&&\p{Greek}]` They can be combined to do things like `[\w&&[^a]]` to get all word characters except `a`. Fixes rust-lang#341
d3b5d32
to
bb233ec
Compare
Wow, thank you :)! It was surprisingly simple to do this change, the codebase is very readable! Rebased it now! |
@bors retry |
@bors r+ |
📌 Commit bb233ec has been approved by |
…ction, r=BurntSushi Support nested character classes and intersection with `&&` This implements parts of UTS#18 RL1.3, namely: * Nested character classes, e.g.: `[a[b-c]]` * Intersections in classes, e.g.: `[\w&&\p{Greek}]` They can be combined to do things like `[\w&&[^a]]` to get all word characters except `a`. Fixes #341
☀️ Test successful - status-appveyor, status-travis |
This implements parts of UTS#18 RL1.3, namely:
[a[b-c]]
[\w&&\p{Greek}]
They can be combined to do things like
[\w&&[^a]]
to get all wordcharacters except
a
.Fixes #341