Implement (at least part of) UTS#18 RL1.3 - Operators in character sets #341
Comments
My belief is that the change would be limited to `regex-syntax`. I'd still consider this to be a tricky task. I think the parser itself seems a bit complex (the char class parser is already complex, so spending some time trying to make that nicer is probably a worthwhile investment). I also think there could be some issue with computing the various set operations because Unicode makes the sets quite large. For example, In general, parsing takes an insignificant amount of time in regex construction compared to the time it takes to compile the regex. I would hope that even with this change, it would stay that way. Maybe set operations on Unicode sets aren't as bad as I think, but if they wind up slowing down parsing a lot, then we might need to be more clever. And that makes this task "medium" instead of "easyish." At a higher level, I do have plans to rewrite the parser from scratch (again), but I have no idea when that will happen and you shouldn't plan around it. My original intention was to add UTS#18 RL1.3 when I did that. But doing it before then is fine by me.
Also, I would expect test coverage to remain excellent. Every single branch should be covered by a test. (I believe this is true today.)
@BurntSushi awesome, thanks for the insight.
Oh and I'd be happy to answer questions and mentor anyone who works on this. I'm on the usual IRC channels, typically 4-8pm EST and sporadically on the weekends. Email also works well, which I'm on around the clock.
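I started looking into this after I made the comment on the syntect issue :). I too think it can be entirely implemented in `regex-syntax`. Looks like the work is roughly:
1. Make `parse_class` handle nested classes
2. Make it handle `&&`
3. Add an `intersect` method to `CharClass`
We should probably focus on `&&` as a first step; I'm not sure the other operations are in wide use (e.g. Oniguruma only has `&&`).
The first change I have is extracting a `parse_class_escape` method to make `parse_class` a bit smaller. Would you prefer individual pull requests for steps along the way, or a bigger one with multiple commits at the end?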
I think a single commit with everything would be best. If you wanted to
open a PR for earlier feedback that's fine, so long as it gets squashed in
the end. :-)
On Feb 16, 2017 8:58 PM, "Robin Stocker" wrote:
I started looking into this after I made the comment on the syntect issue
:). I too think it can be entirely implemented in regex-syntax. Looks
like the work is roughly:
1. Make parse_class handle nested classes
2. Make it handle &&
3. Add a intersect method to CharClass
We should probably focus on && as a first step, I'm not sure the other
operations are in wide use (e.g. Oniguruma only has &&).
The first change I have is extracting a parse_class_escape method to make
parse_class a bit smaller. Would you prefer individual pull requests for
steps along the way, or a bigger one with multiple commits at the end?
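The third step in the list above (an `intersect` operation on a class) can be sketched as a two-pointer merge. This is only an illustration, not the code from `regex-syntax`: it assumes a class is stored as a sorted list of non-overlapping inclusive ranges, and the free function `intersect` and the `Range` alias are hypothetical names.

```rust
/// Inclusive range of Unicode scalar values, e.g. ('a', 'z').
/// Hypothetical representation chosen for this sketch.
type Range = (char, char);

/// Intersects two classes, each given as sorted, non-overlapping
/// inclusive ranges. The output is again sorted and non-overlapping.
fn intersect(a: &[Range], b: &[Range]) -> Vec<Range> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        // The overlap of the two current ranges, if there is one.
        let lo = a[i].0.max(b[j].0);
        let hi = a[i].1.min(b[j].1);
        if lo <= hi {
            out.push((lo, hi));
        }
        // Advance whichever range ends first; the other one may still
        // overlap the next range on the opposite side.
        if a[i].1 < b[j].1 {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

Each loop iteration advances one of the two indices, so the work is proportional to the number of ranges in both inputs and not to the number of code points they cover. That is relevant to the concern raised earlier in the thread about large Unicode sets.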
Quick status update: I have support for nested classes and intersection using `&&`.
This implements parts of UTS#18 RL1.3, namely:
* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`
They can be combined to do things like `[\w&&[^a]]` to get all word characters except `a`.
Fixes rust-lang#341
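To make the two features concrete, here is a toy recursive-descent parser for nested classes and `&&`. It is a sketch under heavy assumptions and is not the parser from the pull request: it handles ASCII literals and ranges only (no escapes such as `\w` or `\p{Greek}`), it stores a class as a 128-entry membership table, and the names `Parser`, `parse` and `members` are made up for this example. It also assumes that juxtaposition (union) binds tighter than `&&`, and that a leading `^` negates the whole class.

```rust
/// Membership table for the 128 ASCII code points (a simplification).
type Set = [bool; 128];

struct Parser {
    chars: Vec<char>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> Option<char> {
        self.chars.get(self.pos).copied()
    }

    fn at_intersection(&self) -> bool {
        self.chars[self.pos..].starts_with(&['&', '&'])
    }

    /// class := '[' '^'? union ('&&' union)* ']'
    fn parse_class(&mut self) -> Result<Set, String> {
        if self.peek() != Some('[') {
            return Err("expected '['".to_string());
        }
        self.pos += 1;
        let negated = self.peek() == Some('^');
        if negated {
            self.pos += 1;
        }
        let mut set = self.parse_union()?;
        while self.at_intersection() {
            self.pos += 2; // skip "&&"
            let rhs = self.parse_union()?;
            for (s, r) in set.iter_mut().zip(rhs.iter()) {
                *s = *s && *r;
            }
        }
        // parse_union only returns Ok at ']' or "&&", so this is ']'.
        self.pos += 1;
        if negated {
            for s in set.iter_mut() {
                *s = !*s;
            }
        }
        Ok(set)
    }

    /// union := (class | literal ('-' literal)?)*
    fn parse_union(&mut self) -> Result<Set, String> {
        let mut set = [false; 128];
        loop {
            match self.peek() {
                None => return Err("unclosed character class".to_string()),
                Some(']') => return Ok(set),
                Some('&') if self.at_intersection() => return Ok(set),
                Some('[') => {
                    // Nested class: parse recursively and union it in.
                    let nested = self.parse_class()?;
                    for (s, n) in set.iter_mut().zip(nested.iter()) {
                        *s = *s || *n;
                    }
                }
                Some(lo) => {
                    self.pos += 1;
                    let mut hi = lo;
                    // A '-' followed by anything but ']' starts a range.
                    let next = self.chars.get(self.pos + 1);
                    if self.peek() == Some('-') && matches!(next, Some(&c) if c != ']') {
                        hi = self.chars[self.pos + 1];
                        self.pos += 2;
                    }
                    if !hi.is_ascii() || lo > hi {
                        return Err(format!("unsupported range {lo}-{hi}"));
                    }
                    for c in lo as usize..=hi as usize {
                        set[c] = true;
                    }
                }
            }
        }
    }
}

/// Parses a single bracketed class and returns its ASCII membership table.
fn parse(pattern: &str) -> Result<Set, String> {
    let mut p = Parser { chars: pattern.chars().collect(), pos: 0 };
    let set = p.parse_class()?;
    if p.pos != p.chars.len() {
        return Err("trailing input after class".to_string());
    }
    Ok(set)
}

/// Lists the members of a set in code point order, for inspection.
fn members(set: &Set) -> String {
    (0u8..128).filter(|&b| set[b as usize]).map(|b| b as char).collect()
}
```

With these definitions, `members(&parse("[a[b-c]]").unwrap())` yields the three letters `abc`, and `parse("[a-z&&[^a]]")` yields every lowercase ASCII letter except `a`.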
Pull request for nested classes and intersection: #346
…ction, r=BurntSushi Support nested character classes and intersection with `&&` This implements parts of UTS#18 RL1.3, namely: * Nested character classes, e.g.: `[a[b-c]]` * Intersections in classes, e.g.: `[\w&&\p{Greek}]` They can be combined to do things like `[\w&&[^a]]` to get all word characters except `a`. Fixes #341
I'm working on a syntax highlighting engine in Rust that requires an Oniguruma-compatible regex engine. I'm trying to port it from the `onig` crate to fancy-regex, but there are some features it doesn't support yet (see trishume/syntect#34).
One of these features is the `&&` operator and nesting in character sets, for example `[a-w&&[^c-g]z]`. I was thinking this would be added to fancy-regex but @robinst pointed out this comment which suggests that you plan for them to be in the `regex` crate.
It would be nice if the `regex` crate supported UTS#18 RL1.3 in full, but the `&&` operator and nesting are all that Oniguruma-compatibility of fancy-regex requires.
I imagine this would take some changes to `regex-syntax` and then a pass to convert the fancy character sets down to basic character sets. I haven't thought enough about it to know if there are any unicode-related issues that might make this more complex, perhaps by making a tiny fancy character set compile to an enormous basic character set.
@BurntSushi do you have any insight on how difficult you think this would be to add for a contributor not familiar with the internals of `regex`?
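On the worry that a tiny fancy set could compile to an enormous basic set: if a class is stored as ranges instead of individual code points, negation stays small in terms of range count even when it covers almost everything. The sketch below is illustrative only. It assumes a sorted, non-overlapping range list over `0..=0x10FFFF` and ignores the surrogate gap for simplicity; `negate`, `code_points`, `Range` and `MAX` are hypothetical names.

```rust
/// Inclusive range of code points (hypothetical representation).
type Range = (u32, u32);

/// Largest Unicode code point. Surrogates are not excluded in this sketch.
const MAX: u32 = 0x10FFFF;

/// Complements sorted, non-overlapping ranges within 0..=MAX.
fn negate(ranges: &[Range]) -> Vec<Range> {
    let mut out = Vec::new();
    let mut next = 0u32; // first code point not yet accounted for
    for &(lo, hi) in ranges {
        if lo > next {
            // The gap before this range belongs to the complement.
            out.push((next, lo - 1));
        }
        next = hi + 1;
    }
    if next <= MAX {
        out.push((next, MAX));
    }
    out
}

/// Counts how many code points a range list covers.
fn code_points(ranges: &[Range]) -> u64 {
    ranges.iter().map(|&(lo, hi)| u64::from(hi - lo) + 1).sum()
}
```

For `[^c-g]`, `negate(&[('c' as u32, 'g' as u32)])` returns two ranges, even though together they cover more than a million code points. The output has at most one more range than the input, so the size that matters for later set operations grows slowly.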