Implement (at least part of) UTS#18 RL1.3 - Operators in character sets #341

trishume · 2017-02-16T17:00:55Z

I'm working on a syntax highlighting engine in Rust that requires an Oniguruma-compatible regex engine. I'm trying to port it from the onig crate to fancy-regex, but there are some features it doesn't support yet (see trishume/syntect#34).

One of these features is the && operator and nesting in character sets, for example [a-w&&[^c-g]z]. I was thinking this would be added to fancy-regex but @robinst pointed out this comment which suggests that you plan for them to be in the regex crate.
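To make the example concrete, here is a small sketch of what `[a-w&&[^c-g]z]` matches under the Oniguruma reading, `[a-w] AND ([^c-g] OR z)`. The function name `in_example_class` is made up for illustration; it is hand-written membership logic, not code from the regex crate or fancy-regex.

```rust
/// Hypothetical helper: returns true if `c` belongs to `[a-w&&[^c-g]z]`,
/// read as `[a-w] AND ([^c-g] OR z)`.
fn in_example_class(c: char) -> bool {
    // Left operand of `&&`: the plain range [a-w].
    let left = ('a'..='w').contains(&c);
    // Nested negated class [^c-g]: everything except c through g.
    let nested = !('c'..='g').contains(&c);
    // Items listed side by side form a union: [^c-g] OR z.
    let right = nested || c == 'z';
    // `&&` intersects its two operands.
    left && right
}
```

Under this reading the class is equivalent to `[abh-w]`: `z` is allowed by the right-hand side but falls outside `[a-w]`, so it does not match.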

It would be nice if the regex crate supported UTS#18 RL1.3 in full, but the && operator and nesting are all that Oniguruma-compatibility of fancy-regex requires.

I imagine this would take some changes to regex-syntax and then a pass to convert the fancy character sets down to basic character sets. I haven't thought enough about it to know if there are any unicode-related issues that might make this more complex, perhaps by making a tiny fancy character set compile to an enormous basic character set.

@BurntSushi do you have any insight on how difficult you think this would be to add for a contributor not familiar with the internals of regex?


BurntSushi · 2017-02-16T17:08:01Z

> do you have any insight on how difficult you think this would be to add for a contributor not familiar with the internals of regex?

My belief is that the change would be limited to regex-syntax and shouldn't bleed into anything else. If that's true, then that makes this task "medium" instead of "hard" because it reduces the amount of code you need to be familiar with. In fact, I don't think it even requires changing the existing AST.

I'd still consider this to be a tricky task. I think the parser itself seems a bit complex (the char class parser is already complex, so spending some time trying to make that nicer is probably a worthwhile investment). I also think there could be some issue with computing the various set operations because Unicode makes the sets quite large. For example, [^c-g] is every Unicode codepoint other than c-g.
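As a rough illustration of the size concern, the sketch below negates a class stored as sorted, non-overlapping inclusive ranges of code points. The representation, the `negate` function and the `MAX` constant are assumptions made for this example (and the surrogate gap is ignored for simplicity); this is not the regex-syntax implementation.

```rust
/// Highest Unicode code point. The surrogate gap is ignored in this sketch.
const MAX: u32 = 0x10FFFF;

/// Hypothetical helper: complement of a sorted, non-overlapping list of
/// inclusive code point ranges, within 0..=MAX.
fn negate(ranges: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    // Lowest code point that no earlier range has covered yet.
    let mut next = 0u32;
    for &(lo, hi) in ranges {
        if lo > next {
            // The gap before this range belongs to the complement.
            out.push((next, lo - 1));
        }
        next = hi.saturating_add(1);
    }
    if next <= MAX {
        // Everything after the last range belongs to the complement.
        out.push((next, MAX));
    }
    out
}
```

With this representation, `[^c-g]` covers over a million code points but is stored as just two ranges, `(0x0, 0x62)` and `(0x68, 0x10FFFF)`. Whether real set operations stay this cheap depends on how fragmented the operands are.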

In general, parsing takes an insignificant time in regex construction compared to the time it takes to compile the regex. I would hope that even with this change, it would stay that way. Maybe set operations on Unicode sets aren't as bad as I think, but if they wind up slowing down parsing a lot, then we might need to be more clever. And that makes this task "medium" instead of "easyish."

At a higher level, I do have plans to rewrite the parser from scratch (again), but I have no idea when that will happen and you shouldn't plan around it. My original intention was to add UTS#18 RL1.3 when I did that. But doing it before then is fine by me.

BurntSushi · 2017-02-16T17:10:54Z

Also, I would expect test coverage to remain excellent. Every single branch should be covered by a test. (I believe this is true today.)

trishume · 2017-02-16T17:12:59Z

@BurntSushi awesome, thanks for the insight.

BurntSushi · 2017-02-16T17:19:52Z

Oh and I'd be happy to answer questions and mentor anyone who works on this. I'm on the usual IRC channels, typically 4-8pm EST and sporadically on the weekends. Email also works well, which I'm on around the clock.

robinst · 2017-02-17T01:58:42Z

I started looking into this after I made the comment on the syntect issue :). I too think it can be entirely implemented in regex-syntax. Looks like the work is roughly:

1. Make parse_class handle nested classes
2. Make it handle &&
3. Add an intersect method to CharClass
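The last step could look roughly like the sketch below, assuming a class is a sorted list of non-overlapping inclusive ranges. `RangeSet` is a hypothetical stand-in for illustration, not the actual `CharClass` type, and the real method may be structured differently.

```rust
/// Hypothetical stand-in for a character class: sorted, non-overlapping
/// inclusive ranges of chars.
#[derive(Debug, PartialEq)]
struct RangeSet(Vec<(char, char)>);

impl RangeSet {
    /// Intersect two range sets with a single pass over both lists.
    fn intersect(&self, other: &RangeSet) -> RangeSet {
        let (a, b) = (&self.0, &other.0);
        let (mut i, mut j) = (0, 0);
        let mut out = Vec::new();
        while i < a.len() && j < b.len() {
            // The overlap of the two current ranges, if there is one.
            let lo = a[i].0.max(b[j].0);
            let hi = a[i].1.min(b[j].1);
            if lo <= hi {
                out.push((lo, hi));
            }
            // Advance whichever range ends first; the other one may still
            // overlap with the next range on the opposite side.
            if a[i].1 < b[j].1 {
                i += 1;
            } else {
                j += 1;
            }
        }
        RangeSet(out)
    }
}
```

For example, intersecting `[a-w]` with `[^c-g]` (written as the ranges `'\0'..='b'` and `'h'..=char::MAX`) yields `a-b` and `h-w`. The pass is linear in the number of ranges, not in the number of code points they cover.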

We should probably focus on && as a first step, I'm not sure the other operations are in wide use (e.g. Oniguruma only has &&).

The first change I have is extracting a parse_class_escape method to make parse_class a bit smaller. Would you prefer individual pull requests for steps along the way, or a bigger one with multiple commits at the end?

BurntSushi · 2017-02-17T02:21:40Z

I think a single commit with everything would be best. If you wanted to open a PR for earlier feedback that's fine, so long as it gets squashed in the end. :-)


robinst · 2017-02-21T10:13:25Z

Quick status update: I have support for nested classes and intersection using &&, currently writing all the tests to make sure all things are covered.

This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word characters except `a`.

Fixes rust-lang#341

robinst · 2017-02-22T10:57:35Z

Pull request for nested classes and intersection: #346


…ction, r=BurntSushi

Support nested character classes and intersection with `&&`

This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word characters except `a`.

Fixes #341

trishume mentioned this issue Feb 16, 2017

[WIP] Kinda-working fancy-regex support trishume/syntect#34

Closed

7 tasks

BurntSushi added the enhancement label Feb 18, 2017

robinst mentioned this issue Feb 22, 2017

Support nested character classes and intersection with && #346

Merged

bors closed this as completed in #346 May 21, 2017
