Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Handle regexes with unicode escape sequences in .find and .findAll #43

Merged
merged 2 commits into from
Nov 8, 2017

Conversation

maxbrunsfeld
Copy link
Contributor

@maxbrunsfeld maxbrunsfeld commented Nov 8, 2017

Fixes atom/atom#16126

Background

The TextBuffer.find family of methods use PCRE to perform regex matching directly on the buffer's native contents. This has a number of advantages: it eliminates a lot of string copying, removes to need for conversions between raw character indices and row/column coordinates, and allows searching to be done on a background thread if desired. See #5, #35.

Problem

There are some differences between PCRE's and ECMAScript's regex syntax. One difference is that PCRE does not support the \u00df syntax for specifying UTF16 character codes.

Solution

Luckily, it's trivial to convert these character sequences into their actual UTF16 values. I've added that conversion in this PR.

Future Work

There may be other incompatibilities between PCRE and ECMAScript regexes. One other example I know of is the unicode code point escape sequence, which are slightly different than the UTF16 character code escape sequences I've dealt with here. They should be easy to support as well, though I have not done so here.

/cc @t9md

@maxbrunsfeld maxbrunsfeld force-pushed the mb-handle-unicode-escape-sequences branch from afda1d3 to 3ea87c5 Compare November 8, 2017 21:44
@maxbrunsfeld maxbrunsfeld merged commit 6694dd5 into master Nov 8, 2017
@maxbrunsfeld maxbrunsfeld deleted the mb-handle-unicode-escape-sequences branch November 8, 2017 21:51
@nathansobo
Copy link
Contributor

💖

@leroix
Copy link
Contributor

leroix commented Nov 8, 2017

It would be nice if there was a standalone c++ module version of v8's Irregexp.

@nathansobo
Copy link
Contributor

It would be nice if there was a standalone c++ module version of v8's Irregexp.

I have lamented this as well.

@maxbrunsfeld
Copy link
Contributor Author

maxbrunsfeld commented Nov 8, 2017

Yeah, we looked into that a little bit, but concluded that Irregexp is totally coupled to V8's internals - I think it generates native code using the same JIT machinery that V8 uses for compiling JS code.

Luckily, PCRE is a pretty lightweight dependency.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants