Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing RE2::Set #43

Open
Exter-N opened this issue Jun 6, 2018 · 8 comments
Open

Exposing RE2::Set #43

Exter-N opened this issue Jun 6, 2018 · 8 comments

Comments

@Exter-N
Copy link
Contributor

Exter-N commented Jun 6, 2018

The RE2::Set class allows to match a string against several regular expressions in a single pass, seemingly more efficiently than piping all the regexes together :
https://github.com/google/re2/blob/f2cc1aeb5de463c45d020c446cbcb028385b49f3/re2/set.h#L21-L23

It could be exposed to JavaScript code by :

  • accepting an Array of patterns as the first parameter of the constructor ;
  • adding functionality to the matching methods to use RE2::Set to identify which one of the patterns matches (it doesn't seem to go further than that), and then use regular RE2s corresponding to the identified patterns to get more information ;
  • in .exec(), returning the index of the pattern which matched and/or the pattern itself as properties of the returned array (maybe with Symbol keys, to eliminate the risks of name collisions with future properties that could be defined by the ECMAScript spec) ;
  • returning the piped patterns in the source property, for compatibility ;
  • exposing a new sources property (or a property with a Symbol key) containing the individual patterns ;
  • either returning an array of the translated patterns in the internalSource property, or applying the same process as for the source property.

Use cases could be optimizing anything that boils down to this kind of code :

let match;
if ((match = re1.exec(str)) !== null) {
    // …
} else if ((match = re2.exec(str)) !== null) {
    // …
} else if ((match = re3.exec(str)) !== null) {
    // …
} /* potentially lots of other cases */ else {
    // …
}

For example, a HTTP router, a lexer …

What do you think about such a feature?

@uhop
Copy link
Owner

uhop commented Jun 7, 2018

Sounds useful. We can implement it as a custom RE2-specific feature.

Points to clarify:

  1. Super simple JavaScript API.
    // possible sketch
    
    // creating
    var iterable = [/1/, /2/];
    const reSet = new RE2.Set(iterable);
    // any iterable should do
    // the set is created and compiled
    // I don't think it make sense to add regular expressions to the set dynamically
    
    // using
    reSet.test(str);
    reSet.exec(str);
  2. (nice to have) A benchmark to show, if this feature is faster than:
    a) A bunch of RE2 objects.
    b) A bunch of RegExp objects.

Aside

I have a project, which was trying to generate an optimal lexer automatically: https://github.com/uhop/parser-toolkit

One optimization I was trying to do is converting simple matchers into one to increase the overall performance. For example, I have two matchers: /\d+/ and /\w+/, so one way to combine them into something useable would be like that:

/^(\d+)|(\w+)/

^ is obviously to establish an anchor, which can be done differently with a sticky bit. In reality, my chains of matchers were much longer.

The technique worked and produced a long list of possible matches, all but one of them are empty. And the only way to find out the matched group was a linear search with its n/2 complexity. It would be nice to have a way to find a match and know what match actually happened. I don't know if RE2::Set provides a solution to this problem. It sure would be nice to solve something like that.

@Exter-N
Copy link
Contributor Author

Exter-N commented Jun 9, 2018

The API I was thinking of was in my opinion even simpler, something like :

let re = new RE2("pattern", "flags"); // no change compared to the current version
re.exec("input"); // no change either, the same goes for .test, .match, .replace, .search, .split, .toString, .source, .flags, …
re.sources; // undefined

let reSet = new RE2([ "in", "put", "some other pattern",  ], "gi");
// assume reSet.lastIndex = 0; before each of the following lines
reSet.exec("input"); // [ "in", index: 0, groups: undefined, input: "input", patternIndex: 0, pattern: "in" ]
reSet.test("input"); // true
"input".match(reSet); // same as 2 lines above
"input".search(reSet); // 0
"input".split(reSet); // [ "", "", "" ]
"input".replace(reSet, () => "X"); // "XX"
"input".replace(reSet, [ () => "X", () => "Y" ]); // "XY"
reSet.flags; // "gi"
reSet.source; // "in|put|some other pattern|…"
reSet.sources; // [ "in", "put", "some other pattern", … ]
reSet.toString(); // "/in|put|some other pattern|…/gi"

@Exter-N
Copy link
Contributor Author

Exter-N commented Jun 10, 2018

As for determining the matched pattern, if I understood the RE2::Set API correctly, it returns a list of the matched patterns (but no more information, so the match needs to be re-run with the individual patterns to get position/group info). I'll have to run some tests to determine the performance cost of all this. By the way, do you want to use a benchmark framework/runner in particular ?

@uhop
Copy link
Owner

uhop commented Jun 10, 2018

By the way, do you want to use a benchmark framework/runner in particular?

Because it is an internal deal, not published anywhere, I don't really care as long as it is reliable.

@Exter-N
Copy link
Contributor Author

Exter-N commented Jun 20, 2018

Writing an adapter which enables one to write for (Local<Value> element: object) { in C++ and have the same behavior as a JS for (let element of object) { is an … interesting exercise. 🤔

Anyway, I'm progressing, step after step, when I find time for it.

@uhop
Copy link
Owner

uhop commented Jul 12, 2022

Looking into this RE2 option, it seems less useful: apparently it is used only as extension of test(). It means it cannot return matches, groups, and so on. Just a Boolean value. I am closing this ticket until we have a definitive decision and its execution plan.

@ronag
Copy link

ronag commented Sep 14, 2024

Using Set is a 40x performance improvement for our use when we need to test a string against multiple (100+) regexes.

Would suggest something like the following:

const set = new RE2Set(patterns)
for (const idx of set.test('asdasdasd')) {
  console.log(pattern[idx], 'match!')
}

@uhop uhop reopened this Sep 15, 2024
@uhop
Copy link
Owner

uhop commented Sep 15, 2024

40x is a compelling number. I guess it makes this feature back in the game.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants