Skip to content
This repository has been archived by the owner on Jun 7, 2022. It is now read-only.

RegExp Match Indices Stage-3 performance mitigation #47

Closed
rbuckton opened this issue Nov 19, 2020 · 8 comments · Fixed by #49
Closed

RegExp Match Indices Stage-3 performance mitigation #47

rbuckton opened this issue Nov 19, 2020 · 8 comments · Fixed by #49

Comments

@rbuckton
Copy link
Collaborator

rbuckton commented Nov 19, 2020

Options:

  1. Introduce a flag to opt-in to adding .indices to the result:
    • I to indicate indices (upper-case I does not conflict with lower-case i).
    • d since both i is already used and n means something else in other RegExp engines.
      • NOTE: I intend to propose n at a future meeting.
    • o to indicate offsets, and rename the .indices property to .offsets.
  2. Introduce a new method to RegExp, such as .execIndices.
    • Indicies would not be available to String.prototype.match or String.prototype.matchAll unless we extend String as well.
    • If we add matchIndices and matchIndicesAll to String, we further complicate RegExp subclassing by either needing
      a @@matchIndices and @@matchIndicesAll or needing to extend the existing @@match and @@matchAll.
  3. Go back to passing a callback or options bag.
    • As with (2), this is a subclassing hazard due to the need to pass the callback or options bag through @@match and @@matchAll.
  4. Continue with current implementations that are executing the RegExp twice.
    • This would be unfortunate for users who want to leverage indices, as they are forced to pay the cost of double execution. However, this is still far cheaper than the complex logic needed to calculate capture group offsets without this feature.
  5. Wait to see if implementation can devise an alternative solution that does not incur the performance penalty.
    • I will leave it to implementors to discuss whether they believe there are other avenues to explore.
  6. Something we haven't discussed?
@rbuckton
Copy link
Collaborator Author

rbuckton commented Nov 19, 2020

Given that RegExp match indices is at Stage 3, (1) is the simplest and cheapest solution that doesn't require an overhaul of the entire specification. The flag would be easily accessible to RegExpBuiltinExec and only requires several small changes:

  • Adding the flag to the grammar
  • Add a property for it to RegExp
  • Handle the flag in the RegExp constructor
  • Read the flag in RegExpBuiltinExec

To me (2) and (3) seem like a non-starters due to the complexity of the change as well as the related subclassing hazards. (4) also seems acceptable to me, if unfortunate, as I would gladly pay the penalty for re-execution of the RegExp to leverage a capability that doesn't currently exist.

I'm option to other options (6) if there are any that make sense given the current stage of the proposal.

If we choose to go with (1), we still need to bikeshed the flag. Some I've discussed this with are unhappy with the idea of an upper-case RegExp flag, so I may not be the top choice. I indicates that o would likely require changing the property name in the spec from indices to offsets. While that's not necessarily the case, it would be confusing to introduce a flag like o but have it create an indices property (unless o means that .exec returns an array consisting of [start, end] tuples instead of strings, which would again be a major change to the proposal).

If I were to use ranked-choice voting, my top 3 would be:

  1. Option (1) with the d flag.
  2. Option (2) with the I flag.
  3. Option (4), with the chance that option (5) becomes viable and implementations can switch to a more optimized solution in the future.

@ljharb
Copy link
Member

ljharb commented Nov 19, 2020

Using I will be very confusing and hard to google; i would vastly prefer using a distinct lowercase letter if a flag is the chosen path.

@gibson042
Copy link

"d" and "o" flags both already exist for Perl regular expressions, although even their own documentation denigrates them severely: https://perldoc.perl.org/perlre#Modifiers . None of the flag options seem great to me, though. I would personally be fine with the stay-the-course (4), and unhappy with the symmetry-breaking (2) and (3).

@msaboff
Copy link
Collaborator

msaboff commented Nov 19, 2020

"d" and "o" flags both already exist for Perl regular expressions, although even their own documentation denigrates them severely: https://perldoc.perl.org/perlre#Modifiers . None of the flag options seem great to me, though. I would personally be fine with the stay-the-course (4), and unhappy with the symmetry-breaking (2) and (3).

Given that Perl perl documentation denigrates both the /d and /o flags, that eliminates any barriers for using them as JavaScript RegExp flags to me. We certainly wouldn't add to JavaScript either /d, the the old, problematic, pre-5.14 Default character set behavior or /o, the compile once flag, as that is an implementation detail.

Option (2) has ergonomic issues for some fonts and (3) would eliminate the use in RegExp literals.

The (4) stay-the-course will likely hinder the uptake of this feature.

@rbuckton
Copy link
Collaborator Author

rbuckton commented Nov 30, 2020

@ljharb: i is already hard to google without using + or "", as Google by default treats it as a noise word and drops it from the search. However, I'd venture to guess that the cross section of individuals who (a) search for regex i flag and (b) know the ins and outs of search engine syntax is probably fairly large so I'll concur with your point.

@msaboff: A native implementation of (4) is likely still faster than a polyfill version of (4) like in https://github.com/rbuckton/regexp-match-indices, and better than the alternative of not having the feature when you need it. Option (4) at least provides a path for the possibility of a future where (5) might be possible without deprecating a flag.

Overall though, I'm leaning towards (1) using the d flag as that requires the least changes to the specification.

@msaboff
Copy link
Collaborator

msaboff commented Nov 30, 2020

Overall though, I'm leaning towards d as that requires the least changes to the specification.

I assume you mean Option (1) using d as the flag based on your prior ranking of alternatives.

@rbuckton
Copy link
Collaborator Author

@msaboff: Yes, I am talking about (1) using the d flag. I'll update my comment.

@rbuckton
Copy link
Collaborator Author

I've created #49 as a PR against this proposal to get a better idea of what kind of changes we would need to make in the specification to use Option (1) with a d flag, should we choose that mitigation.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants