Skip to content
This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

Michael Saboff's Review #55

Closed
msaboff opened this issue Mar 28, 2022 · 7 comments
Closed

Michael Saboff's Review #55

msaboff opened this issue Mar 28, 2022 · 7 comments

Comments

@msaboff
Copy link

msaboff commented Mar 28, 2022

Over all, it looks good.

For the Syntax Rules production ClassReservedDouble, is the long list of reserved doubled syntax characters a somewhat paranoid reservation for possible extensions without needing to add a new flag?

A possible nit question. For 22.2.2.7 Runtime Semantics: CompileAtom, Step 6 of the production Atom :: CharacterClass,
It seems to me that the sorting of Strings by descending order of string length might undermine the intent of a developer.
Consider a CharacterClass that contains a long list of strings, the developer may have ordered equal length strings within that character class by the expected match likelihood. If the sort is not stable, the sorting by length may circumvent that intended order, possibly impacting match performance.

@mathiasbynens
Copy link
Member

mathiasbynens commented Mar 28, 2022

Thanks for your review, Michael!

For the Syntax Rules production ClassReservedDouble, is the long list of reserved doubled syntax characters a somewhat paranoid reservation for possible extensions without needing to add a new flag?

Exactly. Some additional background on this was presented at the TC39 May 2021 meeting (and the preceding April 2021 Incubator Call):

We are proposing to reserve additional single and double ASCII punctuation for clarity and for possible future extensions. By mostly reserving double punctuation, most single ASCII characters can continue to be used without escaping.

[…] Some regex engines (and UTS #18) support an operator like ‘~~’ for symmetric difference.


A possible nit question. For 22.2.2.7 Runtime Semantics: CompileAtom, Step 6 of the production Atom :: CharacterClass, It seems to me that the sorting of Strings by descending order of string length might undermine the intent of a developer. Consider a CharacterClass that contains a long list of strings, the developer may have ordered equal length strings within that character class by the expected match likelihood. If the sort is not stable, the sorting by length may circumvent that intended order, possibly impacting match performance.

I see. https://github.com/tc39/proposal-regexp-set-notation#whats-the-match-order-for-character-classes-containing-strings addresses this for properties of strings specifically (where the strings don’t have an inherent order), but as you pointed out we could choose to preserve the order of equal-length strings in string literals (e.g. \q{foo|bar|baz}), although we don’t currently do so. @markusicu Any thoughts?

@mathiasbynens mathiasbynens mentioned this issue Mar 28, 2022
9 tasks
@msaboff
Copy link
Author

msaboff commented Mar 28, 2022

I see. https://github.com/tc39/proposal-regexp-set-notation#whats-the-match-order-for-character-classes-containing-strings addresses this for properties of strings specifically (where the strings don’t have an inherent order), but as you pointed out we could choose to preserve the order of equal-length strings in string literals (e.g. \q{foo|bar|baz}), although we don’t currently do so.

The Unicode property of strings case is clear that order os same length strings is not important. It is the case that you point out, \q{foo|bar|baz}. My concern is due to the discussions of Array.sort stability that you and others were involved in. It would be sad if a similar stability concern would be raised after this proposal is approved and implemented.

@markusicu
Copy link
Collaborator

I see. https://github.com/tc39/proposal-regexp-set-notation#whats-the-match-order-for-character-classes-containing-strings addresses this for properties of strings specifically (where the strings don’t have an inherent order), but as you pointed out we could choose to preserve the order of equal-length strings in string literals (e.g. \q{foo|bar|baz}), although we don’t currently do so. @markusicu Any thoughts?

The order of same-length strings should not matter. However, I expect that implementations will implement character classes with set data structures (extending from only code points to also allowing strings), which means that they won't preserve parsing order (just like they don't for code points). Therefore I would be reluctant to suggest that the matching order for a given string length is the parsing order. Specifying a stable sort in the operation that creates a matcher object might be harmless but would be misleading if the construction of the CharSet didn't preserve the parsing order.

@mathiasbynens
Copy link
Member

We discussed this during the 2023-03-29 TC39 meeting and agreed not to make any spec changes. @waldemarhorwat pointed out that today’s character classes (supporting only strings of size 1) don’t have an inherent order either (e.g. [xyz] vs. [zyx]). The same should go for strings of any size, so that character classes remain true mathematical sets. I will keep this issue open until I add an explicit FAQ entry (tomorrow).

@markusicu
Copy link
Collaborator

Right. I also suggested that implementers should be free to use sets (implementations of mathematical sets), and that for runtime optimizations they might use tries (retrieval trees).

@mathiasbynens
Copy link
Member

I tried to summarize the outcome in #58.

@mathiasbynens
Copy link
Member

Closing now that #58 is merged. Thanks, everyone!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants