Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid inconsistency between \d and [:digit:] when using /a #223

Merged
merged 1 commit into from
Apr 9, 2023

Conversation

carenas
Copy link
Contributor

@carenas carenas commented Apr 6, 2023

Address the inconsistency raised by the suggested change in #222 and that would also affect other users of the API.

for added flexibility PCRE2_EXTRA_ASCII_DIGIT is independent of PCRE2_EXTRA_ASCII_BSD.

@PhilipHazel
Copy link
Collaborator

Thanks for doing a lot of work on this. My feeling is that I don't really like making \s, \w, and \d different in this regard, and therefore I prefer to keep all three PCRE2_EXTRA_ASCII_BS{S,W,D} options as they are. There could be an additional option PCRE2_EXTRA_ASCII_DIGIT that does what you want - affects both \d and [:digit:] but is it really necessary to single out [:digit;] as a special case of POSIX classes? Or do we need separate PCRE2_EXTRA_ASCII_POSIX_{DIGIT,SPACE,WORD} options?

@carenas
Copy link
Contributor Author

carenas commented Apr 7, 2023

do we need separate PCRE2_EXTRA_ASCII_POSIX_{DIGIT,SPACE,WORD} options?

A few arguments against adding SPACE and WORD:

  • they are already defined differently in UCP mode by using different Unicode classes, so for example \w != [:word:] already.
  • reverting them back to ASCII can already be done with PCRE2_EXTRA_ASCII_POSIX.
  • I don't see a use case that wouldn't be supported already by either not setting PCRE2_UCP or the current flags, if one is found they could be added later.
  • PCRE2_EXTRA_ASCII_DIGIT would need a separate flag anyway if the non POSIX [:digit:] needs a bugfix.

Since a608946 (Additional PCRE2_EXTRA_ASCII_xxx code, 2023-02-01)
PCRE2_EXTRA_ASCII_BSD could be used to restrict \d to ASCII causing
the following inconsistent behaviour in UCP mode.

  PCRE2 version 10.43-DEV 2023-01-15
    re> /\d/utf,ucp,ascii_bsd
  data> ٣
  No match
  data>
  re> /[[:digit:]]/utf,ucp,ascii_bsd
  data> ٣
    0: \x{663}

It has been suggested[1] that the change to match \p{Nd} when Unicode
is enabled for [:digit:] might had been unintentional and a bug, as
[:digit:] should be able to be POSIX compatible, so add a new flag
PCRE2_EXTRA_ASCII_DIGIT to avoid changing its definition in UCP mode.

[1] https://lore.kernel.org/git/CANgJU+U+xXsh9psd0z5Xjr+Se5QgdKkjQ7LUQ-PdUULSN3n4+g@mail.gmail.com/
@carenas carenas marked this pull request as ready for review April 8, 2023 01:49
@PhilipHazel PhilipHazel merged commit 6454934 into PCRE2Project:master Apr 9, 2023
@PhilipHazel
Copy link
Collaborator

I'm happy to leave SPACE and WORD until somebody actually wants them. You will see that I've done the merge. There is one further point: I added new internal option settings such as (?aP) for PCRE2_EXTRA_ASCII_POSIX and the other previous ASCII options. Do we need to add in something for DIGIT? I've already used (?aD) for BSD and I'd rather not change that because it matches (?aS) for BSS and (?aW) for BSW. I wondered about (?aPD) but that seems a bit clumsy.

@carenas
Copy link
Contributor Author

carenas commented Apr 9, 2023

Do we need to add in something for DIGIT? I've already used (?aD) for BSD and I'd rather not change that because it matches (?aS) for BSS and (?aW) for BSW. I wondered about (?aPD) but that seems a bit clumsy.

While not the same (?aP) can already be used for that effect:

PCRE2 version 10.43-DEV 2023-01-15
  re> /(?aP)[[:digit:]](?-aP)[[:digit:]]/B,ucp
------------------------------------------------------------------
        Bra
        [0-9]
        [\p{Nd}]
        Ket
        End
------------------------------------------------------------------

My thinking was to punt it for now and see if that would be enough, and if eventually required, see what would make more sense then, so either:

  • reuse (?aD)
  • add something like (?ad) which I agree seems as clumsy

@PhilipHazel
Copy link
Collaborator

Agreed. I also thought of (?ad) and disliked it. Let's see if a need arises.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants