-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
avoid inconsistency between \d and [:digit:] when using /a #223
Conversation
Thanks for doing a lot of work on this. My feeling is that I don't really like making \s, \w, and \d different in this regard, and therefore I prefer to keep all three PCRE2_EXTRA_ASCII_BS{S,W,D} options as they are. There could be an additional option PCRE2_EXTRA_ASCII_DIGIT that does what you want - affects both \d and [:digit:] but is it really necessary to single out [:digit;] as a special case of POSIX classes? Or do we need separate PCRE2_EXTRA_ASCII_POSIX_{DIGIT,SPACE,WORD} options? |
A few arguments against adding SPACE and WORD:
|
Since a608946 (Additional PCRE2_EXTRA_ASCII_xxx code, 2023-02-01) PCRE2_EXTRA_ASCII_BSD could be used to restrict \d to ASCII causing the following inconsistent behaviour in UCP mode. PCRE2 version 10.43-DEV 2023-01-15 re> /\d/utf,ucp,ascii_bsd data> ٣ No match data> re> /[[:digit:]]/utf,ucp,ascii_bsd data> ٣ 0: \x{663} It has been suggested[1] that the change to match \p{Nd} when Unicode is enabled for [:digit:] might had been unintentional and a bug, as [:digit:] should be able to be POSIX compatible, so add a new flag PCRE2_EXTRA_ASCII_DIGIT to avoid changing its definition in UCP mode. [1] https://lore.kernel.org/git/CANgJU+U+xXsh9psd0z5Xjr+Se5QgdKkjQ7LUQ-PdUULSN3n4+g@mail.gmail.com/
I'm happy to leave SPACE and WORD until somebody actually wants them. You will see that I've done the merge. There is one further point: I added new internal option settings such as (?aP) for PCRE2_EXTRA_ASCII_POSIX and the other previous ASCII options. Do we need to add in something for DIGIT? I've already used (?aD) for BSD and I'd rather not change that because it matches (?aS) for BSS and (?aW) for BSW. I wondered about (?aPD) but that seems a bit clumsy. |
While not the same (?aP) can already be used for that effect:
My thinking was to punt it for now and see if that would be enough, and if eventually required, see what would make more sense then, so either:
|
Agreed. I also thought of (?ad) and disliked it. Let's see if a need arises. |
Address the inconsistency raised by the suggested change in #222 and that would also affect other users of the API.
for added flexibility PCRE2_EXTRA_ASCII_DIGIT is independent of PCRE2_EXTRA_ASCII_BSD.