unicode: add CategoryAliases, LC, Cn #70780

Open
rsc opened this issue Dec 11, 2024 · 11 comments

@rsc
Contributor

rsc commented Dec 11, 2024

The Unicode specification defines aliases for some of the general category names. For example, the category "L" has the alias "Letter".

The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.

To support \p{Letter}, I propose adding a new, small table to unicode:

var CategoryAliases = map[string]string{
	"Other":   "C",
	"Control": "Cc",
	// ...
	"Letter": "L",
	// ...
}

This would be auto-generated from the Unicode database like all our other tables. For Unicode 15, the table would have only 38 entries, listed below.

% grep '^gc' PropertyValueAliases.txt
gc ; C                                ; Other                            # Cc | Cf | Cn | Co | Cs
gc ; Cc                               ; Control                          ; cntrl
gc ; Cf                               ; Format
gc ; Cn                               ; Unassigned
gc ; Co                               ; Private_Use
gc ; Cs                               ; Surrogate
gc ; L                                ; Letter                           # Ll | Lm | Lo | Lt | Lu
gc ; LC                               ; Cased_Letter                     # Ll | Lt | Lu
gc ; Ll                               ; Lowercase_Letter
gc ; Lm                               ; Modifier_Letter
gc ; Lo                               ; Other_Letter
gc ; Lt                               ; Titlecase_Letter
gc ; Lu                               ; Uppercase_Letter
gc ; M                                ; Mark                             ; Combining_Mark                   # Mc | Me | Mn
gc ; Mc                               ; Spacing_Mark
gc ; Me                               ; Enclosing_Mark
gc ; Mn                               ; Nonspacing_Mark
gc ; N                                ; Number                           # Nd | Nl | No
gc ; Nd                               ; Decimal_Number                   ; digit
gc ; Nl                               ; Letter_Number
gc ; No                               ; Other_Number
gc ; P                                ; Punctuation                      ; punct                            # Pc | Pd | Pe | Pf | Pi | Po | Ps
gc ; Pc                               ; Connector_Punctuation
gc ; Pd                               ; Dash_Punctuation
gc ; Pe                               ; Close_Punctuation
gc ; Pf                               ; Final_Punctuation
gc ; Pi                               ; Initial_Punctuation
gc ; Po                               ; Other_Punctuation
gc ; Ps                               ; Open_Punctuation
gc ; S                                ; Symbol                           # Sc | Sk | Sm | So
gc ; Sc                               ; Currency_Symbol
gc ; Sk                               ; Modifier_Symbol
gc ; Sm                               ; Math_Symbol
gc ; So                               ; Other_Symbol
gc ; Z                                ; Separator                        # Zl | Zp | Zs
gc ; Zl                               ; Line_Separator
gc ; Zp                               ; Paragraph_Separator
gc ; Zs                               ; Space_Separator
%
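
As a rough illustration of how such a table could be derived from the listing above, here is a minimal sketch that parses `gc` lines in the PropertyValueAliases.txt format into an alias map. The function name `parseCategoryAliases` and the `sample` excerpt are assumptions made for this example; this is not the generator used for the real unicode tables.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseCategoryAliases (hypothetical helper) reads lines in the
// PropertyValueAliases.txt format and returns a map from each
// general-category alias to its short name.
func parseCategoryAliases(data string) map[string]string {
	aliases := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		// Drop trailing comments such as "# Ll | Lt | Lu".
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Split(line, ";")
		// Only general-category ("gc") lines are of interest here.
		if len(fields) < 3 || strings.TrimSpace(fields[0]) != "gc" {
			continue
		}
		short := strings.TrimSpace(fields[1])
		// fields[2] is the long name; any further fields are secondary aliases.
		for _, f := range fields[2:] {
			if alias := strings.TrimSpace(f); alias != "" && alias != short {
				aliases[alias] = short
			}
		}
	}
	return aliases
}

// sample is a short excerpt in the same format as the listing above,
// plus one non-gc line to show that other properties are skipped.
const sample = `
gc ; Cc                               ; Control                          ; cntrl
gc ; L                                ; Letter                           # Ll | Lm | Lo | Lt | Lu
gc ; M                                ; Mark                             ; Combining_Mark                   # Mc | Me | Mn
sc ; Latn                             ; Latin
`

func main() {
	aliases := parseCategoryAliases(sample)
	fmt.Println(aliases["Letter"], aliases["cntrl"], aliases["Combining_Mark"], len(aliases))
}
```

Secondary aliases such as cntrl are picked up because every field after the short name is treated as an alias.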

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Dec 11, 2024
@rsc
Contributor Author

rsc commented Jan 8, 2025

I implemented this, and there are a few additions. The proposal is now:

  • Add CategoryAliases as described above, but note that there are 38+4 = 42 entries, to include the secondary aliases cntrl, Combining_Mark, digit, and punct, shown in the listing above.
  • Add a new LC table (var LC and "LC": LC entry in Categories). This is a synthesized category (cased letter = Lu | Ll | Lt) that was missing before. I noticed because it has an alias but did not exist in the first place.
  • Add a new Cn table (var Cn and "Cn": Cn entry in Categories). This is also a synthesized category with an alias but which did not exist. It is all unassigned code points (no category).
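
To make the two synthesized categories concrete, the following sketch expresses them in terms of tables the unicode package already exports. The helpers `isCasedLetter` and `isUnassigned` are hypothetical names for this example and are not part of the proposal. The sketch deliberately avoids referring to the proposed `unicode.LC` and `unicode.Cn` variables, so it does not depend on which Go release is in use.

```go
package main

import (
	"fmt"
	"unicode"
)

// isCasedLetter (hypothetical helper) reports whether r is in Lu, Ll, or Lt,
// which is the definition of LC (Cased_Letter) given above.
func isCasedLetter(r rune) bool {
	return unicode.In(r, unicode.Lu, unicode.Ll, unicode.Lt)
}

// isUnassigned (hypothetical helper) reports whether r belongs to no assigned
// general category. It skips the "C" and "Cn" entries so the result does not
// depend on whether the running Go version's tables already include
// unassigned code points.
func isUnassigned(r rune) bool {
	for name, table := range unicode.Categories {
		if name == "C" || name == "Cn" {
			continue
		}
		if unicode.Is(table, r) {
			return false
		}
	}
	return true
}

func main() {
	for _, r := range []rune{'A', 'ǅ', 'ʰ', '\u0378', '\uE000'} {
		fmt.Printf("%U cased=%v unassigned=%v\n", r, isCasedLetter(r), isUnassigned(r))
	}
}
```

Here U+01C5 is a titlecase letter (Lt), U+02B0 is a modifier letter (Lm, so not cased), U+0378 is a reserved code point, and U+E000 is private use (Co), which counts as assigned.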

@rsc rsc changed the title from "proposal: unicode: add CategoryAliases" to "proposal: unicode: add CategoryAliases, LC, Cn" Jan 8, 2025
@gopherbot
Contributor

Change https://go.dev/cl/641395 mentions this issue: internal/export/unicode: add CategoryAliases, Cn, and LC

@gopherbot
Contributor

Change https://go.dev/cl/641376 mentions this issue: unicode: add CategoryAliases, Cn, LC

@gopherbot
Contributor

Change https://go.dev/cl/641377 mentions this issue: regexp/syntax: recognize category aliases like \p{Letter}

@rsc
Contributor Author

rsc commented Feb 5, 2025

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Feb 5, 2025
@willfaught
Contributor

willfaught commented Feb 7, 2025

Could there be any compatibility issues with new Unicode versions? Dropped or renamed or changed aliases?

Will regexp then use the map?

Edit: The changes to regexp are at #70781.

@rsc
Contributor Author

rsc commented Feb 12, 2025

In general, Unicode data is subject to change as Unicode changes. That said, I don't expect aliases to be deleted from the list. (We've seen them change the category of an individual code point in the past, but even that is rare.)

@rsc
Contributor Author

rsc commented Feb 13, 2025

Have all remaining concerns about this proposal been addressed?

The proposal is to add:

  • var CategoryAliases = map[string]string{...}
  • var LC = _LC // a *RangeTable
  • var Cn = _Cn // a *RangeTable
  • The C table is expanded to include unassigned code points (as it should have had from the start).

@aclements aclements moved this from Active to Likely Accept in Proposals Feb 19, 2025
@aclements
Member

Based on the discussion above, this proposal seems like a likely accept.
— aclements for the proposal review group

The proposal is to add:

// CategoryAliases maps category aliases to standard category names.
var CategoryAliases = map[string]string{...}

var LC = _LC // Cased letter (Lu, Ll, and Lt); a *RangeTable
var Cn = _Cn // Other, not assigned; a *RangeTable

The C table is expanded to include unassigned code points (as it should have had from the start).

@aclements
Member

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— aclements for the proposal review group

The proposal is to add:

// CategoryAliases maps category aliases to standard category names.
var CategoryAliases = map[string]string{...}

var LC = _LC // Cased letter (Lu, Ll, and Lt); a *RangeTable
var Cn = _Cn // Other, not assigned; a *RangeTable

The C table is expanded to include unassigned code points (as it should have had from the start).

@aclements aclements moved this from Likely Accept to Accepted in Proposals Feb 26, 2025
@aclements aclements changed the title from "proposal: unicode: add CategoryAliases, LC, Cn" to "unicode: add CategoryAliases, LC, Cn" Feb 26, 2025
@aclements aclements modified the milestones: Proposal, Backlog Feb 26, 2025
gopherbot pushed a commit to golang/text that referenced this issue Feb 27, 2025
CategoryAliases is for regexp to use, for things like \p{Letter} as an alias for \p{L}.
Cn and LC are special-case categories that were never implemented
but should have been.

For golang/go#70780.

Change-Id: I1401c1be42106a0ebecabb085c25e97485c363cf
Reviewed-on: https://go-review.googlesource.com/c/text/+/641395
Auto-Submit: Russ Cox
Reviewed-by: Marcel van Lohuizen
LUCI-TryBot-Result: Go LUCI
Reviewed-by: Ian Lance Taylor