-
-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweaks to Unihan property handling #1022
Conversation
eggrobin
commented
Jan 29, 2025
•
edited
Loading
edited
- Add kZhuang (added in 16.0), which we had missed.
- Mark multivalued Unihan properties from 13.0 and later as multivalued, and kPrimaryNumeric as ordered. Not touching the other older Unihan properties for now.
- Add everything after 13.0 to IndexUnicodeProperties.txt (for most properties, that is required to parse them at all, but we handle Unihan in a magic way, so they worked without that).
- Remove files that looked like they configured the multivaluedness of properties, but did nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Should we also list the multi-valued
- https://www.unicode.org/reports/tr38/#kPrimaryNumeric
- https://www.unicode.org/reports/tr38/#kOtherNumeric
More generally, UAX38 tells us via the Delimiter attribute (N/A vs. space) whether a Unihan property is multi-valued.
We have API access to isMultivalued() already. |
Yes and no, respectively (there are no actually multivalued kOtherNumeric assignments). Eventually we should indeed follow UAX38 and mark as multivalued anything that could be, but for now we don’t want to make things multivalued unless they really are, see the PR description.
Yes, but this just surfaces what we are configuring here. |
Wait, it looks like Multivalued.txt does not actually do anything, and instead I should be poking at IndexPropertyRegex or IndexPropertyRegexRevised… |
Probably IndexPropertyRegex.txt, the other one is not used either. |
…e values as multivalued" This reverts commit bacfe60.
still failing CI checks... |
Still failing, and it somehow likes the other ones then fails on kStrange… very kStrange indeed… |
Ah, it is not in IndexUnicodeProperties… |
Fun! Now we find out that UnicodeProperty simply does not work with multivalued numeric properties 🙃 |
I think I fixed the UnicodeProperty implementation, but now the populateHanExceptions in UCD.java is blowing up in my face so I am trying to figure out what it is actually trying to do. So far this has led me to L2/03-094, and for all Unihan versions that we currently carry in the tools, the offending numeric values are not even there. We should eventually add older versions of Unihan, precisely because that would answer that sort of question of « what is this trying to filter out, and is it still around », but ancient versions should never go through UCD.java. So I think the whole thing is very moot. |
kIRG_UKSource ; SINGLE_VALUED ; V[0-4]-[0-9A-F]{4} | ||
kIRG_SSource ; SINGLE_VALUED ; V[0-4]-[0-9A-F]{4} | ||
|
||
|
||
# Unihan properties from 13.0 and later. No regexes for now. | ||
# TODO(egg): We should automate the updating of the regexes from UAX #38. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally the fields from the table would be in a machine-readable format, and the table generated from them, and our usage also.
I initially generated by dumping the table into a spreadsheet, then using formulæ to transform a bit, eg:
Property | kJis1 |
---|---|
Status | Provisional |
Category | Other Mappings |
Introduced | 2 |
Delimiter | space |
Syntax | [0-9]{4} |
Description | The JIS X 0212-1990 mapping for this ideograph in row-cell form. |
=>
kJapaneseKun | Status | Provisional |
---|---|---|
kJapaneseKun | Category | Readings |
kJapaneseKun | Introduced | 2 |
kJapaneseKun | Delimiter | space |
kJapaneseKun | Syntax | [A-Z]+ |
kJapaneseKun | Description | The Japanese pronunciation(s) of this ideograph in the Hepburn romanization. |
Then extract the delimiter and syntax for each property; but then also check the text for the ones with delimiters to see whether they were ordered or not.
However, I didn't keep up to date (obviously), so it needs a better process.
if (stringToNamedEnum != null) { | ||
result.addAll(enumValueNames); | ||
return result; | ||
} | ||
if (isMultivalued()) { | ||
HashSet<String> valueSet = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A list hash set preserves the original order