Tweaks to Unihan property handling #1022

eggrobin · 2025-01-29T22:34:17Z

Add kZhuang (added in 16.0), which we had missed.
Mark multivalued Unihan properties from 13.0 and later as multivalued, and kPrimaryNumeric as ordered. Not touching the other older Unihan properties for now.
Add everything after 13.0 to IndexUnicodeProperties.txt (for most properties, that is required to parse them at all, but we handle Unihan in a magic way, so they worked without that).
Remove files that looked like they configured the multivaluedness of properties, but did nothing.

markusicu

lgtm

Should we also list the multi-valued kPrimaryNumeric and kOtherNumeric?

More generally, UAX38 tells us via the Delimiter attribute (N/A vs. space) whether a Unihan property is multi-valued.
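The Delimiter rule described above could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical and are not part of unicodetools, only the two Delimiter values mentioned here ("N/A" and "space") are handled, and the sample value string is invented.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives cardinality from a UAX #38 Delimiter field. */
final class DelimiterCardinality {
    /** "N/A" means single-valued; a real delimiter such as "space" means multi-valued. */
    static boolean isMultivalued(String delimiterField) {
        return !delimiterField.trim().equalsIgnoreCase("N/A");
    }

    /** Splits a raw property value according to the Delimiter field. */
    static List<String> splitValue(String delimiterField, String rawValue) {
        if (!isMultivalued(delimiterField)) {
            return List.of(rawValue);
        }
        // Assumption: only the "space" delimiter is handled in this sketch.
        return Arrays.asList(rawValue.trim().split(" +"));
    }

    public static void main(String[] args) {
        System.out.println(isMultivalued("N/A"));
        System.out.println(isMultivalued("space"));
        // "ONE TWO" is an invented sample, not real Unihan data.
        System.out.println(splitValue("space", "ONE TWO"));
    }
}
```

Note that this says nothing about whether a property's values are ordered; the Delimiter field alone does not carry that information.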

macchiati · 2025-01-29T23:04:23Z

We have API access to isMultivalued() already.

eggrobin · 2025-01-29T23:08:33Z

Should we also list the multi-valued

https://www.unicode.org/reports/tr38/#kPrimaryNumeric
https://www.unicode.org/reports/tr38/#kOtherNumeric

Yes and no, respectively (no kOtherNumeric assignment is actually multivalued). Eventually we should indeed follow UAX38 and mark as multivalued anything that could be, but for now we don’t want to make things multivalued unless they really are; see the PR description.

We have API access to isMultivalued() already.

Yes, but this just surfaces what we are configuring here.

eggrobin · 2025-01-29T23:14:29Z

Wait, it looks like Multivalued.txt does not actually do anything, and instead I should be poking at IndexPropertyRegex or IndexPropertyRegexRevised…

eggrobin · 2025-01-29T23:15:17Z

Probably IndexPropertyRegex.txt, the other one is not used either.

markusicu · 2025-01-29T23:42:02Z

still failing CI checks...

eggrobin · 2025-01-29T23:55:18Z

Still failing, and it somehow likes the other ones then fails on kStrange… very kStrange indeed…

eggrobin · 2025-01-30T00:03:42Z

Ah, it is not in IndexUnicodeProperties…

eggrobin · 2025-01-30T00:15:44Z

Fun! Now we find out that UnicodeProperty simply does not work with multivalued numeric properties 🙃

eggrobin · 2025-01-30T00:53:02Z

I think I fixed the UnicodeProperty implementation, but now the populateHanExceptions in UCD.java is blowing up in my face so I am trying to figure out what it is actually trying to do.

So far this has led me to L2/03-094, and for all Unihan versions that we currently carry in the tools, the offending numeric values are not even there. We should eventually add older versions of Unihan, precisely because that would answer that sort of question of « what is this trying to filter out, and is it still around », but ancient versions should never go through UCD.java. So I think the whole thing is very moot.

eggrobin · 2025-01-30T00:54:58Z

(Well, populateHanExceptions as a whole is just the derivation of Numeric_Value from the Han properties. But the U+5793 and U+4EAC exceptions are about what is described in this 2003 document.)

macchiati · 2025-01-30T00:58:10Z

unicodetools/src/main/resources/org/unicode/props/IndexPropertyRegex.txt

 kIRG_UKSource ;                SINGLE_VALUED ;                V[0-4]-[0-9A-F]{4}
 kIRG_SSource ;                SINGLE_VALUED ;                V[0-4]-[0-9A-F]{4}

-
+# Unihan properties from 13.0 and later.  No regexes for now.
+# TODO(egg): We should automate the updating of the regexes from UAX #38.


Ideally the fields from the table would be in a machine-readable format, and the table generated from them, and our usage also.

I initially generated the list by dumping the table into a spreadsheet, then using formulæ to transform it a bit, e.g.:

Property	kJis1
Status	Provisional
Category	Other Mappings
Introduced	2
Delimiter	space
Syntax	[0-9]{4}
Description	The JIS X 0212-1990 mapping for this ideograph in row-cell form.

=>

kJapaneseKun	Status	Provisional
kJapaneseKun	Category	Readings
kJapaneseKun	Introduced	2
kJapaneseKun	Delimiter	space
kJapaneseKun	Syntax	[A-Z]+
kJapaneseKun	Description	The Japanese pronunciation(s) of this ideograph in the Hepburn romanization.

Then extract the delimiter and syntax for each property; but then also check the text for the ones with delimiters to see whether they were ordered or not.

However, I didn't keep it up to date (obviously), so it needs a better process.
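As a rough sketch of what an automated version of that extraction might look like, the following groups flattened "property, field, value" rows (tab-separated, as in the spreadsheet output above) and emits one index-style line per property. Everything here is hypothetical: the class name, the tab-separated input assumption, and in particular the MULTI_VALUED keyword, which is assumed by analogy with the SINGLE_VALUED keyword visible in IndexPropertyRegex.txt.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turns flattened UAX #38 field rows into index-style lines. */
final class Uax38FieldTable {
    /** Groups "property<TAB>field<TAB>value" rows into property -> (field -> value). */
    static Map<String, Map<String, String>> parse(List<String> rows) {
        Map<String, Map<String, String>> table = new LinkedHashMap<>();
        for (String row : rows) {
            String[] parts = row.split("\t", 3);
            if (parts.length < 3) {
                continue; // skip blank or malformed rows
            }
            table.computeIfAbsent(parts[0], k -> new LinkedHashMap<>()).put(parts[1], parts[2]);
        }
        return table;
    }

    /** Formats one line; a missing Delimiter field is treated as "N/A" (an assumption). */
    static String indexLine(String property, Map<String, String> fields) {
        boolean multi = !"N/A".equals(fields.getOrDefault("Delimiter", "N/A"));
        // MULTI_VALUED is an assumed keyword, not confirmed by this thread.
        String cardinality = multi ? "MULTI_VALUED" : "SINGLE_VALUED";
        return property + " ; " + cardinality + " ; " + fields.get("Syntax");
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "kJapaneseKun\tDelimiter\tspace",
                "kJapaneseKun\tSyntax\t[A-Z]+");
        parse(rows).forEach((property, fields) -> System.out.println(indexLine(property, fields)));
    }
}
```

Whether a multi-valued property is ordered still has to be determined by reading the description text, so a sketch like this would cover only the delimiter and syntax part of the process.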

macchiati · 2025-01-30T05:11:29Z

unicodetools/src/main/java/org/unicode/props/IndexUnicodeProperties.java

            if (stringToNamedEnum != null) {
                result.addAll(enumValueNames);
                return result;
            }
+            if (isMultivalued()) {
+                HashSet<String> valueSet = new HashSet<>();


A LinkedHashSet preserves the original order.
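To illustrate the ordering point with the standard library: java.util.HashSet makes no guarantee about iteration order, whereas java.util.LinkedHashSet iterates in insertion order. The class and method below are a hypothetical standalone sketch, not the code under review.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: de-duplicates values while keeping first-seen order. */
final class OrderedValues {
    static List<String> distinctInOrder(List<String> values) {
        // LinkedHashSet drops duplicates but iterates in insertion order.
        Set<String> seen = new LinkedHashSet<>(values);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(distinctInOrder(List.of("b", "a", "b", "c")));
    }
}
```

With a plain HashSet the same values would still be de-duplicated, but the order in which they come back would be unspecified.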

eggrobin added 3 commits January 29, 2025 23:25

kZhuang

3e3cb5e

Mark multivalued Unihan properties that actually have multiple values as multivalued

bacfe60

GenerateEnums

55f4112

eggrobin requested review from macchiati and markusicu January 29, 2025 22:34

markusicu previously approved these changes Jan 29, 2025

eggrobin marked this pull request as draft January 29, 2025 23:15

eggrobin added 3 commits January 30, 2025 00:16

Revert "Mark multivalued Unihan properties that actually have multiple values as multivalued"

This reverts commit bacfe60.

9104df0

Update the correct file and remove files that do nothing

f6b774f

GenerateEnums

a436775

eggrobin dismissed markusicu’s stale review via a436775 January 29, 2025 23:34

eggrobin requested a review from markusicu January 29, 2025 23:35

eggrobin added 2 commits January 30, 2025 00:35

space

f0afb41

Throw semicolons at the problem

ed88e6c

markusicu previously approved these changes Jan 29, 2025

meow

dcb016e

eggrobin dismissed markusicu’s stale review via dcb016e January 29, 2025 23:47

markusicu previously approved these changes Jan 29, 2025

Add to index

55a450a

eggrobin dismissed markusicu’s stale review via 55a450a January 30, 2025 00:09

markusicu previously approved these changes Jan 30, 2025

macchiati previously approved these changes Jan 30, 2025

Somehow the code in UCD.java becomes a little bit cleaner

3c1ec5e

eggrobin dismissed stale reviews from macchiati and markusicu via 3c1ec5e January 30, 2025 01:18

eggrobin added 4 commits January 30, 2025 02:49

These raw maps are awful

38bb70c

Name collision

0f51c98

… just remove the dead code

f539660

spotless

52904b2

eggrobin marked this pull request as ready for review January 30, 2025 02:47

eggrobin requested a review from markusicu January 30, 2025 02:47

markusicu approved these changes Jan 30, 2025

eggrobin merged commit 84bf4cb into unicode-org:main Jan 30, 2025
19 of 20 checks passed

macchiati reviewed Jan 30, 2025
