Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweaks to Unihan property handling #1022

Merged
merged 15 commits into from
Jan 30, 2025

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Jan 29, 2025

  • Add kZhuang (added in 16.0), which we had missed.
  • Mark multivalued Unihan properties from 13.0 and later as multivalued, and kPrimaryNumeric as ordered. Not touching the other older Unihan properties for now.
  • Add everything after 13.0 to IndexUnicodeProperties.txt (for most properties, that is required to parse them at all, but we handle Unihan in a magic way, so they worked without that).
  • Remove files that looked like they configured the multivaluedness of properties, but did nothing.

markusicu
markusicu previously approved these changes Jan 29, 2025
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Should we also list the multi-valued

More generally, UAX38 tells us via the Delimiter attribute (N/A vs. space) whether a Unihan property is multi-valued.

@macchiati
Copy link
Member

We have API access to isMultivalued() already.

@eggrobin
Copy link
Member Author

Should we also list the multi-valued

https://www.unicode.org/reports/tr38/#kPrimaryNumeric
https://www.unicode.org/reports/tr38/#kOtherNumeric

Yes and no, respectively (there are no actually multivalued kOtherNumeric assignments). Eventually we should indeed follow UAX38 and mark as multivalued anything that could be, but for now we don’t want to make things multivalued unless they really are, see the PR description.

We have API access to isMultivalued() already.

Yes, but this just surfaces what we are configuring here.

@eggrobin
Copy link
Member Author

Wait, it looks like Multivalued.txt does not actually do anything, and instead I should be poking at IndexPropertyRegex or IndexPropertyRegexRevised…

@eggrobin
Copy link
Member Author

Probably IndexPropertyRegex.txt, the other one is not used either.

@eggrobin eggrobin marked this pull request as draft January 29, 2025 23:15
markusicu
markusicu previously approved these changes Jan 29, 2025
@markusicu
Copy link
Member

still failing CI checks...

markusicu
markusicu previously approved these changes Jan 29, 2025
@eggrobin
Copy link
Member Author

Still failing, and it somehow likes the other ones then fails on kStrange… very kStrange indeed…

@eggrobin
Copy link
Member Author

Ah, it is not in IndexUnicodeProperties…

markusicu
markusicu previously approved these changes Jan 30, 2025
@eggrobin
Copy link
Member Author

Fun! Now we find out that UnicodeProperty simply does not work with multivalued numeric properties 🙃

@eggrobin
Copy link
Member Author

I think I fixed the UnicodeProperty implementation, but now the populateHanExceptions in UCD.java is blowing up in my face so I am trying to figure out what it is actually trying to do.

So far this has led me to L2/03-094, and for all Unihan versions that we currently carry in the tools, the offending numeric values are not even there. We should eventually add older versions of Unihan, precisely because that would answer that sort of question of « what is this trying to filter out, and is it still around », but ancient versions should never go through UCD.java. So I think the whole thing is very moot.

@eggrobin
Copy link
Member Author

(Well, populateHanExceptions as a whole is just the derivation of Numeric_Value from the Han properties. But the U+5793 and U+4EAC exceptions are about what is described in this 2003 document.)

macchiati
macchiati previously approved these changes Jan 30, 2025
kIRG_UKSource ; SINGLE_VALUED ; V[0-4]-[0-9A-F]{4}
kIRG_SSource ; SINGLE_VALUED ; V[0-4]-[0-9A-F]{4}


# Unihan properties from 13.0 and later. No regexes for now.
# TODO(egg): We should automate the updating of the regexes from UAX #38.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the fields from the table would be in a machine-readable format, and the table generated from them, and our usage also.

I initially generated by dumping the table into a spreadsheet, then using formulæ to transform a bit, eg:

Property kJis1
Status Provisional
Category Other Mappings
Introduced 2
Delimiter space
Syntax [0-9]{4}
Description The JIS X 0212-1990 mapping for this ideograph in row-cell form.

=>

kJapaneseKun Status Provisional
kJapaneseKun Category Readings
kJapaneseKun Introduced 2
kJapaneseKun Delimiter space
kJapaneseKun Syntax [A-Z]+
kJapaneseKun Description The Japanese pronunciation(s) of this ideograph in the Hepburn romanization.

Then extract the delimiter and syntax for each property; but then also check the text for the ones with delimiters to see whether they were ordered or not.

However, I didn't keep up to date (obviously), so it needs a better process.

@eggrobin eggrobin dismissed stale reviews from macchiati and markusicu via 3c1ec5e January 30, 2025 01:18
@eggrobin eggrobin marked this pull request as ready for review January 30, 2025 02:47
@eggrobin eggrobin requested a review from markusicu January 30, 2025 02:47
@eggrobin eggrobin merged commit 84bf4cb into unicode-org:main Jan 30, 2025
19 of 20 checks passed
if (stringToNamedEnum != null) {
result.addAll(enumValueNames);
return result;
}
if (isMultivalued()) {
HashSet<String> valueSet = new HashSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A list hash set preserves the original order

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants