Issue 639 Check diffs between TUP and IUP #640

macchiati · 2024-01-08T04:28:29Z

Add a test of the differences as per #639

macchiati · 2024-01-08T04:42:39Z

Raw data for #639 is at https://docs.google.com/spreadsheets/d/1mFpC03q_dBNXuVRA_qJ_tHSjmVQY8dIDC4PYRkZW1KA

| Property | Analysis (so far) |
| --- | --- |
| Bidi_Mirroring_Glyph | IUP is returning null if there is none, TUP returning the code point |
| Bidi_Paired_Bracket | TUP appears wrong, returning \x{0} character. |
| FC_NFKC_Closure | IUP is returning the default (equal code point) |
| Joining_Group | Investigate single difference: IUP Hamza_On_Heh_Goal ≠ TUP Teh_Marbuta_Goal |
| Joining_Type | 2181 differences: IUP Non_Joining ≠ TUP Transparent |
| Name | IUP is the raw name, TUP is the massaged name, eg `<control-0000>` |
| Name_Alias | Apparently TUP doesn't support: IUP NUL; NULL ≠ TUP {EMPTY} |
| Numeric_Value | Investigate further |
| Script_Extensions | IUP is "resolved" (layering on top of Script), TUP is raw |
| ISO_Comment | IUP {NULL} ≠ TUP {EMPTY} |
| Jamo_Short_Name | just for [ᄋ], IUP {NULL} ≠ TUP {EMPTY} |
| Unicode_1_Name | IUP {NULL} ≠ TUP {EMPTY} |
| Numeric_Value | IUP {NULL} ≠ TUP {EMPTY} |
| *TUP properties missing from IUP, but not important for tests/file generation: | ASCII Case_Fold_Turkish_I IdnOutput Non_Break cjkAccountingNumeric cjkCompatibilityVariant cjkIICore cjkIRG_GSource cjkIRG_HSource cjkIRG_JSource cjkIRG_KPSource cjkIRG_KSource cjkIRG_MSource cjkIRG_SSource cjkIRG_TSource cjkIRG_UKSource cjkIRG_USource cjkIRG_VSource cjkOtherNumeric cjkPrimaryNumeric cjkRSUnicode isNFC isNFD isNFKC isNFKD toNFC toNFD toNFKC toNFKD |
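
Several rows above differ only in how "no value" is represented (IUP `{NULL}` versus TUP `{EMPTY}`). As a rough illustration of how a diff could filter out that kind of noise, here is a minimal sketch. The class `PropertyValueDiff` and its methods are hypothetical and are not part of unicodetools; the actual test in this PR may compare values differently.

```java
/** Hypothetical helper for illustration only; not part of unicodetools. */
final class PropertyValueDiff {
    /** Maps null to the empty string so that {NULL} and {EMPTY} compare equal. */
    static String normalize(String value) {
        return value == null ? "" : value;
    }

    /** Returns true when the two values differ by more than null versus empty. */
    static boolean differs(String iupValue, String tupValue) {
        return !normalize(iupValue).equals(normalize(tupValue));
    }

    /** Renders a value the way the analysis table does. */
    static String show(String value) {
        if (value == null) {
            return "{NULL}";
        }
        return value.isEmpty() ? "{EMPTY}" : value;
    }

    /** Formats one difference line for a report. */
    static String describe(String property, int codePoint, String iup, String tup) {
        return String.format(
                "%s U+%04X: IUP %s != TUP %s", property, codePoint, show(iup), show(tup));
    }
}
```

With this normalization, the ISO_Comment and Unicode_1_Name rows would drop out of the report, while a real disagreement such as the Joining_Type one would still be listed.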

macchiati · 2024-01-15T00:58:01Z

Worked out the kinks and got normalization to use the newer parsed data. I would like to merge this before continuing.

eggrobin · 2024-01-15T12:28:01Z

unicodetools/src/main/java/org/unicode/props/ShimUnicodePropertyFactory.java

+import org.unicode.text.UCD.Normalizer.NormalizationFormat;
+import org.unicode.text.utility.Utility;
+
+public class ShimUnicodePropertyFactory extends UnicodeProperty.Factory {


This could use a comment explaining what it is for (as far as I can tell it makes an IUP behave like TUP for testing/diffing purposes, but it shouldn’t actually be used beyond that, right?).
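
To make the shim idea concrete, here is a minimal sketch based on the Bidi_Mirroring_Glyph row of the analysis table, where IUP returns null when there is no mirroring glyph and TUP returns the code point itself. The class `MirroringGlyphShim` and its constructor are hypothetical; this is not the code of `ShimUnicodePropertyFactory`, only an illustration of the wrapping pattern under that assumption.

```java
import java.util.function.IntFunction;

/** Hypothetical shim for illustration only; not the unicodetools implementation. */
final class MirroringGlyphShim {
    // Stands in for the raw, IUP-style lookup, which returns null when no glyph exists.
    private final IntFunction<String> raw;

    MirroringGlyphShim(IntFunction<String> raw) {
        this.raw = raw;
    }

    /** Returns the raw value, or the code point itself when the raw value is null. */
    String getValue(int codePoint) {
        String value = raw.apply(codePoint);
        return value != null ? value : new String(Character.toChars(codePoint));
    }
}
```

A wrapper like this keeps the compatibility behavior out of the underlying property, which is one reason to limit it to testing and diffing.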

eggrobin · 2024-01-15T12:28:22Z

unicodetools/src/main/java/org/unicode/text/UCD/Normalizer.java

@@ -32,18 +35,53 @@ public final class Normalizer implements Transform<String, String>, UCD_Types {
    public static final String copyright =
            "Copyright (C) 2000, IBM Corp. and others. All Rights Reserved.";

+    public enum NormalizationFormat {


Why not just NormalizationForm?

eggrobin · 2024-01-16T18:23:16Z

> IUP is the raw name, TUP is the massaged name, eg `<control-0000>`

From my experiments with #502, it also looks like IUP has # rather than the code point for CJKV ideographs etc.

macchiati · 2024-01-16T18:43:16Z

> > IUP is the raw name, TUP is the massaged name, eg `<control-0000>`
>
> From my experiments with #502, it also looks like IUP has # rather than the code point for CJKV ideographs etc.

Yes, it uses the same convention for names that the XML does. There is a method that resolves that (IndexUnicodeProperties.getName()), as well as a method that gets a sequence of names for a string.

We could rethink the property to have the fully resolved name if we want. The only reason for the # is to avoid the storage cost, and that may no longer be an issue.
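
For readers unfamiliar with the # convention, here is a minimal sketch of what resolving it could look like: the placeholder is replaced with the code point in uppercase hexadecimal, padded to at least four digits. The class `NameResolver` is hypothetical, and this is not the implementation of IndexUnicodeProperties.getName(), whose signature is not shown in this thread.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not the unicodetools implementation. */
final class NameResolver {
    /** Replaces the '#' placeholder in a raw name with the hex code point. */
    static String resolve(String rawName, int codePoint) {
        if (rawName == null) {
            return null;
        }
        // Uppercase hex, zero-padded to at least four digits.
        String hex = String.format(Locale.ROOT, "%04X", codePoint);
        return rawName.replace("#", hex);
    }
}
```

Under this assumption, a raw name such as `CJK UNIFIED IDEOGRAPH-#` for U+4E00 would resolve to `CJK UNIFIED IDEOGRAPH-4E00`, and names without a placeholder would pass through unchanged.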

Issue 639 Check diffs between TUP and IUP

fa900d8

macchiati marked this pull request as draft January 8, 2024 05:06

macchiati added 3 commits January 9, 2024 21:32

Start adding shimmed properties

8da539a

Remaining properties are not standard UCD properties.

47b305e

Working on toNFC, etc. In progress, but saving current state

437bdf7

macchiati marked this pull request as ready for review January 15, 2024 00:57

macchiati requested review from eggrobin and markusicu January 15, 2024 00:58

Update with changes handling normalization

11923af

eggrobin previously approved these changes Jan 15, 2024

Changes as per Robin's review

5fa3ea3

macchiati dismissed eggrobin’s stale review via 5fa3ea3 January 16, 2024 04:06

macchiati requested a review from eggrobin January 16, 2024 04:07

eggrobin approved these changes Jan 16, 2024

macchiati merged commit 354fb80 into unicode-org:main Jan 16, 2024

macchiati deleted the Issue-639-Check-diffs-between-TUP-and-IUP branch January 16, 2024 18:37
