Issue 639 Check diffs between TUP and IUP #640

Merged

macchiati
Member

Add a test of the differences as per #639

@macchiati
Member Author

macchiati commented Jan 8, 2024

Raw data for #639 is at https://docs.google.com/spreadsheets/d/1mFpC03q_dBNXuVRA_qJ_tHSjmVQY8dIDC4PYRkZW1KA

| Property | Analysis (so far) |
|---|---|
| Bidi_Mirroring_Glyph | IUP is returning null if there is none, TUP returning the code point |
| Bidi_Paired_Bracket | TUP appears wrong, returning \x{0} character. |
| FC_NFKC_Closure | IUP is returning the default (equal code point) |
| Joining_Group | Investigate single difference: IUP Hamza_On_Heh_Goal ≠ TUP Teh_Marbuta_Goal |
| Joining_Type | 2181 differences: IUP Non_Joining ≠ TUP Transparent |
| Name | IUP is the raw name, TUP is the massaged name, eg <control-0000> |
| Name_Alias | Apparently TUP doesn't support: IUP NUL; NULL ≠ TUP {EMPTY} |
| Numeric_Value | Investigate further |
| Script_Extensions | IUP is "resolved" (layering on top of Script), TUP is raw |
| ISO_Comment | IUP {NULL} ≠ TUP {EMPTY} |
| Jamo_Short_Name | just for [ᄋ], IUP {NULL} ≠ TUP {EMPTY} |
| Unicode_1_Name | IUP {NULL} ≠ TUP {EMPTY} |
| Numeric_Value | IUP {NULL} ≠ TUP {EMPTY} |

*TUP properties missing from IUP, but not important for tests/file generation: ASCII Case_Fold_Turkish_I IdnOutput Non_Break cjkAccountingNumeric cjkCompatibilityVariant cjkIICore cjkIRG_GSource cjkIRG_HSource cjkIRG_JSource cjkIRG_KPSource cjkIRG_KSource cjkIRG_MSource cjkIRG_SSource cjkIRG_TSource cjkIRG_UKSource cjkIRG_USource cjkIRG_VSource cjkOtherNumeric cjkPrimaryNumeric cjkRSUnicode isNFC isNFD isNFKC isNFKD toNFC toNFD toNFKC toNFKD
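
For readers who want to reproduce this kind of analysis, here is a minimal sketch of a per-code-point diff that reports `{NULL}` versus `{EMPTY}` disagreements separately from genuine value mismatches. It is not the test added in this PR: `PropertyDiffSketch`, `Difference`, `Kind`, and the `IntFunction<String>` stand-ins for the two property sources are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.IntFunction;

/** Hypothetical helper (not part of unicodetools): diffs two per-code-point value sources. */
final class PropertyDiffSketch {

    /** How the two sources disagree for one code point. */
    enum Kind { NULL_VS_EMPTY, VALUE_MISMATCH }

    /** One reported difference. */
    static final class Difference {
        final int codePoint;
        final String iupValue;
        final String tupValue;
        final Kind kind;

        Difference(int codePoint, String iupValue, String tupValue, Kind kind) {
            this.codePoint = codePoint;
            this.iupValue = iupValue;
            this.tupValue = tupValue;
            this.kind = kind;
        }
    }

    /** Treats null and "" as "no value" so that such differences can be bucketed together. */
    static boolean isAbsent(String value) {
        return value == null || value.isEmpty();
    }

    /** Compares the two sources over the inclusive range [first, last]. */
    static List<Difference> diff(
            IntFunction<String> iup, IntFunction<String> tup, int first, int last) {
        List<Difference> result = new ArrayList<>();
        for (int cp = first; cp <= last; ++cp) {
            String a = iup.apply(cp);
            String b = tup.apply(cp);
            if (Objects.equals(a, b)) {
                continue; // identical, nothing to report
            }
            Kind kind = isAbsent(a) && isAbsent(b) ? Kind.NULL_VS_EMPTY : Kind.VALUE_MISMATCH;
            result.add(new Difference(cp, a, b, kind));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the two sources; real code would query the property objects.
        IntFunction<String> iup = cp -> cp == 'A' ? null : "x";
        IntFunction<String> tup = cp -> cp == 'A' ? "" : (cp == 'B' ? "y" : "x");
        for (Difference d : diff(iup, tup, 'A', 'C')) {
            System.out.printf(
                    "U+%04X %s: IUP=%s TUP=%s%n", d.codePoint, d.kind, d.iupValue, d.tupValue);
        }
    }
}
```

Bucketing the null-versus-empty cases keeps rows such as ISO_Comment and Unicode_1_Name from drowning out the differences that need investigation, such as the single Joining_Group mismatch.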

@macchiati macchiati marked this pull request as draft January 8, 2024 05:06
@macchiati macchiati marked this pull request as ready for review January 15, 2024 00:57
@macchiati
Member Author

Worked out the kinks and got normalization to use the newer parsed data. I would like to merge this before continuing.

eggrobin
eggrobin previously approved these changes Jan 15, 2024
import org.unicode.text.UCD.Normalizer.NormalizationFormat;
import org.unicode.text.utility.Utility;

public class ShimUnicodePropertyFactory extends UnicodeProperty.Factory {
Member

This could use a comment explaining what it is for (as far as I can tell it makes an IUP behave like TUP for testing/diffing purposes, but it shouldn’t actually be used beyond that, right?).

Member Author

done
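
To illustrate the idea of a shim that makes one source mimic the other, here is a minimal sketch based on the Bidi_Mirroring_Glyph difference noted earlier (IUP returns null where TUP returns the code point itself). It does not reproduce `ShimUnicodePropertyFactory`; `ShimSketch` and `nullToCodePoint` are hypothetical names, and the property source is modeled as a plain `IntFunction<String>`.

```java
import java.util.function.IntFunction;

/** Hypothetical illustration of a diffing shim; not the class added in this PR. */
final class ShimSketch {

    /**
     * Wraps a source that returns null for "no value" so that it returns the
     * code point itself instead, matching the other source's convention.
     */
    static IntFunction<String> nullToCodePoint(IntFunction<String> raw) {
        return cp -> {
            String value = raw.apply(cp);
            return value != null ? value : new String(Character.toChars(cp));
        };
    }

    public static void main(String[] args) {
        // Toy source: only '(' has a mirrored glyph.
        IntFunction<String> raw = cp -> cp == '(' ? ")" : null;
        IntFunction<String> shimmed = nullToCodePoint(raw);
        System.out.println(shimmed.apply('(')); // value passed through unchanged
        System.out.println(shimmed.apply('A')); // null replaced by the code point
    }
}
```

Keeping the adjustment in a wrapper like this leaves the underlying source untouched, which fits the reviewer's point that such a shim is meant only for testing and diffing.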

@@ -32,18 +35,53 @@ public final class Normalizer implements Transform<String, String>, UCD_Types {
public static final String copyright =
"Copyright (C) 2000, IBM Corp. and others. All Rights Reserved.";

public enum NormalizationFormat {
Member

Why not just NormalizationForm?

@eggrobin
Member

> IUP is the raw name, TUP is the massaged name, eg <control-0000>

From my experiments with #502, it also looks like IUP has # rather than the code point for CJKV ideographs etc.

@macchiati macchiati merged commit 354fb80 into unicode-org:main Jan 16, 2024
@macchiati macchiati deleted the Issue-639-Check-diffs-between-TUP-and-IUP branch January 16, 2024 18:37
@macchiati
Member Author

macchiati commented Jan 16, 2024

> > IUP is the raw name, TUP is the massaged name, eg <control-0000>
>
> From my experiments with #502, it also looks like IUP has # rather than the code point for CJKV ideographs etc.

Yes, it uses the same convention for names that the XML does. There is a method that resolves that (IndexUnicodeProperties.getName()), and also a method that gets a sequence of names for a string.

We could rethink the property to have the fully resolved name if we want. The only reason for the # is to avoid the storage cost, but that may no longer be an issue.
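
As a rough illustration of the `#` convention discussed here, the following sketch expands a raw name by substituting the code point in uppercase hex, padded to at least four digits. It is an assumption about how such a resolution could look, not the implementation of `IndexUnicodeProperties.getName()`; `NameResolverSketch` and `resolveName` are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical illustration of expanding "#" in a raw character name. */
final class NameResolverSketch {

    /** Replaces "#" in the raw name with the code point in hex (at least four digits). */
    static String resolveName(int codePoint, String rawName) {
        if (rawName == null) {
            return null; // no name stored for this code point
        }
        String hex = String.format(Locale.ROOT, "%04X", codePoint);
        return rawName.replace("#", hex);
    }

    public static void main(String[] args) {
        System.out.println(resolveName(0x4E00, "CJK UNIFIED IDEOGRAPH-#"));
        System.out.println(resolveName(0x0041, "LATIN CAPITAL LETTER A"));
    }
}
```

Storing one shared pattern per range and expanding it on demand is what saves the storage mentioned above; storing fully resolved names would trade that saving for simpler lookups.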
