
NFC Normalization of Å #8

Open
andersmelander opened this issue May 20, 2023 · 4 comments

Comments

@andersmelander

Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).

$00C5 and $212B are canonically equivalent codepoints, but their NFC normal form is $00C5, so the composition is wrong.

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).
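For reference, a small Python snippet (using the standard unicodedata module purely to illustrate what the Unicode standard requires, not PUCU itself) shows the expected results:

```python
# Illustration only: Python's unicodedata as a reference, not PUCU.
import unicodedata as ud

# Both U+00C5 and U+212B decompose (NFD) to the same sequence U+0041 U+030A.
for src in ('\u00C5', '\u212B'):
    nfd = ud.normalize('NFD', src)
    print([f'U+{ord(c):04X}' for c in nfd])   # ['U+0041', 'U+030A'] in both cases

# Composing that sequence (NFC) must yield U+00C5, never U+212B.
nfc = ud.normalize('NFC', '\u0041\u030A')
print(f'U+{ord(nfc):04X}')                    # U+00C5
```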

@andersmelander
Author

The problem, unfortunately, isn't isolated to Å.

I've now run a unit test against the test cases in the Unicode character database. The results are not good...

| Operation | Passed | Failed | Crashed |
| --- | ---: | ---: | ---: |
| Decomposition & normalization | 17,023 | 205 | ~1,846 |
| Composition | 18,782 | 292 | 0 |

The crash during decomposition is an endless loop in the normalization code. For example, try decomposing Ḕ ̄ (Latin Capital Letter E with Macron and Grave + Combining Macron, codepoint sequence $1E14 $0304). The result should be $0045 $0304 $0300 $0304.
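As a cross-check (again with Python's unicodedata as the reference normalizer, not PUCU), the expected decomposition is:

```python
# Reference check only: canonical decomposition plus canonical reordering of
# U+1E14 U+0304 must terminate and produce exactly four codepoints.
import unicodedata as ud

nfd = ud.normalize('NFD', '\u1E14\u0304')
print([f'U+{ord(c):04X}' for c in nfd])  # ['U+0045', 'U+0304', 'U+0300', 'U+0304']
```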

@andersmelander
Author

The cause of the Å problem is that PUCU fails to take equivalent codepoints into account when the PUCUUnicodeCharacterCompositionMap table is built in PUCUConvertUnicode.ResolveCompositions.

As far as I can tell, the function uses the codepoint->decomposition table to build a decomposition->codepoint mapping by hashing all the decomposition values to their codepoints. The problem here is that it assumes that if A maps to B then B must map to A. As I've demonstrated, this isn't the case for all codepoints.

I have verified that removing duplicates from PUCUUnicodeCharacterCompositionMap (keeping the entry with the lowest codepoint) solves the problem for Å (and 31 other test cases), but 260 other cases still fail the composition test.

I suspect that the correct method of solving this is to keep the equivalence state along with the decomposition data so it can be used when generating the composition table.
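To illustrate the rule (a Python sketch of how such a pair table can be derived, not the PUCU generator code): only primary composites — codepoints with a two-codepoint canonical decomposition that are neither singletons nor composition-excluded — should end up in the pair-to-codepoint map. U+212B fails that test, so U+0041 + U+030A can only map back to U+00C5.

```python
# Sketch of deriving a composition pair table; illustrative Python only.
# The key point is which codepoints get skipped when reversing the
# decomposition data.
import sys
import unicodedata as ud

def build_composition_map():
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = ud.decomposition(ch)
        # Only canonical decompositions qualify (compatibility ones carry a <tag>).
        if not decomp or decomp.startswith('<'):
            continue
        parts = [int(p, 16) for p in decomp.split()]
        # Singleton decompositions (e.g. U+212B -> U+00C5) never recompose.
        if len(parts) != 2:
            continue
        # Skip anything NFC itself would not recompose to; this round-trip test
        # catches Full_Composition_Exclusion and non-starter decompositions.
        if ud.normalize('NFC', ud.normalize('NFD', ch)) != ch:
            continue
        pairs[(parts[0], parts[1])] = cp
    return pairs

comp = build_composition_map()
print(f'U+{comp[(0x0041, 0x030A)]:04X}')  # U+00C5, not U+212B
```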

@benibela

I found another Unicode library and removed everything not related to normalization in order to make its tables smaller.

Now I have a normalization-only library: https://github.com/benibela/internettools/blob/master/data/bbnormalizeunicode.pas

You could test whether it works better in these cases.

@andersmelander
Author

Thanks, but I gave up and wrote my own implementation from scratch:
https://gitlab.com/anders.bo.melander/pascaltype2/-/blob/master/Source/PascalType.Unicode.pas?ref_type=heads#L820

It passes all 19,074 compose/decompose tests of the Unicode database.
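For anyone who wants to reproduce those numbers, a minimal sketch of driving the checks from the UCD's NormalizationTest.txt (file path assumed, and only two of the five conformance invariants shown) could look like this, again using Python's unicodedata as the normalizer under test:

```python
# Minimal conformance-driver sketch. Each data line in NormalizationTest.txt has
# five columns c1..c5 of space-separated hex codepoints; among other invariants,
# NFC(c1) must equal c2 and NFD(c1) must equal c3.
import unicodedata as ud

def parse(field):
    return ''.join(chr(int(cp, 16)) for cp in field.split())

passed = failed = 0
with open('NormalizationTest.txt', encoding='utf-8') as f:  # path assumed
    for line in f:
        line = line.split('#', 1)[0].strip()
        if not line or line.startswith('@'):
            continue  # skip comments and @Part headers
        c1, c2, c3, c4, c5 = (parse(col) for col in line.split(';')[:5])
        ok = ud.normalize('NFC', c1) == c2 and ud.normalize('NFD', c1) == c3
        passed += ok
        failed += not ok

print(passed, 'passed,', failed, 'failed')
```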
