
NFC Normalization of Å #8

Open
andersmelander opened this issue May 20, 2023 · 4 comments

Comments

@andersmelander

Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).

$00C5 and $212B are canonically equivalent codepoints, but their NFC normal form is $00C5, so the composition is wrong.

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).
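For reference, a small Python snippet (using the standard unicodedata module purely to illustrate what the Unicode standard requires, not PUCU itself) shows the expected results:

```python
# Illustration only: Python's unicodedata as a reference, not PUCU.
import unicodedata as ud

# Both U+00C5 and U+212B decompose (NFD) to the same sequence U+0041 U+030A.
for src in ('\u00C5', '\u212B'):
    nfd = ud.normalize('NFD', src)
    print([f'U+{ord(c):04X}' for c in nfd])   # ['U+0041', 'U+030A'] in both cases

# Composing that sequence (NFC) must yield U+00C5, never U+212B.
nfc = ud.normalize('NFC', '\u0041\u030A')
print(f'U+{ord(nfc):04X}')                    # U+00C5
```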

@andersmelander
Author

The problem, unfortunately, isn't isolated to Å.

I've now run a unit test against the test cases in the Unicode character database. The results are not good...

| Operation | Passed | Failed | Crashed |
| --- | ---: | ---: | ---: |
| Decomposition & normalization | 17,023 | 205 | ~1,846 |
| Composition | 18,782 | 292 | 0 |

The crash during decomposition is an endless loop in the normalization code. For example, try decomposing Ḕ ̄ (Latin Capital Letter E with Macron and Grave + Combining Macron, codepoint sequence $1E14 $0304). The result should be $0045 $0304 $0300 $0304.
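As a cross-check (again with Python's unicodedata as the reference normalizer, not PUCU), the expected decomposition is:

```python
# Reference check only: canonical decomposition plus canonical reordering of
# U+1E14 U+0304 must terminate and produce exactly four codepoints.
import unicodedata as ud

nfd = ud.normalize('NFD', '\u1E14\u0304')
print([f'U+{ord(c):04X}' for c in nfd])  # ['U+0045', 'U+0304', 'U+0300', 'U+0304']
```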

@andersmelander
Author

The cause of the Å problem is that PUCU fails to take equivalent codepoints into account when the PUCUUnicodeCharacterCompositionMap table is built in PUCUConvertUnicode.ResolveCompositions.

As far as I can tell, the function uses the codepoint->decomposition table to build a decomposition->codepoint mapping by hashing all the decomposition values to their codepoints. The problem here is that it assumes that if A maps to B then B must map to A. As I've demonstrated, this isn't the case for all codepoints.

I have verified that removing duplicates from PUCUUnicodeCharacterCompositionMap (keeping the entry with the lowest codepoint) solves the problem for Å (and 31 other test cases), but 260 other cases still fail the composition test.

I suspect that the correct method of solving this is to keep the equivalence state along with the decomposition data so it can be used when generating the composition table.
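To illustrate the rule (a Python sketch of how such a pair table can be derived, not the PUCU generator code): only primary composites — codepoints with a two-codepoint canonical decomposition that are neither singletons nor composition-excluded — should end up in the pair-to-codepoint map. U+212B fails that test, so U+0041 + U+030A can only map back to U+00C5.

```python
# Sketch of deriving a composition pair table; illustrative Python only.
# The key point is which codepoints get skipped when reversing the
# decomposition data.
import sys
import unicodedata as ud

def build_composition_map():
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = ud.decomposition(ch)
        # Only canonical decompositions qualify (compatibility ones carry a <tag>).
        if not decomp or decomp.startswith('<'):
            continue
        parts = [int(p, 16) for p in decomp.split()]
        # Singleton decompositions (e.g. U+212B -> U+00C5) never recompose.
        if len(parts) != 2:
            continue
        # Skip anything NFC itself would not recompose to; this round-trip test
        # catches Full_Composition_Exclusion and non-starter decompositions.
        if ud.normalize('NFC', ud.normalize('NFD', ch)) != ch:
            continue
        pairs[(parts[0], parts[1])] = cp
    return pairs

comp = build_composition_map()
print(f'U+{comp[(0x0041, 0x030A)]:04X}')  # U+00C5, not U+212B
```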

@benibela

I found another Unicode library and removed everything not related to normalization in order to make its tables smaller.

Now I have a normalization-only library: https://github.com/benibela/internettools/blob/master/data/bbnormalizeunicode.pas

You could test whether it works better in these cases.

@andersmelander
Author

Thanks, but I gave up and wrote my own implementation from scratch:
https://gitlab.com/anders.bo.melander/pascaltype2/-/blob/master/Source/PascalType.Unicode.pas?ref_type=heads#L820

It passes all 19,074 compose/decompose tests of the Unicode database.
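For anyone who wants to reproduce those numbers, a minimal sketch of driving the checks from the UCD's NormalizationTest.txt (file path assumed, and only two of the five conformance invariants shown) could look like this, again using Python's unicodedata as the normalizer under test:

```python
# Minimal conformance-driver sketch. Each data line in NormalizationTest.txt has
# five columns c1..c5 of space-separated hex codepoints; among other invariants,
# NFC(c1) must equal c2 and NFD(c1) must equal c3.
import unicodedata as ud

def parse(field):
    return ''.join(chr(int(cp, 16)) for cp in field.split())

passed = failed = 0
with open('NormalizationTest.txt', encoding='utf-8') as f:  # path assumed
    for line in f:
        line = line.split('#', 1)[0].strip()
        if not line or line.startswith('@'):
            continue  # skip comments and @Part headers
        c1, c2, c3, c4, c5 = (parse(col) for col in line.split(';')[:5])
        ok = ud.normalize('NFC', c1) == c2 and ud.normalize('NFD', c1) == c3
        passed += ok
        failed += not ok

print(passed, 'passed,', failed, 'failed')
```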
