NFC Normalization of Å #8
Comments
The problem, unfortunately, isn't isolated to Å. I've now run a unit test against the test cases in the Unicode character database. The results are not good...
The crash on decomposition is an endless loop in the normalization routine. For example, try decomposing Ḕ ̄ (Latin Capital Letter E with Macron and Grave + Combining Macron, codepoint sequence
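For reference, here is a minimal sketch of a decomposition routine that terminates on the input named above. It is not PUCU's code, and the cause of the endless loop in PUCU is not stated in this thread. The sketch leans on Python's standard `unicodedata` module; `decompose` and `canonical_order` are hypothetical helper names, and the characters are looked up by name rather than by codepoint. The reordering pass swaps two adjacent marks only when the first has a strictly greater combining class, so marks of equal class are never swapped back and forth.

```python
import unicodedata


def decompose(text):
    """Recursively expand canonical decompositions (hypothetical helper).

    Compatibility mappings (those starting with a "<tag>") are left alone.
    Hangul syllables are not handled in this sketch.
    """
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        if mapping and not mapping.startswith("<"):
            expanded = "".join(chr(int(h, 16)) for h in mapping.split())
            out.extend(decompose(expanded))
        else:
            out.append(ch)
    return out


def canonical_order(chars):
    """Bubble combining marks into canonical order (hypothetical helper)."""
    chars = list(chars)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            # Swap only on a strict inequality between non-zero classes;
            # equal classes keep their order, so the loop must terminate.
            if b != 0 and a > b:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)


sample = (unicodedata.lookup("LATIN CAPITAL LETTER E WITH MACRON AND GRAVE")
          + unicodedata.lookup("COMBINING MACRON"))
result = canonical_order(decompose(sample))
print(" ".join(f"${ord(c):04X}" for c in result))
```

The result can be compared against `unicodedata.normalize("NFD", sample)` as a sanity check.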
The cause of the Å problem is that PUCU fails to take equivalent codepoints into account when the

As far as I can tell, the function uses the codepoint->decomposition table to build a decomposition->codepoint mapping by hashing all the decomposition values to their codepoints. The problem here is that it assumes that if A maps to B then B must map to A. As I've demonstrated, this isn't the case for all codepoints. I have verified that removing duplicates from

I suspect that the correct method of solving this is to keep the equivalence state along with the decomposition data so it can be used when generating the composition table.
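To illustrate the collision described above, here is a rough sketch using Python's standard `unicodedata` module in place of PUCU's tables. `full_decomposition` and `build_composition_table` are hypothetical names, and the filter (skip any codepoint that is not its own NFC form) is just one assumed way of "keeping the equivalence state", not necessarily the fix the library needs. A real composer also works pairwise rather than on full decompositions; the simplification is only there to show the overwrite.

```python
import sys
import unicodedata


def full_decomposition(cp):
    """Full canonical decomposition of one codepoint, as a tuple of ints."""
    return tuple(ord(c) for c in unicodedata.normalize("NFD", chr(cp)))


def build_composition_table(skip_unstable):
    """Map full decompositions back to codepoints (hypothetical helper)."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        decomp = full_decomposition(cp)
        if decomp == (cp,):
            continue  # nothing to compose
        # Assumed filter: a codepoint that NFC itself rewrites (such as the
        # Angstrom Sign) must never be the result of a composition.
        if skip_unstable and unicodedata.normalize("NFC", chr(cp)) != chr(cp):
            continue
        table[decomp] = cp  # a later codepoint silently overwrites an earlier one
    return table


naive = build_composition_table(skip_unstable=False)
fixed = build_composition_table(skip_unstable=True)
key = (0x0041, 0x030A)
print(f"naive: ${naive[key]:04X}")  # naive: $212B
print(f"fixed: ${fixed[key]:04X}")  # fixed: $00C5
```

Without the filter, `$00C5` and `$212B` both hash to the same key and the last one written wins. With it, each key has at most one candidate, because two canonically equivalent codepoints cannot both be their own NFC form.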
I found another Unicode library and removed everything not related to normalization to make its tables smaller. Now I have a normalization-only library: https://github.com/benibela/internettools/blob/master/data/bbnormalizeunicode.pas You could test whether that works better in these cases.
Thanks, but I gave up and wrote my own implementation from scratch: It passes all 19,074 compose/decompose tests of the Unicode database.
Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint `$00C5`) produces the sequence `$0041 $030A`. This is correct.

However, composing the sequence `$0041 $030A` produces the codepoint `$212B` (Angstrom Sign). `$00C5` and `$212B` are equivalent codepoints but their normal form is `$00C5`, so the composition is wrong.

This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).
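The expected behaviour can be cross-checked against an independent implementation. The snippet below uses Python's standard `unicodedata` module purely as a reference; it says nothing about PUCU's API, and `codepoints` is a hypothetical formatting helper.

```python
import unicodedata


def codepoints(text):
    """Format a string as $XXXX codepoints, matching the notation above."""
    return " ".join(f"${ord(c):04X}" for c in text)


a_ring = "\u00C5"    # Latin Capital Letter A with Ring Above
angstrom = "\u212B"  # Angstrom Sign

decomposed = unicodedata.normalize("NFD", a_ring)
recomposed = unicodedata.normalize("NFC", decomposed)
angstrom_nfc = unicodedata.normalize("NFC", angstrom)

print(codepoints(decomposed))    # $0041 $030A
print(codepoints(recomposed))    # $00C5
print(codepoints(angstrom_nfc))  # $00C5
```

Both the decomposed sequence and the Angstrom Sign normalize to `$00C5` under NFC, which is the result a correct composer should produce.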