Fix zwsp classification #81

Closed
Entomy opened this issue Apr 1, 2020 · 1 comment
Labels
🐝 Bug Something isn't working 🛠 Enhancement New feature or request 👨🏻‍🎓 Good First Issue Good for newcomers 🆘 Help Wanted Extra attention is needed

Comments

Owner

Entomy commented Apr 1, 2020

As proposed in L2/08-344, U+200B (the zwsp) was reclassified in a way that seriously broke several of the world's languages. This is definitely a mistake, and SIL agrees. Microsoft doesn't seem to agree, however, and I won't have anything I write and maintain be exclusionary to any group, especially on a strictly ethnic basis.

IsWhitespace(Char) and IsWhitespace(Rune) are affected by this in this repo.
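For anyone who wants to see the classification in question, here is a minimal sketch in Python (not this repo's C#) that reads the general category of U+200B from the standard `unicodedata` module. The helper name `is_space_separator` is made up for illustration and is not part of this repo's API.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def is_space_separator(ch: str) -> bool:
    """Hypothetical helper: True only for characters in general category Zs."""
    return unicodedata.category(ch) == "Zs"

# General category reported by Python's bundled Unicode database.
print(unicodedata.category(ZWSP))

# An ordinary space and a no-break space are Zs; zwsp is not.
print(is_space_separator(" "), is_space_separator("\u00a0"), is_space_separator(ZWSP))
```

A whitespace predicate built purely on the Zs category will therefore not see zwsp at all, which is the behaviour under discussion here.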

@Entomy Entomy added 🐝 Bug Something isn't working 🛠 Enhancement New feature or request 🆘 Help Wanted Extra attention is needed 👨🏻‍🎓 Good First Issue Good for newcomers labels Apr 1, 2020
Owner Author

Entomy commented Apr 3, 2020

So I've looked into this more, running various tests with some random Thai and Myanmar samples I found, and asking some native speakers about correctness (god bless the internet).

I don't actually agree with any of the above parties.

The UNICODE Consortium is correct that classifying zwsp as Zs (space separator) can cause combining marks to behave incorrectly. In practice this shouldn't happen often, if at all, but as plenty of people who review my code or comments can tell, I refuse to operate on the assumption of "shouldn't ever happen". Such a sequence would be valid UNICODE and must be handled.

However, L2/08-334 is correct that UAX#29.4 was broken by the change.

What I'm going to propose is that zwsp remains classified as Cf and not Zs, but that word boundary detection (https://github.com/Stringier/Literary/issues/13) accurately detects and handles zwsp as a word boundary for Thai, Myanmar, Khmer, and others.
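A rough sketch of that split in responsibilities, again in Python rather than this repo's C#: the whitespace check leaves zwsp alone, while a separate word splitter treats it as a boundary. `split_words` is a hypothetical name, and this is a naive splitter for illustration only, not an implementation of UAX#29.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE, general category Cf

# Boundary pattern: any run of Unicode whitespace and/or zwsp.
# zwsp is listed explicitly because \s does not match it.
_BOUNDARY = re.compile(r"[\s\u200b]+")

def split_words(text: str) -> list[str]:
    """Hypothetical splitter: breaks on whitespace and on zwsp."""
    return [word for word in _BOUNDARY.split(text) if word]

# zwsp is still not reported as whitespace...
print(ZWSP.isspace())

# ...but it does separate words, e.g. in Thai text with no visible spaces.
print(split_words("ภาษา" + ZWSP + "ไทย"))
```

The point of keeping the two concerns apart is that the classification stays as published while word boundaries still come out right for text that relies on zwsp.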

@Entomy Entomy closed this as completed Apr 3, 2020