Apparent bug in word splitting with Hangul character #90
Comments
This crate implements the Default Word Boundary Specification from UAX29, which states:
and
It appears that the |
Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules. |
I thought so too, and I'm not 100% sure what algorithm or settings that demo site is using. However, it appears that the |
Ah, that's what I was looking for; I was having trouble finding out conclusively if 를 is an ideograph (since ALetter explicitly rejects ideograph characters). That's otherwise consistent with my reading of the specification; thanks for your help! |
Our implementation is correct here. What's happening is that UAX 29 allows implementations to diverge from the algorithm in ways that affect degenerate cases, like this one. Hangul is not an ideographic or logographic writing system. |
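To make the grouping concrete, here is a minimal sketch in Rust. It is not this crate's code. It assumes a toy classifier (the hypothetical `classify`) that hard-codes ASCII letters and precomposed Hangul syllables (U+AC00 to U+D7A3) as ALetter, and it applies only two of the UAX 29 rules: WB5 (no break between letters) and WB3d (no break between spaces). The names `Class`, `classify`, and `split_words` are made up for illustration.

```rust
/// Coarse stand-ins for a few UAX #29 Word_Break classes (illustrative only).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    ALetter,
    Space,
    Other,
}

/// Hypothetical classifier: hard-codes ASCII letters and precomposed Hangul
/// syllables (U+AC00..=U+D7A3) as ALetter. The real property table is far larger.
fn classify(c: char) -> Class {
    match c {
        'a'..='z' | 'A'..='Z' => Class::ALetter,
        '\u{AC00}'..='\u{D7A3}' => Class::ALetter,
        ' ' => Class::Space,
        _ => Class::Other,
    }
}

/// Splits `s` at every character boundary except between two ALetter
/// characters (rule WB5) and between two spaces (rule WB3d).
/// All other UAX #29 rules are deliberately omitted from this sketch.
fn split_words(s: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut prev: Option<Class> = None;
    for (i, c) in s.char_indices() {
        let cur = classify(c);
        if let Some(p) = prev {
            let joined = (p == Class::ALetter && cur == Class::ALetter)
                || (p == Class::Space && cur == Class::Space);
            if !joined {
                // A boundary falls before `c`: emit the piece collected so far.
                words.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(cur);
    }
    if start < s.len() {
        words.push(&s[start..]);
    }
    words
}
```

Under these assumptions, `split_words(" abc를 ")` returns three pieces, with "abc를" kept together, because "c" and "를" are both treated as ALetter and WB5 forbids a break between them. The real crate exposes this through `split_word_bounds()` on `UnicodeSegmentation`.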
Do we have something to support UAX 14 line breaking (https://www.unicode.org/reports/tr14/)? See withoutboats/heck#28 (comment). Where |
Servo uses xi-unicode for UAX 14 line breaking. There is also a unicode-linebreak crate. I don't think this is related to the current issue. Please open a new issue if there are more questions about this. |
@Lucretiel What you said seemed correct.
|
Consider this string:
" abc를 "
According to Unicode's demo implementation of word segmentation, I'd expect this to be split into 4 words: " ", "abc", "를", and " ". However, the observed behavior (playground) is that it only splits into 3 words; the "abc를" is grouped together.