
Apparent bug in word splitting with Hangul character #90

Closed · Lucretiel opened this issue Feb 6, 2021 · 8 comments

Comments
@Lucretiel

Consider this string:

" abc를 "

According to Unicode's demo implementation of word segmentation, I'd expect this to be split into 4 words: " ", "abc", "를", and " ". However, the observed behavior (playground) is that it splits into only 3 words; "abc를" is grouped together as a single word.
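For a self-contained reproduction (in case the playground link rots), this is roughly what the observed behavior looks like using the crate's split_word_bounds, which is the API that would yield the whitespace segments listed above:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = " abc를 ";
    // split_word_bounds() yields every segment, whitespace included.
    let words: Vec<&str> = s.split_word_bounds().collect();
    // Observed: [" ", "abc를", " "] (three segments, with "abc를" kept
    // together), rather than the expected [" ", "abc", "를", " "].
    println!("{:?}", words);
}
```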

@mbrubeck
Contributor

mbrubeck commented Feb 6, 2021

This crate implements the Default Word Boundary Specification from UAX #29, which states:

The following is a general specification for word boundaries—language-specific rules in [CLDR] should be used where available.

and

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

It appears that the breaks.jsp tool linked above uses CLDR for segmentation. (It's linked from the CLDR site here.) According to this page, it sounds like CLDR always uses dictionary-based breaking for words in CJK scripts, including Hangul. Unfortunately, I don't know of any equivalent implementation in Rust.

@Lucretiel
Author

Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.

@mbrubeck
Contributor

mbrubeck commented Feb 6, 2021

Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.

I thought so too, and I'm not 100% sure what algorithm or settings that demo site is using.

However, it appears that the unicode-segmentation crate implements the default word boundary spec correctly. Both c and 를 have Word_Break = ALetter, and the spec says not to break between two ALetter characters.

@Lucretiel
Author

Ah, that's what I was looking for; I was having trouble finding out conclusively if 를 is an ideograph (since ALetter explicitly rejects ideograph characters). That's otherwise consistent with my reading of the specification; thanks for your help!

@Manishearth
Member

Our implementation is correct here; what's happening is that UAX #29 allows implementations to diverge from the algorithm in ways that affect degenerate cases like this one.

Hangul is not an ideographic or logographic writing system.

@pickfire

pickfire commented Mar 2, 2021

Do we have something to support line breaking per UAX #14 (https://www.unicode.org/reports/tr14/)?

withoutboats/heck#28 (comment)

Where "Hello, world. 你好,世界!" becomes "Hello, world. 你好, 世界!"? The punctuation is not strictly necessary. And even if a Han character is stuck to an English character, it would be separated, like "一a二" becoming "一 a 二"? Or maybe this is out of scope for this project? I thought that since the name is "unicode-segmentation", it should have something like this.

@mbrubeck
Contributor

mbrubeck commented Mar 2, 2021

Do we have something to support line breaking per UAX #14 (https://www.unicode.org/reports/tr14/)?

Servo uses xi-unicode for UAX 14 line breaking. There is also a unicode-linebreak crate.
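For anyone landing here, a minimal sketch of driving the unicode-linebreak crate (assuming its linebreaks API; the exact opportunities reported can vary with the Unicode version the crate targets):

```rust
use unicode_linebreak::{linebreaks, BreakOpportunity};

fn main() {
    let text = "Hello, world. 你好,世界!";
    // linebreaks() yields (byte_offset, BreakOpportunity) pairs per UAX #14.
    // An "allowed" opportunity is where a renderer may wrap the line.
    for (offset, kind) in linebreaks(text) {
        let label = match kind {
            BreakOpportunity::Mandatory => "mandatory",
            BreakOpportunity::Allowed => "allowed",
        };
        println!("{} break at byte {}", label, offset);
    }
}
```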

I don't think this is related to the current issue. Please open a new issue if there are more questions about this.

@pickfire

pickfire commented Mar 2, 2021

@Lucretiel What you said seems correct. UAX #29 itself discusses this case:

Normally word breaking does not require breaking between different scripts. However, adding that capability may be useful in combination with other extensions of word segmentation. For example, in Korean the sentence “I live in Chicago.” is written as three segments delimited by spaces:

나는  Chicago에  산다.

According to Korean standards, the grammatical suffixes, such as “에” meaning “in”, are considered separate words. Thus the above sentence would be broken into the following five words:

나,  는,  Chicago,  에, and  산다.

Separating the first two words requires a dictionary lookup, but for Latin text (“Chicago”) the separation is trivial based on the script boundary.
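To make the "trivial based on the script boundary" part concrete, here is a rough sketch (not part of this crate) that splits a string into script runs using the unicode-script crate; the script_runs helper and its handling of Common/Inherited characters are simplifications of my own:

```rust
use unicode_script::{Script, UnicodeScript};

// Hypothetical helper: split a string into maximal runs of a single script.
// Characters with Script=Common or Inherited (spaces, punctuation, marks)
// are simply absorbed into the current run to keep the sketch short.
fn script_runs(s: &str) -> Vec<&str> {
    let mut runs = Vec::new();
    let mut start = 0;
    let mut current: Option<Script> = None;
    for (i, ch) in s.char_indices() {
        let sc = ch.script();
        if matches!(sc, Script::Common | Script::Inherited) {
            continue; // absorb into the current run
        }
        match current {
            Some(prev) if prev != sc => {
                runs.push(&s[start..i]); // script changed: close the run
                start = i;
                current = Some(sc);
            }
            None => current = Some(sc),
            _ => {}
        }
    }
    runs.push(&s[start..]);
    runs
}

fn main() {
    // The Latin/Hangul boundary is found with no dictionary at all:
    assert_eq!(script_runs("Chicago에"), ["Chicago", "에"]);
    // Likewise for Han/Latin boundaries:
    assert_eq!(script_runs("一a二"), ["一", "a", "二"]);
}
```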
