
Apparent bug in word splitting with Hangul character #90

Closed · Lucretiel opened this issue Feb 6, 2021 · 8 comments

Comments
@Lucretiel

Consider this string:

" abc를 "

According to Unicode's demo implementation of word segmentation, I'd expect this to be split into 4 words: " ", "abc", "를", and " ". However, the observed behavior (playground) is that it splits into only 3 words; "abc를" is grouped together as a single word.
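For a self-contained reproduction (in case the playground link rots), this is roughly what the observed behavior looks like using the crate's split_word_bounds, which is the API that would yield the whitespace segments listed above:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = " abc를 ";
    // split_word_bounds() yields every segment, whitespace included.
    let words: Vec<&str> = s.split_word_bounds().collect();
    // Observed: [" ", "abc를", " "] (three segments, with "abc를" kept
    // together), rather than the expected [" ", "abc", "를", " "].
    println!("{:?}", words);
}
```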

@mbrubeck
Contributor

mbrubeck commented Feb 6, 2021

This crate implements the Default Word Boundary Specification from UAX #29, which states:

The following is a general specification for word boundaries—language-specific rules in [CLDR] should be used where available.

and

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

It appears that the breaks.jsp tool linked above uses CLDR for segmentation. (It's linked from the CLDR site here.) According to this page, it sounds like CLDR always uses dictionary-based breaking for words in CJK scripts, including Hangul. Unfortunately, I don't know of any equivalent implementation in Rust.

@Lucretiel
Author

Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.

@mbrubeck
Contributor

mbrubeck commented Feb 6, 2021

Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.

I thought so too, and I'm not 100% sure what algorithm or settings that demo site is using.

However, it appears that the unicode-segmentation crate implements the default word boundary spec correctly. Both c and 를 have Word_Break = ALetter, and the spec says not to break between two ALetter characters.

@Lucretiel
Author

Ah, that's what I was looking for; I was having trouble finding out conclusively if 를 is an ideograph (since ALetter explicitly rejects ideograph characters). That's otherwise consistent with my reading of the specification; thanks for your help!

@Manishearth
Member

Our implementation is correct here; what's happening is that UAX #29 allows implementations to diverge from the algorithm in ways that affect degenerate cases like this one.

Hangul is not an ideographic or logographic writing system.

@pickfire

pickfire commented Mar 2, 2021

Do we have something to support line breaking per UAX #14 (https://www.unicode.org/reports/tr14/)?

withoutboats/heck#28 (comment)

Where "Hello, world. 你好,世界!" becomes "Hello, world. 你好, 世界!"? The punctuation is not strictly necessary. And even if a Han character is stuck to an English character, it would be separated, like "一a二" becoming "一 a 二"? Or maybe this is out of scope for this project? I thought that since the name is "unicode-segmentation", it should have something like this.

@mbrubeck
Contributor

mbrubeck commented Mar 2, 2021

Do we have something to support line breaking per UAX #14 (https://www.unicode.org/reports/tr14/)?

Servo uses xi-unicode for UAX 14 line breaking. There is also a unicode-linebreak crate.
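For anyone landing here, a minimal sketch of driving the unicode-linebreak crate (assuming its linebreaks API; the exact opportunities reported can vary with the Unicode version the crate targets):

```rust
use unicode_linebreak::{linebreaks, BreakOpportunity};

fn main() {
    let text = "Hello, world. 你好,世界!";
    // linebreaks() yields (byte_offset, BreakOpportunity) pairs per UAX #14.
    // An "allowed" opportunity is where a renderer may wrap the line.
    for (offset, kind) in linebreaks(text) {
        let label = match kind {
            BreakOpportunity::Mandatory => "mandatory",
            BreakOpportunity::Allowed => "allowed",
        };
        println!("{} break at byte {}", label, offset);
    }
}
```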

I don't think this is related to the current issue. Please open a new issue if there are more questions about this.

@pickfire

pickfire commented Mar 2, 2021

@Lucretiel What you said seems correct. UAX #29 itself discusses this case:

Normally word breaking does not require breaking between different scripts. However, adding that capability may be useful in combination with other extensions of word segmentation. For example, in Korean the sentence “I live in Chicago.” is written as three segments delimited by spaces:

나는  Chicago에  산다.

According to Korean standards, the grammatical suffixes, such as “에” meaning “in”, are considered separate words. Thus the above sentence would be broken into the following five words:

나,  는,  Chicago,  에, and  산다.

Separating the first two words requires a dictionary lookup, but for Latin text (“Chicago”) the separation is trivial based on the script boundary.
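To make the "trivial based on the script boundary" part concrete, here is a rough sketch (not part of this crate) that splits a string into script runs using the unicode-script crate; the script_runs helper and its handling of Common/Inherited characters are simplifications of my own:

```rust
use unicode_script::{Script, UnicodeScript};

// Hypothetical helper: split a string into maximal runs of a single script.
// Characters with Script=Common or Inherited (spaces, punctuation, marks)
// are simply absorbed into the current run to keep the sketch short.
fn script_runs(s: &str) -> Vec<&str> {
    let mut runs = Vec::new();
    let mut start = 0;
    let mut current: Option<Script> = None;
    for (i, ch) in s.char_indices() {
        let sc = ch.script();
        if matches!(sc, Script::Common | Script::Inherited) {
            continue; // absorb into the current run
        }
        match current {
            Some(prev) if prev != sc => {
                runs.push(&s[start..i]); // script changed: close the run
                start = i;
                current = Some(sc);
            }
            None => current = Some(sc),
            _ => {}
        }
    }
    runs.push(&s[start..]);
    runs
}

fn main() {
    // The Latin/Hangul boundary is found with no dictionary at all:
    assert_eq!(script_runs("Chicago에"), ["Chicago", "에"]);
    // Likewise for Han/Latin boundaries:
    assert_eq!(script_runs("一a二"), ["一", "a", "二"]);
}
```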
