Support for Chinese Languages #6

dracofyre123 · 2019-05-14T16:50:20Z

Hello, do you have a timeline for when support for Simplified Chinese and Traditional Chinese will be added?

james-s-w-clark · 2019-05-14T17:12:57Z

It's probably "done when it's done". If you're in a rush, you could help collect a corpus to train the models on and accelerate progress.

pemistahl · 2019-05-15T06:18:17Z

Hi @dracofyre123, thank you for your interest in my library. I did not expect so much interest in it, to be honest. That makes me happy. :-)

I'm doing this project in my rare spare time, so unfortunately, I have to agree with @IdiosApps: It's done when it's done. I'm on vacation for the next couple of weeks, so there won't be any progress during that time. However, supporting Chinese is definitely on my to-do list and I will add it in the next version if this is requested a lot.

Nevertheless, I hope you still find my library useful and continue to use it. There will be progress, I promise.

james-s-w-clark · 2019-05-15T07:18:14Z

@pemistahl it'd be very helpful if you could make a guide/video on how to create models with our own corpi. It could encourage potential users who express interest, but are put off by the lack of certain languages being supported.
I can try to help with Chinese if I know how to. This does look like a great library compared to Tika and Optimaize - the only downside is the number of languages supported.

Actually, if you look at Optimaize (last update 2 years ago), you'll see that it has several issues dealing with CJK - like detecting a mix of English and Chinese as Italian or French (with no probability of Chinese!).

pemistahl · 2019-05-15T09:30:07Z

@IdiosApps I agree with you that a public API for language model creation would be useful. I simply wasn't aware of the fact that there are impatient users out there already who are eagerly awaiting this. You have been pretty quiet so far. ;-) But, since you have uttered your interest for this now, I will prioritize it for the next version. Thanks a lot for letting me know.

Currently, there is a function in the internal package that knows how to convert data from the Leipzig Corpora Collection, which has been used for training. However, it is not meant to be used by end users in its current state.

For the currently supported languages, language models of ngram lengths 1 to 5 are created. This does not make sense for Chinese because a single character represents an entire word. So, only ngrams of length 1 should be taken into account. Probably, there will be further problems with Chinese but I have some ideas in mind on how to deal with them.
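To make the ngram lengths concrete, here is a minimal sketch of character ngram extraction. It is only an illustration and not the library's internal code; the function name `char_ngrams` and its parameters are hypothetical.

```python
def char_ngrams(text, min_n=1, max_n=5):
    """Return the character ngrams of `text`, grouped by ngram length.

    Hypothetical helper for illustration; not part of Lingua.
    """
    return {
        n: [text[i:i + n] for i in range(len(text) - n + 1)]
        for n in range(min_n, max_n + 1)
    }


# Lengths 1 to 5, as described for the currently supported languages.
latin = char_ngrams("hello")

# Unigrams only, as proposed here for Chinese.
chinese = char_ngrams("中国", max_n=1)
```

With `max_n=1`, each Chinese character becomes its own ngram, so `chinese` contains only the entries for 中 and 国.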

My other goal for version 0.5.0 is multiplatform support so that Lingua can compete with JavaScript libraries in this field as well. But this needs time and, as I said, I'm gonna go on vacation soon. From time to time, my private life needs some care, too. :-)

Feel free to fork my repository and implement your needs yourselves if you don't want to wait. As I wrote in the README, pull requests are welcome.

By the way, the plural of "corpus" is "corpora", not "corpi".

james-s-w-clark · 2019-05-15T21:38:46Z

@pemistahl I'd be happy to discuss Chinese if you like. Actually, some characters join together to make a new word/phrase. For example 中 middle + 国 kingdom = 中国 China. I think both bigrams and unigrams would be viable here.

Similarly, for a mix of tri-, bi-, and uni-grams:
你好吗 how are you?
你好 hello
你 you
好 good
吗 question marker

I'm not sure about the best way to find all ngrams that match words in a dictionary for Chinese - naively searching for hashes of all ngrams until the next punctuation mark may not be so bad.
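A rough sketch of that naive approach, assuming a toy dictionary and a small, incomplete set of punctuation marks (both are placeholders, and `dictionary_ngrams` is a made-up name). Membership tests on a Python `set` are hash lookups, so this corresponds to "searching for hashes of all ngrams".

```python
import re

# Toy dictionary for illustration; a real one would be loaded from a word list.
DICTIONARY = {"你好吗", "你好", "你", "好", "吗", "中国", "中", "国"}

# A few common CJK and ASCII punctuation marks plus whitespace (not exhaustive).
PUNCTUATION = re.compile(r"[，。！？、；：,.!?;:\s]+")


def dictionary_ngrams(text, dictionary=DICTIONARY, max_n=3):
    """Collect every ngram (up to max_n characters) that is a dictionary word.

    Ngrams never cross a punctuation mark, because the text is split first.
    """
    matches = []
    for segment in PUNCTUATION.split(text):
        for start in range(len(segment)):
            for n in range(1, max_n + 1):
                if start + n > len(segment):
                    break
                candidate = segment[start:start + n]
                if candidate in dictionary:  # hash lookup in the set
                    matches.append(candidate)
    return matches


found = dictionary_ngrams("你好吗？")
# found == ["你", "你好", "你好吗", "好", "吗"]
```

The cost is roughly the number of characters times `max_n` lookups per segment, which is why the naive search "may not be so bad" for short ngram lengths.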

dmbloch · 2019-05-30T18:10:25Z

> I simply wasn't aware of the fact that there are impatient users out there already who are eagerly awaiting this. You have been pretty quiet so far. ;-) But, since you have uttered your interest for this now, I will prioritize it for the next version. Thanks a lot for letting me know.

Just chiming in as another user eagerly awaiting this! This is a fantastic and much-needed library. I look forward to seeing its development and perhaps contributing by adding additional language support if that's an option in the future. In particular, the addition of CJK languages will round out the library quite well, I think.

pemistahl · 2019-05-30T20:29:16Z

@IdiosApps @dmbloch Thanks for your feedback and your ideas. I'm still on vacation for another week. Afterwards, I will continue development (slowly but steadily).

pemistahl · 2019-07-14T16:40:38Z

@dracofyre123 @IdiosApps @dmbloch

I would say it looks quite promising already for Chinese. What do you think?

I'm going to add Japanese and Korean next.

james-s-w-clark · 2019-07-14T17:11:13Z

@pemistahl looks great for every language!

dracofyre123 · 2019-07-14T17:24:57Z

That looks great. Just curious, which version of Chinese are you referring to: traditional or simplified? In either case, this is very nice and exciting news.

pemistahl · 2019-07-14T18:23:46Z

@dracofyre123 I do not differentiate between traditional and simplified Chinese. Both should be recognized equally well as Chinese. I need to write some test cases for evidence.

james-s-w-clark · 2019-07-14T18:49:53Z

> @dracofyre123 I do not differentiate between traditional and simplified Chinese. Both should be recognized equally well as Chinese. I need to write some test cases for evidence.

I think from a user's perspective, it could be handy to recognise the specific script of input - it might, for example, help them decide which script to automate a response with.
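As a very rough sketch of what such script recognition could look like, assuming a tiny hand-picked table of characters that differ between the two scripts (the table, the function name `guess_script`, and the return labels are all made up for illustration; a usable version would need a full conversion table):

```python
# A few character pairs that differ between the scripts (simplified, traditional).
# Deliberately tiny; most Chinese characters are shared by both scripts.
PAIRS = [("国", "國"), ("吗", "嗎"), ("们", "們"), ("这", "這"), ("说", "說")]
SIMPLIFIED_ONLY = {simplified for simplified, _ in PAIRS}
TRADITIONAL_ONLY = {traditional for _, traditional in PAIRS}


def guess_script(text):
    """Guess the script from characters that exist in only one of the two sets."""
    has_simplified = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_traditional = any(ch in TRADITIONAL_ONLY for ch in text)
    if has_simplified and has_traditional:
        return "mixed"
    if has_simplified:
        return "simplified"
    if has_traditional:
        return "traditional"
    return "unknown"  # only shared characters, e.g. 你好


results = [guess_script(t) for t in ("中国", "中國", "你好")]
# results == ["simplified", "traditional", "unknown"]
```

Short inputs made up only of shared characters cannot be told apart this way, which is one reason the distinction is harder than plain language detection.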

pemistahl · 2019-08-12T18:04:01Z

I'm gonna close this issue as Chinese is supported now. There was not enough separate training data for traditional and simplified Chinese, which is why I mixed them. I will try to differentiate between both in a later version of the library.

dracofyre123 · 2019-08-14T10:26:23Z

OK, thank you for taking a look at this.

pemistahl added the question label May 15, 2019

pemistahl added the enhancement label May 15, 2019

pemistahl closed this as completed Aug 12, 2019

reececomo mentioned this issue Dec 18, 2023

Simplified & Traditional Chinese #192

Open
