Support for Chinese Languages #6

Closed
dracofyre123 opened this issue May 14, 2019 · 14 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@dracofyre123

Hello, do you have a timeline for when support for Simplified Chinese and Traditional Chinese will be added?

@james-s-w-clark

It's probably "done when it's done". If you're in a rush, you could help collect a corpus to train the models on and accelerate progress.

@pemistahl
Owner

Hi @dracofyre123, thank you for your interest in my library. I did not expect so much interest in it, to be honest. That makes me happy. :-)

I'm doing this project in my rare spare time, so unfortunately, I have to agree with @IdiosApps: It's done when it's done. I'm on vacation for the next couple of weeks, so there won't be any progress during that time. However, supporting Chinese is definitely on my to-do list, and I will add it in the next version if it is requested a lot.

Nevertheless, I hope you still find my library useful and continue to use it. There will be progress, I promise.

@pemistahl pemistahl added the question Further information is requested label May 15, 2019
@james-s-w-clark

@pemistahl it'd be very helpful if you could make a guide/video on how to create models with our own corpi. It could encourage potential users who express interest, but are put off by the lack of certain languages being supported.
I can try to help with Chinese if I know how to. This does look like a great library compared to Tika and Optimaize - the only downside is the number of languages supported.

Actually, if you look at Optimaize (last update 2 years ago), you'll see that it has several issues dealing with CJK - like detecting a mix of English and Chinese as Italian or French (with no probability of Chinese!).

@pemistahl pemistahl added the enhancement New feature or request label May 15, 2019
@pemistahl
Owner

pemistahl commented May 15, 2019

@IdiosApps I agree with you that a public API for language model creation would be useful. I simply wasn't aware of the fact that there are impatient users out there already who are eagerly awaiting this. You have been pretty quiet so far. ;-) But, since you have uttered your interest for this now, I will prioritize it for the next version. Thanks a lot for letting me know.

Currently, there is a function in the internal package that knows how to convert data from the Leipzig Corpora Collection which have been used for training. However, it is not meant to be used by end users in its current stage.

For the currently supported languages, language models of ngram lengths 1 to 5 are created. This does not make sense for Chinese because a single character represents an entire word. So, only ngrams of length 1 should be taken into account. Probably, there will be further problems with Chinese but I have some ideas in mind on how to deal with them.
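To make the n-gram point concrete, here is a minimal sketch of character n-gram extraction for lengths 1 to 5. It is only an illustration and not Lingua's actual implementation; the function name `extract_ngrams` and its `max_length` parameter are made up for this example.

```python
from typing import Dict, List


def extract_ngrams(text: str, max_length: int = 5) -> Dict[int, List[str]]:
    """Return all character n-grams of lengths 1..max_length, grouped by length.

    Hypothetical helper for illustration; not part of Lingua's API.
    """
    return {
        # A sliding window of width n over the text; empty if the text is shorter than n.
        n: [text[i:i + n] for i in range(len(text) - n + 1)]
        for n in range(1, max_length + 1)
    }


# Latin-script text: n-grams of lengths 1 to 5.
latin = extract_ngrams("hello")

# Chinese text restricted to unigrams, as suggested above.
chinese = extract_ngrams("中国", max_length=1)
```

With `max_length=1`, each Chinese character becomes its own feature, which is the restriction described above.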

My other goal for version 0.5.0 is multiplatform support so that Lingua can compete with JavaScript libraries in this field as well. But this needs time and, as I said, I'm gonna go on vacation soon. From time to time, my private life needs some care, too. :-)

Feel free to fork my repository and implement your needs yourselves if you don't want to wait. As I wrote in the README, pull requests are welcome.

By the way, the plural of "corpus" is "corpora", not "corpi".

@james-s-w-clark

@pemistahl I'd be happy to discuss Chinese if you like. Actually, some characters join together to make a new word/phrase. For example 中 middle + 国 kingdom = 中国 China. I think both bigrams and unigrams would be viable here.

Similarly, for a mix of tri-, bi-, and uni-grams:
你好吗 how are you?
你好 hello
你 you
好 good
吗 question marker

I'm not sure about the best way to find all ngrams that match words in a dictionary for Chinese - naively searching for hashes of all ngrams until the next punctuation mark may not be so bad.
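A rough sketch of that naive approach, assuming a small made-up dictionary built from the examples above: split the text at punctuation, then look up every n-gram of each run in a hash set. The names `DICTIONARY`, `PUNCTUATION`, and `dictionary_ngrams` are hypothetical, and the punctuation list is far from complete.

```python
import re

# Hypothetical mini-dictionary, taken from the examples in this thread.
DICTIONARY = {"你好吗", "你好", "你", "好", "吗", "中国", "中", "国"}

# A few common CJK and ASCII punctuation marks plus whitespace (not exhaustive).
PUNCTUATION = re.compile(r"[，。？！、；：,.?!;:\s]+")


def dictionary_ngrams(text, dictionary=DICTIONARY, max_length=3):
    """Return every n-gram (length 1..max_length) that appears in the dictionary.

    N-grams never cross a punctuation mark, because the text is split first.
    """
    matches = []
    for run in PUNCTUATION.split(text):
        for start in range(len(run)):
            for n in range(1, max_length + 1):
                candidate = run[start:start + n]
                # Skip windows that ran past the end of the run.
                if len(candidate) == n and candidate in dictionary:
                    matches.append(candidate)
    return matches


found = dictionary_ngrams("你好吗？中国")
```

Set membership is a hash lookup, so the cost is roughly the number of characters times `max_length`, which should be acceptable for short inputs.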

@dmbloch

dmbloch commented May 30, 2019

> I simply wasn't aware of the fact that there are impatient users out there already who are eagerly awaiting this. You have been pretty quiet so far. ;-) But, since you have uttered your interest for this now, I will prioritize it for the next version. Thanks a lot for letting me know.

Just chiming in as another user eagerly awaiting this! This is a fantastic and much-needed library. I look forward to seeing its development and perhaps contributing additional language support if that's an option in the future. In particular, I think the addition of CJK languages will round out the library quite well.

@pemistahl
Owner

@IdiosApps @dmbloch Thanks for your feedback and your ideas. I'm still on vacation for another week. Afterwards, I will continue development (slowly but steadily).

@pemistahl
Owner

@dracofyre123 @IdiosApps @dmbloch

I would say it already looks quite promising for Chinese. What do you think?

[Image: current average detection accuracy]

I'm going to add Japanese and Korean next.

@james-s-w-clark

@pemistahl looks great for every language!

@dracofyre123
Author

dracofyre123 commented Jul 14, 2019 via email

@pemistahl
Owner

@dracofyre123 I do not differentiate between traditional and simplified Chinese. Both should be recognized equally well as Chinese. I need to write some test cases for evidence.

@james-s-w-clark

> @dracofyre123 I do not differentiate between traditional and simplified Chinese. Both should be recognized equally well as Chinese. I need to write some test cases for evidence.

I think from a user's perspective, it could be handy to recognise the specific script of input - it might, for example, help them decide which script to automate a response with.
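As a very rough sketch of what script recognition could look like (this is not something Lingua does), one could count characters that exist in only one of the two scripts. The character sets below are tiny hand-picked samples and the function name `guess_script` is hypothetical; a usable version would need a full conversion table.

```python
# Tiny illustrative samples; a real table would need thousands of character pairs.
SIMPLIFIED_ONLY = set("国吗语说这")
TRADITIONAL_ONLY = set("國嗎語說這")


def guess_script(text):
    """Guess the Chinese script by counting script-specific characters.

    Returns "simplified", "traditional", or "undetermined" when the counts tie
    (for example when the text only uses characters shared by both scripts).
    """
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    return "undetermined"


script_a = guess_script("中国")
script_b = guess_script("中國")
script_c = guess_script("你好")
```

Short inputs often contain only shared characters, so the "undetermined" case would be common in practice.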

@pemistahl
Owner

I'm gonna close this issue as Chinese is supported now. There was not enough separate training data for traditional and simplified Chinese, which is why I mixed them. I will try to differentiate between both in a later version of the library.

@dracofyre123
Author

dracofyre123 commented Aug 14, 2019 via email

4 participants