Support for Chinese Languages #6
Comments
It's probably "done when it's done". If you're in a rush, you could help collect a corpus to train the models on and accelerate progress. |
Hi @dracofyre123, thank you for your interest in my library. I did not expect so much interest in it, to be honest. That makes me happy. :-) I'm doing this project in my rare spare time, so unfortunately, I have to agree with @IdiosApps: It's done when it's done. I'm on vacation for the next couple of weeks, so there won't be any progress during that time. However, supporting Chinese is definitely on my to-do list and I will add it in the next version if it is requested a lot. Nevertheless, I hope you still find my library useful and continue to use it. There will be progress, I promise. |
@pemistahl it'd be very helpful if you could make a guide/video on how to create models with our own corpi. It could encourage potential users who express interest, but are put off by the lack of certain languages being supported. Actually, if you look at Optimaize (last update 2 years ago), you'll see that it has several issues dealing with CJK - like detecting a mix of English and Chinese as Italian or French (with no probability of Chinese!). |
@IdiosApps I agree with you that a public API for language model creation would be useful. I simply wasn't aware of the fact that there are impatient users out there already who are eagerly awaiting this. You have been pretty quiet so far. ;-) But, since you have expressed your interest in this now, I will prioritize it for the next version. Thanks a lot for letting me know. Currently, there is a function in the For the currently supported languages, language models of ngram lengths 1 to 5 are created. This does not make sense for Chinese because a single character represents an entire word. So, only ngrams of length 1 should be taken into account. Probably, there will be further problems with Chinese but I have some ideas in mind on how to deal with them. My other goal for version 0.5.0 is multiplatform support so that Lingua can compete with JavaScript libraries in this field as well. But this needs time and, as I said, I'm gonna go on vacation soon. From time to time, my private life needs some care, too. :-) Feel free to fork my repository and implement your needs yourselves if you don't want to wait. As I wrote in the README, pull requests are welcome. By the way, the plural of "corpus" is "corpora", not "corpi". |
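To make the ngram point above concrete, here is a minimal Python sketch of counting character ngrams of lengths 1 to 5 for a Latin-script string and of length 1 only for a Chinese string. It is not Lingua's actual model-creation code; the function names `char_ngrams` and `ngram_counts` are made up for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character ngrams of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, lengths):
    """Map each requested ngram length to a Counter of the ngrams found."""
    return {n: Counter(char_ngrams(text, n)) for n in lengths}


# Lengths 1 to 5, as described for the currently supported languages.
latin = ngram_counts("hello", range(1, 6))

# Length 1 only, as proposed for Chinese in the comment above.
chinese = ngram_counts("中国人", [1])
```

In this sketch the set of ngram lengths is simply a parameter, so the same counting routine can serve both cases.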
@pemistahl I'd be happy to discuss Chinese if you like. Actually, some characters join together to make a new word/phrase. For example 中 middle + 国 kingdom = 中国 China. I think both bigrams and unigrams would be viable here. Similarly, for a mix of tri-, bi-, and uni-grams: I'm not sure about the best way to find all ngrams that match words in a dictionary for Chinese - naively searching for hashes of all ngrams until the next punctuation mark may not be so bad. |
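The naive dictionary lookup described above could look roughly like the following Python sketch. It assumes a tiny hypothetical word set (`DICTIONARY`) and a hand-picked punctuation pattern; neither comes from Lingua, and a real dictionary would be far larger. Membership tests on a Python `set` are hash lookups, which is the "searching for hashes" part.

```python
import re

# Hypothetical toy dictionary, for illustration only.
DICTIONARY = {"中", "国", "中国", "人", "中国人"}

# Assumed set of delimiters: common CJK and ASCII punctuation plus whitespace.
PUNCTUATION = re.compile(r"[，。！？、；：,.!?;:\s]+")


def dictionary_ngrams(text, dictionary, max_n=3):
    """Collect every uni-, bi-, and trigram that is a dictionary entry.

    The text is split at punctuation first, so no ngram spans two segments.
    """
    matches = []
    for segment in PUNCTUATION.split(text):
        for start in range(len(segment)):
            for n in range(1, max_n + 1):
                candidate = segment[start:start + n]
                # Skip truncated slices at the end of a segment.
                if len(candidate) == n and candidate in dictionary:
                    matches.append(candidate)
    return matches


found = dictionary_ngrams("中国人。", DICTIONARY)
```

The cost grows with text length times `max_n`, which is why the naive approach "may not be so bad" for short inputs.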
Just chiming in as another user eagerly awaiting this! This is a fantastic and much-needed library. I look forward to seeing its development and perhaps contributing by adding additional language support if that's an option in the future. In particular, the addition of CJK languages will round out the library quite well, I think. |
@IdiosApps @dmbloch Thanks for your feedback and your ideas. I'm still on vacation for another week. Afterwards, I will continue development (slowly but steadily). |
@dracofyre123 @IdiosApps @dmbloch I would say, looks quite promising already for Chinese. What do you think? I'm going to add Japanese and Korean next. |
@pemistahl looks great for every language! |
That looks great. Just curious which version of Chinese are you referring to? Is it traditional versus simplified? In either case this is very nice and exciting news.
|
@dracofyre123 I do not differentiate between traditional and simplified Chinese. Both should be recognized equally well as Chinese. I need to write some test cases for evidence. |
I think from a user's perspective, it could be handy to recognise the specific script of input - it might, for example, help them decide which script to automate a response with. |
I'm gonna close this issue as Chinese is supported now. There was not enough separate training data for traditional and simplified Chinese, that's why I mixed them. I will try to differentiate between both in a later version of the library. |
K thank you for taking a look at this.
|
Hello, do you have a timeline for when support for Simplified Chinese and Traditional Chinese will be added?