Detect multiple languages in mixed-language text #38
Comments
I think the following approach could work specifically for Lingua. Some of the following points have footnotes describing further considerations. Note that this is not a scientific approach; there might be better and more performant solutions.
I have implemented this in my fork (file Footnotes
Is it solved?
@kargaranamir I've already implemented an algorithm in my other implementations of Lingua (Go, Rust, Python). I haven't found the time yet to implement it here. So yes, it's generally solved but not yet implemented in this library.
Thanks for the reply @pemistahl. I just checked the Python version. In the example, LanguageDetectorBuilder is restricted to three languages and then predicts among them. I wonder, does it still work if I run it on all languages supported by Lingua, and even if I pass monolingual sentences?
@kargaranamir This feature is still experimental. The more languages you add to the mix, the more inaccurate the result will be. If you can restrict the number of possible languages beforehand, then do so, as it will produce better results in most cases.
Closed in favor of #214. The Rust implementation already contains this feature.
Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.
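To make the requested result shape concrete, here is a minimal Python sketch of an ordered sequence of (start index, end index, language) items. It is not Lingua's algorithm or API: the `Section` type, the `TOY_LEXICON` word lists and the functions `guess_word_language` and `detect_sections` are all hypothetical names made up for illustration, and the per-word lookup merely stands in for a real language model.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Section:
    start: int      # inclusive character offset into the input
    end: int        # exclusive character offset into the input
    language: str


# Hypothetical toy word lists standing in for a real per-word language model.
TOY_LEXICON = {
    "ENGLISH": {"he", "turned", "around", "and", "asked"},
    "GERMAN": {"entschuldigen", "sie", "sprechen", "deutsch"},
}


def guess_word_language(word: str) -> Optional[str]:
    """Return the toy language label for a word, or None if it is unknown."""
    lowered = word.lower()
    for language, words in TOY_LEXICON.items():
        if lowered in words:
            return language
    return None


def detect_sections(text: str) -> List[Section]:
    """Merge consecutive words with the same label into contiguous sections."""
    sections: List[Section] = []
    for match in re.finditer(r"\w+", text):
        language = guess_word_language(match.group())
        if language is None:
            continue  # unknown words neither start nor end a section
        if sections and sections[-1].language == language:
            # Same language as the previous section: extend its end index.
            last = sections[-1]
            sections[-1] = Section(last.start, match.end(), language)
        else:
            sections.append(Section(match.start(), match.end(), language))
    return sections


text = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'
for section in detect_sections(text):
    print(section.language, repr(text[section.start:section.end]))
```

For this sentence the sketch yields two sections in order, an English one covering `He turned around and asked` and a German one covering `Entschuldigen Sie, sprechen Sie Deutsch`. A real detector would have to decide section boundaries from model probabilities rather than a fixed lexicon, which is why restricting the candidate languages matters.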
Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"
Output: