
What is the correct answer for detect() on a multilingual string? #13

Open
fergald opened this issue Jul 10, 2024 · 2 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@fergald

fergald commented Jul 10, 2024

If a string is 100 chars of English followed by 900 characters of French, what is the ideal result? Is it the following?

[
  {language: "en", confidence: 0.1},
  {language: "fr", confidence: 0.9},
]

I haven't been able to come up with a better idea than that each language in the result should tell you what fraction of the string is in that language.

This gets more complicated when the language of a segment of the string is itself ambiguous. E.g. for an English article talking about words that are shared between Chinese and Japanese, what is the correct answer? Assume the text is 80% English and the other 20% is words that could be either Chinese or Japanese. What is the ideal result? Is it

[
  {language: "en", confidence: 0.8},
  {language: "ja", confidence: 0.1},
  {language: "zh", confidence: 0.1},
]

even though 20% of the text is Japanese and 20% is Chinese? I can't think of a better "correct" answer but maybe there is one.
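To make the accounting concrete, here is a minimal sketch of one way to arrive at numbers like these. It assumes the text has already been divided into segments and that each segment lists every language it could plausibly be. The function name `languageFractions`, the segment shape, and the rule of splitting an ambiguous segment evenly among its candidates are all illustrative assumptions, not something the API defines.

```javascript
// Hypothetical helper: turn labelled segments into per-language fractions.
// Each segment is {length, candidates}, where `candidates` lists every
// language the segment could be. Assumption: an ambiguous segment's length
// is split evenly among its candidates, so the results sum to 1.
function languageFractions(segments) {
  const totals = new Map();
  let totalLength = 0;
  for (const { length, candidates } of segments) {
    totalLength += length;
    const share = length / candidates.length;
    for (const language of candidates) {
      totals.set(language, (totals.get(language) ?? 0) + share);
    }
  }
  if (totalLength === 0) return [];
  return [...totals]
    .map(([language, weight]) => ({ language, confidence: weight / totalLength }))
    .sort((a, b) => b.confidence - a.confidence);
}

// 800 characters of English plus 200 characters that could be Japanese or Chinese.
const ambiguousExample = languageFractions([
  { length: 800, candidates: ["en"] },
  { length: 200, candidates: ["ja", "zh"] },
]);
console.log(ambiguousExample);
```

Under the even-split assumption, "en" gets 0.8 and "ja" and "zh" get 0.1 each. Crediting each candidate with the full ambiguous span would instead give 0.2 each, and the values would no longer sum to 1.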

Also from an implementation perspective, the above "correct" answer is relatively easy. Models may have a fixed maximum input size and the above can be calculated by breaking the string into chunks and averaging over the results for each chunk.
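A rough sketch of that chunk-and-average approach follows. Here `classifyChunk` is a hypothetical stand-in for a model with a fixed maximum input size, and `detectByChunks`, `chunkSize`, and the toy classifier are made up for illustration. Two assumptions go beyond the description above: each chunk's result is weighted by its length, so a short final chunk counts for less than a full one, and the string is sliced by UTF-16 code unit, which a real implementation would likely replace with splitting on grapheme or word boundaries.

```javascript
// Hypothetical sketch: split `text` into fixed-size chunks, classify each one,
// and combine the per-chunk confidences into a length-weighted average.
function detectByChunks(text, classifyChunk, chunkSize = 100) {
  if (text.length === 0) return [];
  const totals = new Map();
  for (let start = 0; start < text.length; start += chunkSize) {
    const chunk = text.slice(start, start + chunkSize);
    for (const { language, confidence } of classifyChunk(chunk)) {
      // Weight by chunk length so every character contributes equally.
      totals.set(language, (totals.get(language) ?? 0) + confidence * chunk.length);
    }
  }
  return [...totals]
    .map(([language, weight]) => ({ language, confidence: weight / text.length }))
    .sort((a, b) => b.confidence - a.confidence);
}

// Toy classifier for demonstration only: treats "f" as French and anything else as English.
const toyClassifier = (chunk) => {
  const frenchShare = [...chunk].filter((c) => c === "f").length / chunk.length;
  return [
    { language: "en", confidence: 1 - frenchShare },
    { language: "fr", confidence: frenchShare },
  ];
};

// 100 "English" characters followed by 900 "French" characters.
const chunkedExample = detectByChunks("e".repeat(100) + "f".repeat(900), toyClassifier);
console.log(chunkedExample);
```

With the toy classifier this ranks "fr" at roughly 0.9 and "en" at roughly 0.1, matching the fraction-of-the-string interpretation from the first example in this comment.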

Questions

  • Should we even be trying to spec this level of detail?
  • If so, should we spec the above?
@domenic
Collaborator

domenic commented Aug 15, 2024

Some interesting precedents:

Many other APIs don't seem to explicitly document their strategy.

@xfq xfq added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Oct 9, 2024
@fergald
Author

fergald commented Feb 10, 2025

I think annotated spans are the luxury version, but they will be harder to implement and are overkill for many uses. Models that take a short text and assign weights to various languages already exist, and by breaking a longer string into chunks and averaging over the results you naturally get something that measures how much of each language there is in a given text.

3 participants