Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify lang detection #299

Merged
merged 8 commits into from
Jul 25, 2024
Merged

Simplify lang detection #299

merged 8 commits into from
Jul 25, 2024

Conversation

ManyTheFish
Copy link
Member

@ManyTheFish ManyTheFish commented Jul 10, 2024

  • Change the language allow_list from a map of script->language to an array of allowed languages
  • Allow to dynamically change the language allow list when tokenizing text

@ManyTheFish ManyTheFish marked this pull request as draft July 10, 2024 06:40
@ManyTheFish ManyTheFish force-pushed the simplify-lang-detection branch from 9efb459 to f70289d Compare July 22, 2024 15:09
@ManyTheFish ManyTheFish marked this pull request as ready for review July 22, 2024 15:09
@ManyTheFish ManyTheFish force-pushed the simplify-lang-detection branch from f70289d to 599bb3b Compare July 22, 2024 15:31
Comment on lines +356 to +361
assert_eq!(Language::Eng.code(), "eng");
assert_eq!(Language::from_code("eng"), Some(Language::Eng));
assert_eq!(Language::Jpn.code(), "jpn");
assert_eq!(Language::from_code("jpn"), Some(Language::Jpn));
assert_eq!(Language::Cmn.code(), "cmn");
assert_eq!(Language::from_code("cmn"), Some(Language::Cmn));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is relevant, but I thought we were using https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes. I would have expected just en for English, ja for Japanese, etc.

Looks like we're not using set 1, but the other sets? Is set 1 usable?

Copy link
Member Author

@ManyTheFish ManyTheFish Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These codes are inherited from whatlang, which follows the ISO-639-3, and the bigrams codes come from the ISO-639-1, which is an anterior version.
Is there any issue with using a 3-gram code?

Copy link
Contributor

@dureuill dureuill Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about user expection. Maybe that's just me, but as a user, I would expect my language codes to be "EN" or "en", not "ENG"

The best solution would be to support both en and eng.

charabia/src/tokenizer.rs Outdated Show resolved Hide resolved
charabia/src/tokenizer.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@dureuill dureuill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors merge

Copy link
Contributor

meili-bors bot commented Jul 25, 2024

Build succeeded:

@meili-bors meili-bors bot merged commit 9f27d85 into main Jul 25, 2024
4 checks passed
@meili-bors meili-bors bot deleted the simplify-lang-detection branch July 25, 2024 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants