
feat: support lindera for japanese and korea tokenization #3218

Merged
merged 24 commits into lancedb:main on Jan 7, 2025

Conversation

chenkovsky
Contributor

@chenkovsky chenkovsky commented Dec 9, 2024

Lindera is the successor of MeCab. It supports multiple languages; currently CJK (Chinese, Japanese, Korean) are supported, and users can build their own models for their languages. Quickwit, Meilisearch, Qdrant and ParadeDB are also integrated with it.

Lindera supports two model-loading mechanisms:

  1. include the language model in the compiled binary.
  2. load the model from a given path.

I integrated it using the second approach, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update language models frequently if we want to do anything serious, so bundling a language model may not be a good idea.

In this PR I defined a LANCE_TOKENIZERS_HOME env variable. Users can put their models into this folder.
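
For context, a rough usage sketch from Python, assuming the env variable above and an inverted index that accepts a base tokenizer name; the base_tokenizer kwarg and the "lindera/ipadic" name are illustrative assumptions rather than something this PR pins down:

    import os
    import lance

    # Assumption: language models live under this directory, one subfolder per tokenizer.
    os.environ["LANCE_TOKENIZERS_HOME"] = "/opt/lance/tokenizers"

    ds = lance.dataset("/data/articles.lance")

    # Hypothetical call: build a full-text (inverted) index whose base tokenizer
    # is resolved from LANCE_TOKENIZERS_HOME, e.g. "lindera/ipadic" for Japanese.
    ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")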

@github-actions github-actions bot added the enhancement (New feature or request) and python labels on Dec 9, 2024
@chenkovsky
Contributor Author

@wjones127 Meilisearch uses whatlang-rs to detect the language and calls different tokenizers for different languages. Can we implement a similar function? It's very useful!
But I think it's not a good idea to couple language detection with tokenization. Modern language detection uses fastText or neural networks, which are hard to bundle into a compiled binary, and we would also want to update the language detection model frequently. Maybe we can pass an extra column that indicates the language of the text when we create the index.

for example:

create_index(..., language_column="column_name")
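
Purely as an illustration of that proposal (none of these names exist in Lance; it only shows routing each row's text to a tokenizer picked from the language column):

    # Hypothetical routing table from language code to tokenizer name.
    TOKENIZER_BY_LANGUAGE = {
        "ja": "lindera/ipadic",   # Japanese
        "ko": "lindera/ko-dic",   # Korean
        "zh": "jieba/default",    # Chinese
    }
    DEFAULT_TOKENIZER = "simple"  # fallback for languages without a dedicated model

    def tokenizer_for_row(language: str) -> str:
        """Pick a tokenizer name based on the value of the language column."""
        return TOKENIZER_BY_LANGUAGE.get(language, DEFAULT_TOKENIZER)

    # During indexing, each row's text would be tokenized with the tokenizer
    # selected by its language column value.
    rows = [("こんにちは世界", "ja"), ("hello world", "en")]
    for text, lang in rows:
        print(lang, "->", tokenizer_for_row(lang), ":", text)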

@codecov-commenter

codecov-commenter commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 0% with 172 lines in your changes missing coverage. Please review.

Project coverage is 78.77%. Comparing base (8585207) to head (0eba3ce).

Files with missing lines Patch % Lines
...lance-index/src/scalar/inverted/tokenizer/jieba.rs 0.00% 78 Missing ⚠️
...nce-index/src/scalar/inverted/tokenizer/lindera.rs 0.00% 59 Missing ⚠️
rust/lance-index/src/scalar/inverted/tokenizer.rs 0.00% 35 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3218      +/-   ##
==========================================
- Coverage   78.95%   78.77%   -0.18%     
==========================================
  Files         246      248       +2     
  Lines       87894    88066     +172     
  Branches    87894    88066     +172     
==========================================
- Hits        69395    69375      -20     
- Misses      15617    15811     +194     
+ Partials     2882     2880       -2     
Flag Coverage Δ
unittests 78.77% <0.00%> (-0.18%) ⬇️

Flags with carried forward coverage won't be shown.


Contributor

@wjones127 wjones127 left a comment

I think I'm okay with having this as an experimental feature. I think long term I want a plugin mechanism that could be shared with other tokenizers. See: #3222

@@ -141,9 +145,72 @@ fn build_base_tokenizer_builder(name: &str) -> Result<tantivy::tokenizer::TextAn
            tantivy::tokenizer::RawTokenizer::default(),
        )
        .dynamic()),
        #[cfg(feature = "tokenizer-lindera")]
        s if s.starts_with("lindera/") => {
Contributor

Above, the tokenizer names start with lindera-, not lindera/.

Contributor Author

@chenkovsky chenkovsky Dec 10, 2024

Sorry, I haven't updated my unit tests yet. My idea is to use a directory structure to manage different language models.

$LANCE_HOME/
    tokenizers/
        lindera/
            japanese-dic1/config.json
            japanese-dic2/config.json
            korea/config.json
        jieba/
            chinese/config.json

}

#[cfg(feature = "tokenizer-lindera")]
fn build_lindera_tokenizer_builder(dic: &str) -> Result<tantivy::tokenizer::TextAnalyzerBuilder> {
Contributor

What does dic stand for?

Contributor Author

This dic is used as a relative model path. It seems that the tokenizer name is written into the final index file, so we cannot use an absolute path; that's why I use a path relative to $LANCE_HOME.
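
For illustration, a minimal sketch of that resolution under the layout proposed above; the default home location and error handling are assumptions, only the main directory and user_config.json file names come from the code under review:

    import os
    from pathlib import Path

    def resolve_model_paths(tokenizer_name: str) -> tuple[Path, Path]:
        """Map a relative tokenizer name such as "lindera/ipadic" to the paths of its
        compiled dictionary and optional user config under the tokenizers home, so
        only the short relative name needs to be stored in the index."""
        # Assumption: default home if the env variable is unset.
        home = Path(os.environ.get("LANCE_HOME", Path.home() / ".lance")) / "tokenizers"
        model_dir = home / tokenizer_name             # e.g. <home>/lindera/ipadic
        main_dir = model_dir / "main"                 # compiled dictionary files
        user_config = model_dir / "user_config.json"  # optional user configuration
        if not main_dir.exists():
            raise FileNotFoundError(f"no language model found at {main_dir}")
        return main_dir, user_config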

Contributor

Could you use a better variable name? dictionary_path or lang_model_path or something similar? I don't think most people will understand what dic means.

        Some(p) => {
            let dic_dir = p.join(dic);
            let main_dir = dic_dir.join("main");
            let user_config_path = dic_dir.join("user_config.json");
Contributor

What's the schema of this JSON file?

Contributor Author

I have updated the implementation with a clear configuration schema.

Comment on lines 21 to 22
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
Contributor

Let's make this specific to lance + lindera

Suggested change
- pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
- pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
+ pub const LANCE_HOME_ENV_KEY: &str = "LANCE_LINDERA_HOME";
+ pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance_lindera";

Contributor Author

@chenkovsky chenkovsky Dec 10, 2024

Yes, a global home path may not be a good idea, so I created a draft. But I think the tokenizers can share the same home.
Different tokenizers are quite similar. They all contain:
1. a trie dictionary,
2. an n-gram model (for Chinese a 1-gram model is enough, so this part is sometimes absent),
and they all use the Viterbi algorithm, as sketched below.
That's why I think different tokenizers can share the same home. BTW, the code size of the tokenizers is quite small and has little effect on the final wheel package size.
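
To make the shared structure concrete, here is a toy, self-contained sketch of dictionary-plus-Viterbi segmentation (invented words and costs; real tokenizers such as Lindera and Jieba build a full lattice from compiled dictionaries and connection costs):

    # Toy dictionary: surface form -> unigram cost (lower is more likely).
    DICT = {"こんにちは": 2.0, "こん": 4.0, "にちは": 5.0, "世界": 2.5, "世": 6.0, "界": 6.0}
    UNKNOWN_COST = 10.0  # cost of emitting a single unknown character

    def segment(text: str) -> list[str]:
        """Viterbi over word-end positions: best[i] = min cost of segmenting text[:i]."""
        n = len(text)
        best = [0.0] + [float("inf")] * n
        back = [None] * (n + 1)  # back[i] = start index of the word ending at i
        for i in range(1, n + 1):
            for j in range(max(0, i - 8), i):  # cap candidate word length at 8 chars
                word = text[j:i]
                cost = DICT.get(word, UNKNOWN_COST if i - j == 1 else None)
                if cost is None:
                    continue
                if best[j] + cost < best[i]:
                    best[i], back[i] = best[j] + cost, j
        # Recover the segmentation by walking the backpointers.
        out, i = [], n
        while i > 0:
            out.append(text[back[i]:i])
            i = back[i]
        return out[::-1]

    print(segment("こんにちは世界"))  # expected: ['こんにちは', '世界']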

Contributor Author

I have moved the env variable into the lance-index package.

Contributor

Okay. Maybe not LANCE_LINDERA_HOME. Maybe LANCE_TOKENIZERS_HOME?

@wjones127
Contributor

Quickwit, Meilisearch, Qdrant and ParadeDB are also integrated with it.

Where do you see that? I just checked Qdrant and Meilisearch and neither shows any integration with Lindera. Could you provide reference links for those? I was looking because I wanted to see how they document this and make sure we are exposing it in a way that feels as easy or easier than what is shown there.

@wjones127
Contributor

I integrated it using the second approach, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update language models frequently

Where can users find a good up-to-date language model? I suspect providing easy instructions to get this would be very valuable for users.

Contributor

@wjones127 wjones127 left a comment

I think we should move forward with this PR. I think Lindera seems like a good solution. However, I would like to see clear documentation on how to use. I want to make sure we are making it easy to use, especially downstream in LanceDB.

@chenkovsky
Contributor Author

chenkovsky commented Dec 24, 2024

Quickwit, Meilisearch, Qdrant and ParadeDB are also integrated with it.

Where do you see that? I just checked Qdrant and Meilisearch and neither shows any integration with Lindera. Could you provide reference links for those? I was looking because I wanted to see how they document this and make sure we are exposing it in a way that feels as easy or easier than what is shown there.

Qdrant and Meilisearch both use this library, and this library uses Lindera:
https://github.com/meilisearch/charabia

But this library bundles the language models into the released binary; it lets users toggle features to enable different languages. In my tests, it also slows down compile time.

@chenkovsky
Contributor Author

chenkovsky commented Dec 24, 2024

I think we should move forward with this PR. I think Lindera seems like a good solution. However, I would like to see clear documentation on how to use. I want to make sure we are making it easy to use, especially downstream in LanceDB.

Maybe we can add a Python script to download open-source language models, for example:

python -m lance.tokenizers.download --dic lindera/ipadic
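
A minimal sketch of what such a downloader could do; the dictionary URL, archive format, and module layout below are hypothetical and not part of this PR:

    import argparse
    import io
    import os
    import tarfile
    import urllib.request
    from pathlib import Path

    # Hypothetical mapping from dictionary names to archive URLs.
    DIC_URLS = {"lindera/ipadic": "https://example.com/lindera-ipadic.tar.gz"}

    def download(dic: str) -> Path:
        """Fetch a pre-built dictionary and unpack it under LANCE_TOKENIZERS_HOME/<dic>."""
        home = Path(os.environ.get("LANCE_TOKENIZERS_HOME",
                                   Path.home() / ".lance" / "tokenizers"))
        target = home / dic
        target.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(DIC_URLS[dic]) as resp:
            archive = tarfile.open(fileobj=io.BytesIO(resp.read()), mode="r:gz")
            archive.extractall(target)
        return target

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--dic", required=True, help="e.g. lindera/ipadic")
        print("installed to", download(parser.parse_args().dic))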

@chenkovsky
Contributor Author

I integrated it using the second approach, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update language models frequently

Where can users find a good up-to-date language model? I suspect providing easy instructions to get this would be very valuable for users.

Language models are confidential for many companies.
They are one of the most important competitive advantages for traditional NLP companies.
So it's hard to obtain an up-to-date LM on the market.

Contributor Author

@wjones127 I added a script to download language models.

Contributor Author

I also added Jieba. As a next step I will build tiny language models and add them to the unit tests.

@chenkovsky chenkovsky marked this pull request as ready for review December 28, 2024 08:38
@chenkovsky
Contributor Author

I think we should move forward with this PR. I think Lindera seems like a good solution. However, I would like to see clear documentation on how to use. I want to make sure we are making it easy to use, especially downstream in LanceDB.

@wjones127 could you please review it again?

Contributor

@wjones127 wjones127 left a comment

Nice work! Looks good.

@wjones127 wjones127 merged commit 5a18b14 into lancedb:main Jan 7, 2025
26 of 27 checks passed