feat: support lindera for Japanese and Korean tokenization #3218
Conversation
@wjones127 meilisearch uses whatlang-rs to detect language and calls different tokenizers for different languages. Can we implement a similar function? It's very useful! For example:
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3218      +/-   ##
==========================================
- Coverage   78.95%   78.77%    -0.18%
==========================================
  Files         246      248        +2
  Lines       87894    88066      +172
  Branches    87894    88066      +172
==========================================
- Hits        69395    69375       -20
- Misses      15617    15811      +194
+ Partials     2882     2880        -2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
I think I'm okay with having this as an experimental feature. I think long term I want a plugin mechanism that could be shared with other tokenizers. See: #3222
```rust
@@ -141,9 +145,72 @@ fn build_base_tokenizer_builder(name: &str) -> Result<tantivy::tokenizer::TextAnalyzerBuilder>
            tantivy::tokenizer::RawTokenizer::default(),
        )
        .dynamic()),
        #[cfg(feature = "tokenizer-lindera")]
        s if s.starts_with("lindera/") => {
```
Above, the tokenizer names start with lindera-, not lindera/.
Sorry, I haven't updated my unit tests yet. My idea is to use a directory structure to manage the different language models:
```
$LANCE_HOME/
  tokenizers/
    lindera/
      japanese-dic1/config.json
      japanese-dic2/config.json
      korea/config.json
    jieba/
      chinese/config.json
```
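As a sketch of the lookup this layout implies (the layout is still a draft in this thread, so the helper name and exact join order here are illustrative, not the PR's actual code): a tokenizer name such as "lindera/japanese-dic1", as stored in the index, would be resolved against the home directory at load time.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: map a tokenizer name from the index to its config
/// file under the tokenizers home directory, following the draft layout
/// "$LANCE_HOME/tokenizers/<tokenizer>/<model>/config.json".
fn tokenizer_config_path(lance_home: &Path, name: &str) -> PathBuf {
    // `name` is a relative path like "lindera/japanese-dic1", so a plain
    // join produces the nested directory shown in the draft layout.
    lance_home.join("tokenizers").join(name).join("config.json")
}
```

Because only the relative name is written into the index, the same index stays usable when the home directory moves between machines.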
```rust
}

#[cfg(feature = "tokenizer-lindera")]
fn build_lindera_tokenizer_builder(dic: &str) -> Result<tantivy::tokenizer::TextAnalyzerBuilder> {
```
What does dic stand for?
dic is used as a relative model path. It seems the tokenizer name gets written into the final index file, so we cannot use an absolute path; instead I use a path relative to $LANCE_HOME.
Could you use a better variable name? dictionary_path or lang_model_path or something similar? I don't think most people will understand what dic means.
```rust
Some(p) => {
    let dic_dir = p.join(dic);
    let main_dir = dic_dir.join("main");
    let user_config_path = dic_dir.join("user_config.json");
```
What's the schema of this JSON file?
I have updated the implementation, with a clear configuration schema.
rust/lance-core/src/lib.rs
Outdated
```rust
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
```
Let's make this specific to lance + lindera
```
- pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
- pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
+ pub const LANCE_HOME_ENV_KEY: &str = "LANCE_LINDERA_HOME";
+ pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance_lindera";
```
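A constant pair like this is typically consumed by an env-var lookup with a fallback. As a hedged sketch (the helper name is hypothetical, and the env value is passed in as a parameter rather than read with std::env::var so the logic stays testable; the thread later settles on LANCE_TOKENIZERS_HOME as the variable name):

```rust
use std::path::PathBuf;

/// Hypothetical resolver: prefer the directory named by the env variable,
/// otherwise fall back to the default directory name. The caller would pass
/// std::env::var(LANCE_HOME_ENV_KEY).ok().as_deref() as `env_value`.
fn resolve_tokenizers_home(env_value: Option<&str>) -> PathBuf {
    match env_value {
        // Explicit override wins, e.g. LANCE_TOKENIZERS_HOME=/opt/models.
        Some(v) => PathBuf::from(v),
        // Otherwise use the crate's default directory name (illustrative).
        None => PathBuf::from("lance_lindera"),
    }
}
```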
Yes, a global home path may not be a good idea, so I created a draft. But I think tokenizers can share the same home.
Different tokenizers are quite similar; they all contain:
1. a trie dictionary,
2. an n-gram model (for Chinese a 1-gram model is enough, so sometimes this part is absent).
They all use the Viterbi algorithm.
That's why I think different tokenizers can share the same home. BTW, the code size of the tokenizers is quite small and has little effect on the final wheel package size.
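The shared pipeline described above (dictionary lookup plus 1-gram costs resolved with the Viterbi algorithm) can be illustrated with a toy sketch. This is not lindera's or jieba's actual code; the dictionary, costs, and function name are made up for illustration:

```rust
use std::collections::HashMap;

/// Toy unigram Viterbi segmenter: pick the segmentation whose summed word
/// costs are minimal. Real tokenizers use a trie for lookup; a HashMap of
/// candidate words is enough to show the dynamic program.
fn segment<'a>(text: &'a str, dict: &HashMap<&str, f64>) -> Vec<&'a str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let n = chars.len();
    // best[i] = (cost of the best segmentation of the first i chars,
    //            start index of the last word in that segmentation)
    let mut best: Vec<(f64, usize)> = vec![(f64::INFINITY, 0); n + 1];
    best[0] = (0.0, 0);
    for end in 1..=n {
        for start in 0..end {
            let byte_start = chars[start].0;
            let byte_end = if end == n { text.len() } else { chars[end].0 };
            let word = &text[byte_start..byte_end];
            if let Some(&cost) = dict.get(word) {
                let total = best[start].0 + cost;
                if total < best[end].0 {
                    best[end] = (total, start);
                }
            }
        }
    }
    // Backtrack over the recorded word starts to recover the boundaries.
    // (If no path exists, this degenerates to the whole text as one token.)
    let mut words = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = best[end].1;
        let byte_start = chars[start].0;
        let byte_end = if end == n { text.len() } else { chars[end].0 };
        words.push(&text[byte_start..byte_end]);
        end = start;
    }
    words.reverse();
    words
}
```

With costs for "研究" and "生命" lower than for "研究生" plus "命", the segmenter prefers the two-word reading of "研究生命", which is exactly the ambiguity the 1-gram model exists to resolve.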
I have moved the env variable into the lance-index package.
Okay. Maybe not LANCE_LINDERA_HOME. Maybe LANCE_TOKENIZERS_HOME?
Where do you see that? I just checked Qdrant and Meilisearch and neither shows any integration with Lindera. Could you provide reference links for those? I was looking because I wanted to see how they document this and to make sure we expose it in a way that feels as easy or easier than what is shown there.
Where can users find a good, up-to-date language model? I suspect providing easy instructions for this would be very valuable for users.
I think we should move forward with this PR. Lindera seems like a good solution. However, I would like to see clear documentation on how to use it. I want to make sure we are making it easy to use, especially downstream in LanceDB.
Qdrant and Meilisearch both use this library, but that library bundles language models into the released binary; it lets users toggle features to enable different languages. In my tests, it also slows down compile time.
Maybe we can add a Python script to download open-source language models.
Also, language models are confidential at many companies.
python/python/lance/lm.py
Outdated
@wjones127 I added a script to download language models.
I also added jieba. Next, I will build tiny language models and add them to the unit tests.
0d3d34c to a25db78 (Compare)
@wjones127 could you please review it again?
Nice work! Looks good.
Lindera is the successor of MeCab; it supports multiple languages, currently CJK (Chinese, Japanese, Korean). Users can also build their own models for their languages. Quickwit, Meilisearch, Qdrant, and ParadeDB are also integrated with it.
Lindera supports two model-loading mechanisms: bundling a dictionary into the binary at compile time (via Cargo features), or loading a dictionary from a path at runtime.
I integrated with the second one because the default language model is very old (by the way, jieba's default model is also very old). If we want to do serious work, we have to update the language model frequently, so bundling the language model may not be a good idea.
In this PR I defined a LANCE_TOKENIZERS_HOME env variable; users can put their models into this folder.