
feat: support lindera for japanese and korea tokenization #3218

Merged
merged 24 commits into lancedb:main on Jan 7, 2025

Conversation

chenkovsky
Contributor

@chenkovsky chenkovsky commented Dec 9, 2024

Lindera is the successor of MeCab. It supports multiple languages; currently CJK (Chinese, Japanese, Korean) are supported, and users can build their own models for their languages. Quickwit, Meilisearch, Qdrant and ParadeDB are also integrated with it.

Lindera supports two model-loading mechanisms:

  1. include the language model in the compiled binary.
  2. load the model from a given path.

I integrated it using the second approach, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update language models frequently if we want to do anything serious, so bundling a language model may not be a good idea.

In this PR I defined a LANCE_TOKENIZERS_HOME env variable. Users can put their models into this folder.
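
For context, a rough usage sketch from Python, assuming the env variable above and an inverted index that accepts a base tokenizer name; the base_tokenizer kwarg and the "lindera/ipadic" name are illustrative assumptions rather than something this PR pins down:

    import os
    import lance

    # Assumption: language models live under this directory, one subfolder per tokenizer.
    os.environ["LANCE_TOKENIZERS_HOME"] = "/opt/lance/tokenizers"

    ds = lance.dataset("/data/articles.lance")

    # Hypothetical call: build a full-text (inverted) index whose base tokenizer
    # is resolved from LANCE_TOKENIZERS_HOME, e.g. "lindera/ipadic" for Japanese.
    ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")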

@github-actions github-actions bot added the enhancement (New feature or request) and python labels on Dec 9, 2024
@chenkovsky
Contributor Author

@wjones127 Meilisearch uses whatlang-rs to detect the language and calls different tokenizers for different languages. Can we implement a similar function? It's very useful!
But I think it's not a good idea to couple language detection with tokenization. Modern language detection uses fastText or neural networks, which are hard to bundle into a compiled binary, and we would also want to update the language detection model frequently. Maybe we can pass an extra column that indicates the language of the text when we create the index.

for example:

create_index(..., language_column="column_name")
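
Purely as an illustration of that proposal (none of these names exist in Lance; it only shows routing each row's text to a tokenizer picked from the language column):

    # Hypothetical routing table from language code to tokenizer name.
    TOKENIZER_BY_LANGUAGE = {
        "ja": "lindera/ipadic",   # Japanese
        "ko": "lindera/ko-dic",   # Korean
        "zh": "jieba/default",    # Chinese
    }
    DEFAULT_TOKENIZER = "simple"  # fallback for languages without a dedicated model

    def tokenizer_for_row(language: str) -> str:
        """Pick a tokenizer name based on the value of the language column."""
        return TOKENIZER_BY_LANGUAGE.get(language, DEFAULT_TOKENIZER)

    # During indexing, each row's text would be tokenized with the tokenizer
    # selected by its language column value.
    rows = [("こんにちは世界", "ja"), ("hello world", "en")]
    for text, lang in rows:
        print(lang, "->", tokenizer_for_row(lang), ":", text)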

@codecov-commenter

codecov-commenter commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 0% with 172 lines in your changes missing coverage. Please review.

Project coverage is 78.77%. Comparing base (8585207) to head (0eba3ce).

Files with missing lines Patch % Lines
...lance-index/src/scalar/inverted/tokenizer/jieba.rs 0.00% 78 Missing ⚠️
...nce-index/src/scalar/inverted/tokenizer/lindera.rs 0.00% 59 Missing ⚠️
rust/lance-index/src/scalar/inverted/tokenizer.rs 0.00% 35 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3218      +/-   ##
==========================================
- Coverage   78.95%   78.77%   -0.18%     
==========================================
  Files         246      248       +2     
  Lines       87894    88066     +172     
  Branches    87894    88066     +172     
==========================================
- Hits        69395    69375      -20     
- Misses      15617    15811     +194     
+ Partials     2882     2880       -2     
Flag Coverage Δ
unittests 78.77% <0.00%> (-0.18%) ⬇️

Flags with carried forward coverage won't be shown.


Contributor

@wjones127 wjones127 left a comment

I think I'm okay with having this as an experimental feature. I think long term I want a plugin mechanism that could be shared with other tokenizers. See: #3222

@@ -141,9 +145,72 @@ fn build_base_tokenizer_builder(name: &str) -> Result<tantivy::tokenizer::TextAn
            tantivy::tokenizer::RawTokenizer::default(),
        )
        .dynamic()),
        #[cfg(feature = "tokenizer-lindera")]
        s if s.starts_with("lindera/") => {
Contributor

Above, the tokenizer names start with lindera-, not lindera/.

Contributor Author

@chenkovsky chenkovsky Dec 10, 2024

Sorry, I haven't updated my unit tests yet. My idea is to use a directory structure to manage different language models.

$LANCE_HOME/
    tokenizers/
        lindera/
            japanese-dic1/config.json
            japanese-dic2/config.json
            korea/config.json
        jieba/
            chinese/config.json

}

#[cfg(feature = "tokenizer-lindera")]
fn build_lindera_tokenizer_builder(dic: &str) -> Result<tantivy::tokenizer::TextAnalyzerBuilder> {
Contributor

What does dic stand for?

Contributor Author

This dic is used as a relative model path. It seems that the tokenizer name is written into the final index file, so we cannot use an absolute path; that's why I use a path relative to $LANCE_HOME.
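
For illustration, a minimal sketch of that resolution under the layout proposed above; the default home location and error handling are assumptions, only the main directory and user_config.json file names come from the code under review:

    import os
    from pathlib import Path

    def resolve_model_paths(tokenizer_name: str) -> tuple[Path, Path]:
        """Map a relative tokenizer name such as "lindera/ipadic" to the paths of its
        compiled dictionary and optional user config under the tokenizers home, so
        only the short relative name needs to be stored in the index."""
        # Assumption: default home if the env variable is unset.
        home = Path(os.environ.get("LANCE_HOME", Path.home() / ".lance")) / "tokenizers"
        model_dir = home / tokenizer_name             # e.g. <home>/lindera/ipadic
        main_dir = model_dir / "main"                 # compiled dictionary files
        user_config = model_dir / "user_config.json"  # optional user configuration
        if not main_dir.exists():
            raise FileNotFoundError(f"no language model found at {main_dir}")
        return main_dir, user_config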

Contributor

Could you use a better variable name? dictionary_path or lang_model_path or something similar? I don't think most people will understand what dic means.

        Some(p) => {
            let dic_dir = p.join(dic);
            let main_dir = dic_dir.join("main");
            let user_config_path = dic_dir.join("user_config.json");
Contributor

What's the schema of this JSON file?

Contributor Author

I have updated the implementation with a clear configuration schema.

Comment on lines 21 to 22
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
Contributor

Let's make this specific to lance + lindera

Suggested change
- pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
- pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
+ pub const LANCE_HOME_ENV_KEY: &str = "LANCE_LINDERA_HOME";
+ pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance_lindera";

Contributor Author

@chenkovsky chenkovsky Dec 10, 2024

Yes, a global home path may not be a good idea, so I created a draft. But I think the tokenizers can share the same home.
Different tokenizers are quite similar. They all contain:
1. a trie dictionary,
2. an n-gram model (for Chinese a 1-gram model is enough, so this part is sometimes absent),
and they all use the Viterbi algorithm, as sketched below.
That's why I think different tokenizers can share the same home. BTW, the code size of the tokenizers is quite small and has little effect on the final wheel package size.
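
To make the shared structure concrete, here is a toy, self-contained sketch of dictionary-plus-Viterbi segmentation (invented words and costs; real tokenizers such as Lindera and Jieba build a full lattice from compiled dictionaries and connection costs):

    # Toy dictionary: surface form -> unigram cost (lower is more likely).
    DICT = {"こんにちは": 2.0, "こん": 4.0, "にちは": 5.0, "世界": 2.5, "世": 6.0, "界": 6.0}
    UNKNOWN_COST = 10.0  # cost of emitting a single unknown character

    def segment(text: str) -> list[str]:
        """Viterbi over word-end positions: best[i] = min cost of segmenting text[:i]."""
        n = len(text)
        best = [0.0] + [float("inf")] * n
        back = [None] * (n + 1)  # back[i] = start index of the word ending at i
        for i in range(1, n + 1):
            for j in range(max(0, i - 8), i):  # cap candidate word length at 8 chars
                word = text[j:i]
                cost = DICT.get(word, UNKNOWN_COST if i - j == 1 else None)
                if cost is None:
                    continue
                if best[j] + cost < best[i]:
                    best[i], back[i] = best[j] + cost, j
        # Recover the segmentation by walking the backpointers.
        out, i = [], n
        while i > 0:
            out.append(text[back[i]:i])
            i = back[i]
        return out[::-1]

    print(segment("こんにちは世界"))  # expected: ['こんにちは', '世界']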

Contributor Author

I have moved the env variable into the lance-index package.

Contributor

Okay. Maybe not LANCE_LINDERA_HOME. Maybe LANCE_TOKENIZERS_HOME?

@wjones127
Contributor

Quickwit, Meilisearch, Qdrant and ParadeDB are also integrated with it.

Where do you see that? I just checked Qdrant and Meilisearch and neither shows any integration with Lindera. Could you provide reference links for those? I was looking because I wanted to see how they document this and make sure we are exposing it in a way that feels as easy or easier than what is shown there.

@wjones127
Contributor

I integrated it using the second approach, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update language models frequently

Where can users find a good up-to-date language model? I suspect providing easy instructions to get this would be very valuable for users.

Contributor

@wjones127 wjones127 left a comment

I think we should move forward with this PR. I think Lindera seems like a good solution. However, I would like to see clear documentation on how to use. I want to make sure we are making it easy to use, especially downstream in LanceDB.

@chenkovsky
Contributor Author

chenkovsky commented Dec 24, 2024

Quickwit, Meilisearch, Qdrant and ParadeDB are also integrated with it.

Where do you see that? I just checked Qdrant and Meilisearch and neither shows any integration with Lindera. Could you provide reference links for those? I was looking because I wanted to see how they document this and make sure we are exposing it in a way that feels as easy or easier than what is shown there.

Qdrant and Meilisearch both use this library, and this library uses Lindera:
https://github.com/meilisearch/charabia

But this library bundles the language models into the released binary; it lets users toggle features to enable different languages. In my tests, it also slows down compile time.

@chenkovsky
Contributor Author

chenkovsky commented Dec 24, 2024

I think we should move forward with this PR. I think Lindera seems like a good solution. However, I would like to see clear documentation on how to use. I want to make sure we are making it easy to use, especially downstream in LanceDB.

Maybe we can add a Python script to download open-source language models, for example:

python -m lance.tokenizers.download --dic lindera/ipadic
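
A minimal sketch of what such a downloader could do; the dictionary URL, archive format, and module layout below are hypothetical and not part of this PR:

    import argparse
    import io
    import os
    import tarfile
    import urllib.request
    from pathlib import Path

    # Hypothetical mapping from dictionary names to archive URLs.
    DIC_URLS = {"lindera/ipadic": "https://example.com/lindera-ipadic.tar.gz"}

    def download(dic: str) -> Path:
        """Fetch a pre-built dictionary and unpack it under LANCE_TOKENIZERS_HOME/<dic>."""
        home = Path(os.environ.get("LANCE_TOKENIZERS_HOME",
                                   Path.home() / ".lance" / "tokenizers"))
        target = home / dic
        target.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(DIC_URLS[dic]) as resp:
            archive = tarfile.open(fileobj=io.BytesIO(resp.read()), mode="r:gz")
            archive.extractall(target)
        return target

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--dic", required=True, help="e.g. lindera/ipadic")
        print("installed to", download(parser.parse_args().dic))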

@chenkovsky
Contributor Author

I integrated it using the second approach, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update language models frequently

Where can users find a good up-to-date language model? I suspect providing easy instructions to get this would be very valuable for users.

Language models are confidential for many companies.
They are one of the most important competitive advantages for traditional NLP companies.
So it's hard to obtain an up-to-date LM on the market.

Contributor Author

@wjones127 I added a script to download language models.

Contributor Author

I also added Jieba. As a next step I will build tiny language models and add them to the unit tests.

@chenkovsky chenkovsky marked this pull request as ready for review December 28, 2024 08:38
@chenkovsky
Contributor Author

I think we should move forward with this PR. I think Lindera seems like a good solution. However, I would like to see clear documentation on how to use. I want to make sure we are making it easy to use, especially downstream in LanceDB.

@wjones127 could you please review it again?

Contributor

@wjones127 wjones127 left a comment

Nice work! Looks good.

@wjones127 wjones127 merged commit 5a18b14 into lancedb:main Jan 7, 2025
26 of 27 checks passed