Disable tokenization #422
Replies: 6 comments 18 replies
-
Hello @trailerparkdev, thanks a lot for sharing this! @gmourier will come back to you if he needs more information :)
-
Hi @trailerparkdev 👋 I'm super curious to know why you don't let Meilisearch handle the tokenization instead of doing it on your side! Thanks
-
Hi @gmourier. I'm not really at liberty to discuss that, but I guess you can imagine that, depending on the use case, some data needs heavy pre-processing, and that some of this pre-processing may include use-case-specific tokenization, which in turn may involve stemming/lemmatizing, POS tagging, and other kinds of NLP preparation, especially on multilingual datasets.
Another angle to plead for disabling tokenization is that some use cases require complete control over tokenization (i.e. over its results), which calls for in-house tokenization. And since this process can be complex, and because classic tools of the trade like spaCy or NLTK are known to yield imperfect results even on specifically trained models, the only way you can commit to results for a customer use case is to master the process end to end. In such situations, MS' tokenization on top of this work may somehow break it.
Again, I perfectly understand the rationale behind your tokenization process, in that it enables many of MS' great features like typo tolerance, etc. It all makes perfect sense and gives developers a great experience if the use case matches. If it diverges, that's where we need options to disable automatic features like tokenization, in order to take charge of them ourselves and still benefit from the simplicity and efficiency of the rest of MS.
Have a nice day.
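For illustration, a rough sketch of the kind of in-house pre-processing pipeline described above, assuming Python with spaCy and its `en_core_web_sm` English model; the function name and the exact steps are hypothetical placeholders, not anyone's actual pipeline:

```python
import spacy

# Assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run.
nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> str:
    # Lemmatize, drop stopwords and punctuation, lowercase.
    # Stands in for the use-case-specific steps mentioned above
    # (custom rules, POS filtering, customer-based enhancements).
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc
              if not t.is_stop and not t.is_punct]
    return " ".join(tokens)

# The pre-tokenized string is what would be sent to the engine.
print(preprocess("The engines were running smoothly"))
# e.g. "engine run smoothly"
```

A search engine that re-tokenizes these strings with its own rules can undo parts of this work, which is the concern raised here.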
-
I subscribe to this need. In my case, I keep language-specific indexes; this would allow me to implement tokenization for whatever language without needing to modify Meilisearch itself. Also, the built-in tokenizers in Meilisearch may not be appropriate for some cases. For instance, for Chinese, Jieba can be configured with custom dictionaries. I rely on this in the rest of my system, but Meilisearch's Jieba-based tokenization uses the default dictionaries.
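As a concrete sketch of that difference: jieba in Python lets you load a custom dictionary before cutting, which Meilisearch's built-in Jieba-based tokenization does not expose (the dictionary file name here is hypothetical):

```python
import jieba

# userdict.txt: one term per line, optionally followed by a
# frequency and a part-of-speech tag, e.g. "云计算 5 n".
jieba.load_userdict("userdict.txt")

# Domain terms from the custom dictionary now come out as single tokens.
tokens = list(jieba.cut("云计算服务平台"))
print(tokens)
```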
-
Hi, the reason is that I have several fields (like vehicle license plates, article numbers, customer numbers, etc.) which should not be tokenized. My detailed use case can be found in this thread: meilisearch/meilisearch#3380
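To illustrate why such fields suffer from tokenization, here is a toy sketch of separator-based splitting. It only mimics the general behavior (Meilisearch's actual tokenizer, charabia, is more sophisticated), and the identifier values are made up:

```python
import re

def split_at_separators(value: str) -> list[str]:
    # Toy stand-in for separator-based tokenization.
    return [t for t in re.split(r"[\s\-./]+", value) if t]

print(split_at_separators("B-ML 4223"))    # ['B', 'ML', '4223']
print(split_at_separators("ART.0042-77"))  # ['ART', '0042', '77']
```

Once an identifier is split into fragments like these, an exact-match search on the full string no longer behaves as expected.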
-
Hi Guillaume,
Exactly. For typo correction, I already disabled it on fields like "article-nr", "license-plate", etc.
Having the same option to disable tokenization on attributes would be my awesome wish :-)
Sent: Tuesday, 31 January 2023 at 12:21
From: Guillaume Mourier
Subject: Re: [meilisearch/product] Disable tokenization (Discussion #422)
Thanks! FYI, you can already disable the typo correction for the field containing the article number. You want to use disableOnAttributes. https://blog.meilisearch.com/typo-tolerance/
So you prefer disabling the whole tokenization of a field instead of keeping it in a sparse way by customizing the separators for a given field?
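For reference, a minimal sketch of applying the `disableOnAttributes` setting Guillaume mentions, via Meilisearch's documented typo-tolerance settings route, assuming a local instance at `localhost:7700` and a hypothetical `vehicles` index:

```python
import requests

MEILI_URL = "http://localhost:7700"  # assumption: local dev instance
API_KEY = "masterKey"                # assumption: placeholder key

# Disable typo tolerance on exact-match fields such as article
# numbers and license plates.
resp = requests.patch(
    f"{MEILI_URL}/indexes/vehicles/settings/typo-tolerance",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"disableOnAttributes": ["article-nr", "license-plate"]},
)
resp.raise_for_status()
print(resp.json())  # an enqueued task summary
```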
-
Hi @curquiza.
This is to follow up on your request to open a discussion about our use case.
In a nutshell, we index strings that are first tokenized (stemming, stopword filtering, custom processing, and customer-based enhancements). The text out of which these strings are extracted comes from repositories that can be multilingual, and we "tokenize" accordingly.
These tokenized strings are the only searchable attributes in our documents. They are ingested into a dedicated index, which is our main index among several (the others are ancillary).
Of course, when we implement a search, we tokenize the query text the very same way. That's how matching is done.
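As a minimal sketch of that symmetry (the `tokenize` function is a hypothetical placeholder for the real pipeline):

```python
def tokenize(text: str) -> str:
    # Placeholder for the real pipeline: stemming, stopword
    # filtering, custom processing, customer-based enhancements.
    return " ".join(text.lower().split())

# Index time: documents store the pre-tokenized strings.
documents = [{"id": 1, "tokens": tokenize("Multilingual repositories need care")}]

# Query time: the query text gets the very same treatment,
# so matching happens on identical token strings.
query = tokenize("Multilingual repository")
```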
Based on what you told me, and from tests I ran, I don't see the mandatory tokenization on your side as a problem.
My question about disabling tokenization on your side was purely an engineering concern about resources, since in this case it may not add much value (unless I'm wrong).
Do you need anything more?
Have a nice day.
EDIT by @curquiza
Following: meilisearch/meilisearch#2257