Disable tokenization #422
Replies: 6 comments 18 replies
-
Hello @trailerparkdev, thanks a lot for sharing this! @gmourier will come back to you if he needs more information :)
-
Hi @trailerparkdev 👋 I'm super curious to know why you don't let Meilisearch handle the tokenization instead of doing it on your side! Thanks
-
Hi @gmourier. I'm not really at liberty to discuss that, but I guess you can imagine that, depending on the use case, some data needs heavy pre-processing, and that some of this pre-processing may include use-case-specific tokenization, which in turn may involve stemming/lemmatizing, POS tagging, and other kinds of NLP preparation, especially on multilingual datasets.
Another angle to plead for disabling tokenization is that some use cases require complete control over tokenization (i.e. over its results), which calls for in-house tokenization. And since this process can be complex, and because classic tools of the trade like spaCy or NLTK are known to yield imperfect results even on specifically trained models, the only way you can commit to results for a customer use case is to master the process end to end. In such situations, MS' tokenization on top of this work may somehow break it.
Again, I perfectly understand the rationale behind your tokenization process, in that it enables many of MS' great features like typo tolerance, etc. It all makes perfect sense and gives developers a great experience if the use case matches. If it diverges, that's where we need options to disable automatic features like tokenization, in order to take charge of them ourselves and still benefit from the simplicity and efficiency of the rest of MS.
Have a nice day.
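For illustration, a rough sketch of the kind of in-house pre-processing pipeline described above, assuming Python with spaCy and its `en_core_web_sm` English model; the function name and the exact steps are hypothetical placeholders, not anyone's actual pipeline:

```python
import spacy

# Assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run.
nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> str:
    # Lemmatize, drop stopwords and punctuation, lowercase.
    # Stands in for the use-case-specific steps mentioned above
    # (custom rules, POS filtering, customer-based enhancements).
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc
              if not t.is_stop and not t.is_punct]
    return " ".join(tokens)

# The pre-tokenized string is what would be sent to the engine.
print(preprocess("The engines were running smoothly"))
# e.g. "engine run smoothly"
```

A search engine that re-tokenizes these strings with its own rules can undo parts of this work, which is the concern raised here.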
-
I subscribe to this need. In my case, I keep language-specific indexes; this would allow me to implement tokenization for whatever language without needing to modify Meilisearch itself. Also, the built-in tokenizers in Meilisearch may not be appropriate for some cases. For instance, for Chinese, Jieba can be configured with custom dictionaries. I rely on this in the rest of my system, but Meilisearch's Jieba-based tokenization uses the default dictionaries.
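As a concrete sketch of that difference: jieba in Python lets you load a custom dictionary before cutting, which Meilisearch's built-in Jieba-based tokenization does not expose (the dictionary file name here is hypothetical):

```python
import jieba

# userdict.txt: one term per line, optionally followed by a
# frequency and a part-of-speech tag, e.g. "云计算 5 n".
jieba.load_userdict("userdict.txt")

# Domain terms from the custom dictionary now come out as single tokens.
tokens = list(jieba.cut("云计算服务平台"))
print(tokens)
```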
-
Hi, the reason is that I have several fields (like vehicle license plates, article numbers, customer numbers, etc.) which should not be tokenized. My detailed use case can be found in this thread: meilisearch/meilisearch#3380
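To illustrate why such fields suffer from tokenization, here is a toy sketch of separator-based splitting. It only mimics the general behavior (Meilisearch's actual tokenizer, charabia, is more sophisticated), and the identifier values are made up:

```python
import re

def split_at_separators(value: str) -> list[str]:
    # Toy stand-in for separator-based tokenization.
    return [t for t in re.split(r"[\s\-./]+", value) if t]

print(split_at_separators("B-ML 4223"))    # ['B', 'ML', '4223']
print(split_at_separators("ART.0042-77"))  # ['ART', '0042', '77']
```

Once an identifier is split into fragments like these, an exact-match search on the full string no longer behaves as expected.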
-
Hi Guillaume,
Exactly. For typo correction, I already disabled it on fields like "article-nr", "license-plate", etc.
Having the same option to disable tokenization on attributes would be my awesome wish :-)
Sent: Tuesday, 31 January 2023 at 12:21
From: Guillaume Mourier
Subject: Re: [meilisearch/product] Disable tokenization (Discussion #422)
Thanks! FYI, you can already disable the typo correction for the field containing the article number. You want to use disableOnAttributes. https://blog.meilisearch.com/typo-tolerance/
So you prefer disabling the whole tokenization of a field instead of keeping it in a sparse way by customizing the separators for a given field?
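For reference, a minimal sketch of applying the `disableOnAttributes` setting Guillaume mentions, via Meilisearch's documented typo-tolerance settings route, assuming a local instance at `localhost:7700` and a hypothetical `vehicles` index:

```python
import requests

MEILI_URL = "http://localhost:7700"  # assumption: local dev instance
API_KEY = "masterKey"                # assumption: placeholder key

# Disable typo tolerance on exact-match fields such as article
# numbers and license plates.
resp = requests.patch(
    f"{MEILI_URL}/indexes/vehicles/settings/typo-tolerance",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"disableOnAttributes": ["article-nr", "license-plate"]},
)
resp.raise_for_status()
print(resp.json())  # an enqueued task summary
```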
-
Hi @curquiza.
This is to follow up on your request to open a discussion about our use case.
In a nutshell, we index strings that are first tokenized (stemming, stopword filtering, custom processing, and customer-based enhancements). The text out of which these strings are extracted comes from repositories that can be multilingual, and we "tokenize" accordingly.
These tokenized strings are the only searchable attributes in our documents. They are ingested into a dedicated index, which is our main index among several (the others are ancillary).
Of course, when we implement a search, we tokenize the query text the very same way. That's how matching is done.
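As a minimal sketch of that symmetry (the `tokenize` function is a hypothetical placeholder for the real pipeline):

```python
def tokenize(text: str) -> str:
    # Placeholder for the real pipeline: stemming, stopword
    # filtering, custom processing, customer-based enhancements.
    return " ".join(text.lower().split())

# Index time: documents store the pre-tokenized strings.
documents = [{"id": 1, "tokens": tokenize("Multilingual repositories need care")}]

# Query time: the query text gets the very same treatment,
# so matching happens on identical token strings.
query = tokenize("Multilingual repository")
```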
Based on what you told me, and from tests I ran, I don't see the mandatory tokenization on your side as a problem.
My question about disabling tokenization on your side was purely an engineering concern about resources, since in this case it may not add much value (unless I'm wrong).
Do you need anything more?
Have a nice day.
EDIT by @curquiza
Following: meilisearch/meilisearch#2257