Similarity module is broken #15345
Comments
You need to configure a custom similarity for all but the Default and BM25 similarities. You can see how to do so here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#configuration |
Reopening: we should complain if using a similarity that requires configuration, instead of just silently accepting it. |
I agree that a complaint would be helpful. Besides that, LMDirichlet has a parameter with a default value of 2000. That gives the impression that if I don't configure anything, the default value is used. I don't see why I need any configuration if I want to use the default value. |
Adding the configuration to the index settings helped. But now the scores that I get back are all zero for the LM similarity. Here are the steps to reproduce the problem:
{ "settings": { "similarity": { "LMSimilarity": { "type": "LMDirichlet", "mu": 2500 } } }, "mappings": { "item": { "properties": { "title": { "type": "string", "similarity": "LMSimilarity" } } } } }
{"title":"This is a test for search similarity when we search by other search options."}
{ "explain": "true", "query": { "match": { "title": "search" } } }
If you look at the returned scores, there are multiple weird numbers:
|
Since this issue was closed I'll open a new issue with the latest problem. |
Sorry, meant to reopen this. |
@clintongormley Did you get a chance to look at the LM scoring problem? I'm wondering if you have any suggestions about it or if there is a workaround. |
@keikha
protected float score(BasicStats stats, float freq, float docLen) {
  float score = stats.getBoost() * (float)(Math.log(1 + freq /
      (mu * ((LMStats)stats).getCollectionProbability())) +
      Math.log(mu / (docLen + mu)));
  return score > 0.0f ? score : 0.0f;
}
Because stats.getBoost() is always 1.0f in your case, the score is the sum of the term weight (the first log expression) and the document norm (the second log expression). In short, I think the data you used for testing is too small to produce meaningful statistics, or you have to adjust mu. |
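To make the point about small test data concrete, here is a minimal standalone sketch of the formula quoted above, with no Lucene dependency. The class name LmDirichletSketch and all input values are hypothetical. The boost is fixed at 1.0f, and the collection probability is passed in directly rather than computed by Lucene. The numbers assume a 14-token field in which the term occurs 3 times, which roughly matches the sample title; the document length Lucene actually decodes from its norms may differ.

```java
// Hypothetical standalone sketch of the quoted LMDirichlet formula (not Lucene's class).
class LmDirichletSketch {

    // Same arithmetic as the quoted score() method, with boost assumed to be 1.0f
    // and the collection probability supplied by the caller (an assumption).
    static float score(float freq, float docLen, float mu, float collectionProbability) {
        float score = (float) (Math.log(1 + freq / (mu * collectionProbability))
                + Math.log(mu / (docLen + mu)));
        // Negative sums are clamped to zero, exactly as in the quoted code.
        return score > 0.0f ? score : 0.0f;
    }

    public static void main(String[] args) {
        float mu = 2500f;     // value from the reproduction settings
        float freq = 3f;      // assumed term frequency in the document
        float docLen = 14f;   // assumed document length in tokens

        // Single-document index: the term is exactly as frequent in the
        // collection as in the document, so the two logs cancel out.
        System.out.println(score(freq, docLen, mu, freq / docLen));

        // Term more frequent in the collection than in the document:
        // the sum is negative and gets clamped to 0.
        System.out.println(score(freq, docLen, mu, 0.5f));

        // Term rare in the collection: the first log dominates and the
        // score is clearly positive.
        System.out.println(score(freq, docLen, mu, 1e-4f));
    }
}
```

If the collection probability equals freq / docLen, as it would in a one-document index without smoothing, the first log becomes log((mu + docLen) / mu) and cancels the second, so the score is zero up to rounding. A positive score needs a term that is noticeably more frequent in the document than in the collection as a whole.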
@tlmnw many thanks for diving into this and providing the answer. We should still throw an exception when trying to use a similarity that requires configuration without providing said config. |
@tlmnw @clintongormley Thank you for following up. I still have concerns about it, although I'm not sure whether I should mention them here since they are more related to Lucene. I don't know whether the ES team cares about problems in the underlying Lucene code.
|
@keikha Those issues should probably be raised on the [email protected] mailing list |
@clintongormley |
@tlmnw No, it may be me who has misunderstood. I thought that all similarities except default (now classic) and BM25 required config, but I may well be wrong? |
@tlmnw @clintongormley That was the reason I opened this ticket in the first place. |
Okay, got it. I'll try to fix this. |
@keikha |
@tlmnw I tried it with different versions including 1.7 and 2.1. |
It's true. This can be seen by doing the following:
|
I’m seeing that this issue is a bit stale. Is this still a problem? |
cc @elastic/es-search-aggs |
Actually closing it: Elasticsearch now fails on unknown similarities. Regarding scores that are equal to 0 in some cases, this is indeed a limitation of similarities that we should document: #29015 |
There are multiple similarity measures available, but apparently only the default one and BM25 work. I'm trying to use LMDirichlet similarity, creating a simple mapping such as:
ES ignores the LMDirichlet similarity and just uses the default one. Other similarity modules, such as DFR and LMJelinekMercer, have the same problem.