Similarity module is broken #15345

keikha · 2015-12-09T19:06:44Z

There are multiple similarity measures available, but apparently only the default one and BM25 work. I'm trying to use the LMDirichlet similarity by creating a simple mapping such as:

"mappings": {
  "item1": {
    "properties": {
      "title1": {
        "type": "string"
      }
    }
  },
  "item2": {
    "properties": {
      "title2": {
        "similarity": "BM25",
        "type": "string"
      }
    }
  },
  "item3": {
    "properties": {
      "title3": {
        "type": "string",
        "similarity": "LMDirichlet"
      }
    }
  }
}

ES ignores the LMDirichlet similarity and just uses the default one. Other similarity modules such as DFR and LMJelinekMercer have the same problem.

clintongormley · 2015-12-10T12:36:02Z

You need to configure a custom similarity for all but the Default and BM25 similarities. You can see how to do so here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#configuration

clintongormley · 2015-12-10T13:05:41Z

Reopening: we should complain if using a similarity that requires configuration, instead of just silently accepting it.

keikha · 2015-12-10T15:18:29Z

I agree that a complaint would be helpful.

But besides that, LMDirichlet has a parameter with a default value of 2000. That gives the impression that if I don't configure anything, it should use the default value. I don't see why I need configuration if I want to use the default value.

keikha · 2015-12-10T16:33:23Z

Adding the configuration to the index settings helped. But now the scores that I get back are all zero for the LM similarity. Here are the steps to reproduce the problem:

Created the index:

{
  "settings": {
    "similarity": {
      "LMSimilarity": {
        "type": "LMDirichlet",
        "mu": 2500
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "title": {
          "type": "string",
          "similarity": "LMSimilarity"
        }
      }
    }
  }
}

Indexed two documents:

{"title":"This is a test for search similarity when we search by other search options."}
{"title":"Search looks weird when use other search possibilities. Numbers are not clear. Just adding new stuff to make the document longer. Document norm looks weird."}

Ran a simple query:

{ "explain": "true", "query": { "match": { "title": "search" } } }

If you look at the returned scores, several numbers look wrong:

The score for all documents is zero.
The collection probability, a term property that is independent of individual documents, is different for each document. I expect this number to be the same for all documents for a given term.
The document norm has a negative value. It is probably the log of another number, but I can't match these numbers to the LM formula.

keikha · 2015-12-11T16:25:01Z

Since this issue was closed I'll open a new issue with the latest problem.

clintongormley · 2015-12-14T18:19:51Z

Sorry, meant to reopen this.

keikha · 2016-01-19T18:06:19Z

@clintongormley Did you get a chance to look at the LM scoring problem? I'm wondering if you have any suggestions about it or if there is a workaround.

tlmnb · 2016-01-19T22:46:11Z

@keikha
I've looked around a bit and I don't think that this is a bug related to elasticsearch.
The computation is done by Apache Lucene's LMDirichletSimilarity. If you have a look inside the code, you'll see that the score is computed in the score(BasicStats stats, float freq, float docLen) method:

protected float score(BasicStats stats, float freq, float docLen) {
    float score = stats.getBoost() * (float)(Math.log(1 + freq /
        (mu * ((LMStats)stats).getCollectionProbability())) +
        Math.log(mu / (docLen + mu)));
    return score > 0.0f ? score : 0.0f;
}

Because stats.getBoost() is always 1.0f in your case, the score is the sum of the term weight (the first log-expression) and the document norm (the second log-expression).
Since the latter is negative and the term weight is not big enough to offset it, the sum is negative and the method returns zero.
For one document of your data, docLen was 28.44 and freq was 2.0.

In short, I think that the data you've used for testing is too small to produce good statistics, or you have to adjust mu.
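
To make the arithmetic concrete, here is a minimal standalone sketch that repeats the two log-expressions of the quoted method outside Lucene. The class and method names (LmDirichletSketch, termWeight, documentNorm) are made up for illustration, the boost is fixed at 1.0, and the collection probability of 0.1 is a hypothetical value, not one taken from the explain output. Only mu = 2500, freq = 2.0 and docLen = 28.44 come from this thread.

```java
// Illustrative sketch only: repeats the arithmetic of the quoted score(...)
// method outside Lucene, with the boost fixed at 1.0.
class LmDirichletSketch {

    // First log-expression: the term weight.
    static double termWeight(float mu, float freq, float collectionProbability) {
        return Math.log(1 + freq / (mu * collectionProbability));
    }

    // Second log-expression: the document norm (never positive for docLen >= 0).
    static double documentNorm(float mu, float docLen) {
        return Math.log(mu / (docLen + mu));
    }

    // Sum of both parts, clamped at zero as in the quoted method.
    static float score(float mu, float freq, float docLen, float collectionProbability) {
        float raw = (float) (termWeight(mu, freq, collectionProbability)
                + documentNorm(mu, docLen));
        return raw > 0.0f ? raw : 0.0f;
    }

    public static void main(String[] args) {
        float mu = 2500f;      // from the index settings posted in this thread
        float freq = 2.0f;     // reported in this thread
        float docLen = 28.44f; // reported in this thread
        float p = 0.1f;        // HYPOTHETICAL collection probability

        System.out.println("term weight   = " + termWeight(mu, freq, p));
        System.out.println("document norm = " + documentNorm(mu, docLen));
        System.out.println("final score   = " + score(mu, freq, docLen, p));
    }
}
```

With these inputs the term weight is roughly 0.008 and the document norm roughly -0.011, so the sum is negative and the clamp returns 0. Lowering the assumed collection probability to 0.01 makes the term weight large enough for a positive score.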

clintongormley · 2016-01-20T14:52:08Z

@tlmnw many thanks for diving into this and providing the answer. We should still throw an exception when trying to use a similarity that requires configuration without providing said config.

keikha · 2016-01-20T17:17:59Z

@tlmnw @clintongormley Thank you for following it up.
I had another look and played with smaller mu values. As @tlmnw mentioned, the way the score is calculated assigns zero to any document in which the term has a lower probability than in the collection.
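
That condition can be derived from the quoted score method. The derivation below is a sketch that assumes a boost of 1 and uses symbols that are not in the original code: f for freq, \ell for docLen, and p_C for the collection probability (with \ell > 0 and p_C > 0).

```latex
\begin{align*}
\log\!\left(1 + \frac{f}{\mu\, p_C}\right) + \log\!\left(\frac{\mu}{\ell + \mu}\right) > 0
  &\iff \left(1 + \frac{f}{\mu\, p_C}\right)\frac{\mu}{\ell + \mu} > 1 \\
  &\iff \mu + \frac{f}{p_C} > \ell + \mu \\
  &\iff \frac{f}{\ell} > p_C
\end{align*}
```

So the score is positive only when the term's relative frequency in the document, f / \ell, exceeds its collection probability. Notably, mu drops out of this condition, so changing mu changes the size of a positive score but not which documents are clamped to zero.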

I still have concerns about it, although I'm not sure whether I should mention them here since they are more related to Lucene. I don't know whether the ES team cares about underlying Lucene problems.

When I look at the explained score, there are different values for the collection probability. I expect this to be independent of documents. This causes the term weight to have totally weird values.
I think it is really bad not to distinguish between two documents just because neither has a high enough term frequency. In my case one document is clearly more relevant than the other, but both of them get a score of zero.

adrianocrestani · 2016-01-20T17:26:37Z

@keikha Those issues should probably be raised at [email protected] mailing list

tlmnb · 2016-01-20T21:28:10Z

@clintongormley
I've probably misunderstood something, but why should we throw an error if no configuration was given, since all SimilarityProviders have default values? Or should we raise a warning?

clintongormley · 2016-01-21T14:06:31Z

@tlmnw no, it may be me who has misunderstood. I thought that all similarities except default (now classic) and BM25 required config, but I may well be wrong.

keikha · 2016-01-21T15:06:26Z

@tlmnw @clintongormley That was the reason I opened this ticket in the first place.
If no configuration is provided, ES ignores the similarity module and uses the default one (even if you want to use the default parameters). As @clintongormley mentioned, this is not the case for BM25.

tlmnb · 2016-01-21T19:34:36Z

Okay, got it. I'll try to fix this.

tlmnb · 2016-01-21T20:22:03Z

@keikha
Which version of elasticsearch do you use?

keikha · 2016-01-21T20:43:32Z

@tlmnw I tried it with different versions including 1.7 and 2.1.

clintongormley · 2016-01-22T11:34:15Z

It's true. This can be seen by doing the following:

PUT t
{
  "mappings": {
    "item1": {
      "properties": {
        "title1": {
          "type": "string"
        }
      }
    },
    "item2": {
      "properties": {
        "title2": {
          "similarity": "BM25",
          "type": "string"
        }
      }
    },
    "item3": {
      "properties": {
        "title3": {
          "type": "string",
          "similarity": "LMDirichlet"
        }
      }
    }
  }
}

GET _mapping/field/title*?include_defaults

bryanhanner · 2017-12-22T08:59:51Z

I’m seeing that this issue is a bit stale. Is this still a problem?

jpountz · 2018-03-13T13:57:23Z

cc @elastic/es-search-aggs

jpountz · 2018-03-13T14:07:43Z

Actually closing it: Elasticsearch now fails on unknown similarities. Regarding scores that are equal to 0 in some cases, this is indeed a limitation of similarities that we should document: #29015

clintongormley closed this as completed Dec 10, 2015

clintongormley added the >enhancement, good first issue, low hanging fruit, help wanted, adoptme, and :Search Foundations/Mapping labels Dec 10, 2015

clintongormley reopened this Dec 14, 2015

clintongormley mentioned this issue Feb 14, 2016

LM Similarity returns all zero scores #15397

Closed

jpountz closed this as completed Mar 13, 2018

javanna added the Team:Search Foundations label Jul 16, 2024
