Similarity module is broken #15345

Closed
keikha opened this issue Dec 9, 2015 · 21 comments
Labels
>enhancement good first issue low hanging fruit help wanted adoptme :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@keikha

keikha commented Dec 9, 2015

There are multiple similarity measures available, but apparently only the default one and BM25 work. I'm trying to use LMDirichlet similarity, creating a simple mapping such as:

"mappings": {
  "item1": {
    "properties": {
      "title1": {
        "type": "string"
      }
    }
  },
  "item2": {
    "properties": {
      "title2": {
        "similarity": "BM25",
        "type": "string"
      }
    }
  },
  "item3": {
    "properties": {
      "title3": {
        "type": "string",
        "similarity": "LMDirichlet"
      }
    }
  }
}

ES ignores the LMDirichlet similarity and just uses the default one. Other similarity modules such as DFR and LMJelinekMercer have the same problem.

@clintongormley
Contributor

You need to configure a custom similarity for all but the Default and BM25 similarities. You can see how to do so here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#configuration
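For reference, here is a minimal sketch of what such a configuration might look like, written as a Python dict that serializes to an index-creation request body. The similarity name `my_lm_dirichlet`, the mapping type `item`, and the field name `title` are placeholders chosen for illustration, and `mu` is set to 2000, the default mentioned later in this thread. The mapping layout follows the 1.x/2.x style used elsewhere in this issue and may need adjusting for other versions.

```python
import json


def build_index_body(similarity_name="my_lm_dirichlet", mu=2000):
    """Build an index-creation body that declares a named LMDirichlet
    similarity in the settings and assigns it to a single field."""
    return {
        "settings": {
            "similarity": {
                # Hypothetical name; the field mapping below refers to it.
                similarity_name: {"type": "LMDirichlet", "mu": mu}
            }
        },
        "mappings": {
            "item": {
                "properties": {
                    "title": {"type": "string", "similarity": similarity_name}
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the body of the create-index request.
    print(json.dumps(build_index_body(), indent=2))
```

The point of the sketch is that the field's `similarity` value names an entry under `settings.similarity` rather than the similarity type itself.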

@clintongormley clintongormley added >enhancement good first issue low hanging fruit help wanted adoptme :Search Foundations/Mapping Index mappings, including merging and defining field types labels Dec 10, 2015
@clintongormley
Contributor

Reopening: we should complain if using a similarity that requires configuration, instead of just silently accepting it.

@keikha
Author

keikha commented Dec 10, 2015

I agree that a complaint would be helpful.

But besides that, LMDirichlet has a parameter with a default value of 2000. That gives the impression that if I don't configure anything, it should use the default value. I don't see why I need a configuration if I want to use the default value.

@keikha
Author

keikha commented Dec 10, 2015

Adding the configuration to the index settings helped. But now the scores that I get back are all zero for the LM similarity. Here are the steps to reproduce the problem:

  1. Created the index:

{
  "settings": {
    "similarity": {
      "LMSimilarity": {
        "type": "LMDirichlet",
        "mu": 2500
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "title": {
          "type": "string",
          "similarity": "LMSimilarity"
        }
      }
    }
  }
}

  2. Indexed two documents:

{"title":"This is a test for search similarity when we search by other search options."}
{"title":"Search looks weird when use other search possibilities. Numbers are not clear. Just adding new stuff to make the document longer. Document norm looks weird."}

  3. Ran a simple query:

{ "explain": "true", "query": { "match": { "title": "search" } } }

If you look at the returned scores, there are multiple weird numbers:

  1. The score for all documents is zero
  2. The collection probability, a term property that is independent from individual documents, is different for each document. I expect this number to be the same for all documents for a given term.
  3. Document norm has a negative value, probably it's the log of another number, but I can't match these numbers to the LM formula.

@keikha
Author

keikha commented Dec 11, 2015

Since this issue was closed I'll open a new issue with the latest problem.

@clintongormley
Contributor

Sorry, meant to reopen this.

@keikha
Author

keikha commented Jan 19, 2016

@clintongormley Did you get a chance to look at the LM scoring problem? I'm wondering if you have any suggestions about it or if there is a workaround.

@tlmnb

tlmnb commented Jan 19, 2016

@keikha
I've looked around a bit and I don't think this is a bug in Elasticsearch.
The computation is done by Apache Lucene's LMDirichletSimilarity. If you look at the code, you'll see that the score is computed in the score(BasicStats stats, float freq, float docLen) method:

protected float score(BasicStats stats, float freq, float docLen) {
  float score = stats.getBoost() * (float)(Math.log(1 + freq /
      (mu * ((LMStats)stats).getCollectionProbability())) +
      Math.log(mu / (docLen + mu)));
  return score > 0.0f ? score : 0.0f;
}

Because stats.getBoost() is always 1.0f in your case, the score is the sum of the term weight (the first log expression) and the document norm (the second log expression).
Since the latter is negative and the term weight is not large enough to offset it, the sum is negative and the method returns zero.
For one document of your data, docLen was 28.44 and freq was 2.0.

In short, I think the data you've used for testing is too small to produce meaningful statistics, or you have to adjust mu.
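To make the arithmetic concrete, here is a rough Python port of the formula above, a sketch rather than Lucene's actual code (Lucene computes in `float`, this uses Python doubles). The `freq`, `docLen`, and `mu` values come from this thread; the collection probability of 0.15 is an assumed, illustrative value, since the explain output is not quoted here.

```python
import math


def lm_dirichlet_score(freq, doc_len, collection_prob, mu=2500.0, boost=1.0):
    """Return (term_weight, doc_norm, score), mirroring the structure of the
    Lucene method quoted above, including the clamp to zero."""
    # First log expression: grows with the term frequency in the document.
    term_weight = math.log(1.0 + freq / (mu * collection_prob))
    # Second log expression: always negative for doc_len > 0.
    doc_norm = math.log(mu / (doc_len + mu))
    raw = boost * (term_weight + doc_norm)
    # Same clamp as `score > 0.0f ? score : 0.0f` in the Java code.
    return term_weight, doc_norm, max(raw, 0.0)


if __name__ == "__main__":
    # collection_prob=0.15 is an assumption made for this example only.
    tw, dn, score = lm_dirichlet_score(freq=2.0, doc_len=28.44, collection_prob=0.15)
    print("term weight:", tw)
    print("doc norm:   ", dn)
    print("score:      ", score)
```

With these inputs the positive term weight is smaller in magnitude than the negative document norm, so the clamp turns the result into zero.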

@clintongormley
Contributor

@tlmnw many thanks for diving into this and providing the answer. We should still throw an exception when trying to use a similarity that requires configuration without providing said config.

@keikha
Author

keikha commented Jan 20, 2016

@tlmnw @clintongormley Thank you for following it up.
I had another look and played with smaller mu values. As @tlmnw mentioned, the way the score is calculated assigns zero to any document in which the term has a lower probability than in the collection.
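A quick way to check that observation: multiplying the arguments of the two logs gives (mu + freq/p) / (docLen + mu), which exceeds 1 exactly when freq/docLen > p, so the sign of the unclamped score should not depend on mu. The sketch below tries a few mu values; the collection probabilities 0.15 and 0.05 and the list of mu values are assumptions for illustration, while freq = 2.0 and docLen = 28.44 are the values quoted earlier in the thread.

```python
import math


def raw_lm_dirichlet(freq, doc_len, collection_prob, mu):
    """Unclamped LMDirichlet score (boost of 1), before the floor at zero."""
    return (math.log(1.0 + freq / (mu * collection_prob))
            + math.log(mu / (doc_len + mu)))


def sign_by_mu(freq, doc_len, collection_prob,
               mus=(1.0, 10.0, 100.0, 2000.0, 2500.0)):
    """For each mu, report whether the unclamped score is positive."""
    return [raw_lm_dirichlet(freq, doc_len, collection_prob, mu) > 0.0
            for mu in mus]


if __name__ == "__main__":
    freq, doc_len = 2.0, 28.44
    print("term probability in document:", freq / doc_len)
    # Assumed collection probability above the document's term probability.
    print("p = 0.15:", sign_by_mu(freq, doc_len, 0.15))
    # Assumed collection probability below the document's term probability.
    print("p = 0.05:", sign_by_mu(freq, doc_len, 0.05))
```

If the derivation holds, changing mu alters the magnitude of the score but never moves a document across the zero threshold.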

I still have concerns about it, although I'm not sure if I should mention them here since they relate more to Lucene. I don't know if the ES team cares about problems in the underlying Lucene code.

  1. When I look at the explained score, there are different values for the collection probability. I expect this to be independent of documents. This causes the term weight to have totally weird values.
  2. I think it is really bad not to distinguish between two documents just because neither has a high enough term frequency. In my case one document is clearly more relevant than the other, but both get a score of zero.

@adrianocrestani

@keikha Those issues should probably be raised on the [email protected] mailing list.

@tlmnb

tlmnb commented Jan 20, 2016

@clintongormley
I've probably misunderstood something, but why should we throw an error if no configuration was given, since all SimilarityProviders have default values? Or should we raise a warning?

@clintongormley
Contributor

@tlmnw No, it may be me who has misunderstood. I thought that all similarities except default (now classic) and BM25 required config, but I may well be wrong.

@keikha
Author

keikha commented Jan 21, 2016

@tlmnw @clintongormley That was the reason I opened this ticket in the first place.
If no configuration is provided, ES ignores the similarity module and uses the default one (even if you want to use the default parameters). As @clintongormley mentioned, this is not the case for BM25.

@tlmnb

tlmnb commented Jan 21, 2016

Okay, got it. I'll try to fix this.

@tlmnb

tlmnb commented Jan 21, 2016

@keikha
Which version of elasticsearch do you use?

@keikha
Author

keikha commented Jan 21, 2016

@tlmnw I tried it with different versions including 1.7 and 2.1.

@clintongormley
Contributor

It's true. This can be seen by doing the following:

PUT t
{
  "mappings": {
    "item1": {
      "properties": {
        "title1": {
          "type": "string"
        }
      }
    },
    "item2": {
      "properties": {
        "title2": {
          "similarity": "BM25",
          "type": "string"
        }
      }
    },
    "item3": {
      "properties": {
        "title3": {
          "type": "string",
          "similarity": "LMDirichlet"
        }
      }
    }
  }
}

GET _mapping/field/title*?include_defaults

@bryanhanner

I’m seeing that this issue is a bit stale. Is this still a problem?

@jpountz
Contributor

jpountz commented Mar 13, 2018

cc @elastic/es-search-aggs

@jpountz
Contributor

jpountz commented Mar 13, 2018

Actually closing it: Elasticsearch now fails on unknown similarities. Regarding scores that are equal to 0 in some cases, this is indeed a limitation of similarities that we should document: #29015

@jpountz jpountz closed this as completed Mar 13, 2018
@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
7 participants