From 174fd84e76f5cc67f304f65a68456b982fb7ca40 Mon Sep 17 00:00:00 2001
From: Adrien Grand
Date: Thu, 15 Mar 2018 14:55:41 +0100
Subject: [PATCH] Improve similarity docs.

This adds links to the relevant Lucene javadocs and warnings regarding
similarities that might return 0 as a score.

Close #29015
---
 .../index-modules/similarity.asciidoc | 53 +++++++++++++++----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/docs/reference/index-modules/similarity.asciidoc b/docs/reference/index-modules/similarity.asciidoc
index 85ca9e0cea369..87a9f7dc4a27e 100644
--- a/docs/reference/index-modules/similarity.asciidoc
+++ b/docs/reference/index-modules/similarity.asciidoc
@@ -97,22 +97,38 @@ similarity has the following option:
 Type name: `classic`
 
 [float]
-[[drf]]
+[[dfr]]
 ==== DFR similarity
 
 Similarity that implements the
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
 from randomness] framework. This similarity has the following options:
 
 [horizontal]
 `basic_model`::
-    Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
+    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelBE.html[`be`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelD.html[`d`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`] and
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelP.html[`p`].
+
+`be`, `d` and `p` should be avoided in practice as they might return scores that
+are equal to 0 or infinite with terms that do not meet the expected random
+distribution.
 
 `after_effect`::
-    Possible values: `no`, `b` and `l`.
+    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffect.NoAfterEffect.html[`no`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].
 
 `normalization`::
-    Possible values: `no`, `h1`, `h2`, `h3` and `z`.
+    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].
 
 All options but the first option need a normalization value.
 
@@ -127,7 +143,14 @@ model.
 This similarity has the following options:
 
 [horizontal]
-`independence_measure`:: Possible values `standardized`, `saturated`, `chisquared`.
+`independence_measure`:: Possible values:
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`] and
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].
+
+When using this similarity, it is highly recommended to remove stop words to get
+good relevance. Also beware that terms whose frequency is less than the expected
+frequency will get a score equal to 0.
 
 Type name: `DFI`
 
@@ -135,15 +158,19 @@ Type name: `DFI`
 [[ib]]
 ==== IB similarity.
 
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
 based model] .
 The algorithm is based on the concept that the information content in any symbolic 'distribution'
 sequence is primarily determined by the repetitive usage of its basic elements. For written texts this challenge would
 correspond to comparing the writing styles of different authors. This similarity has the following options:
 
 [horizontal]
-`distribution`:: Possible values: `ll` and `spl`.
-`lambda`:: Possible values: `df` and `ttf`.
+`distribution`:: Possible values:
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
+`lambda`:: Possible values:
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
+    {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
 `normalization`:: Same as in `DFR` similarity.
 
 Type name: `IB`
@@ -152,19 +179,23 @@ Type name: `IB`
 [[lm_dirichlet]]
 ==== LM Dirichlet similarity.
 
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
 Dirichlet similarity] . This similarity has the following options:
 
 [horizontal]
 `mu`:: Default to `2000`.
 
+The scoring formula in the paper assigns negative scores to terms that have
+fewer occurrences than predicted by the language model, which Lucene does not
+allow, so such terms get a score of 0.
+
 Type name: `LMDirichlet`
 
 [float]
 [[lm_jelinek_mercer]]
 ==== LM Jelinek Mercer similarity.
 
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
 Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
 
 [horizontal]
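
Note on the LM Dirichlet warning added by this patch: the sketch below illustrates why a term with fewer occurrences than the language model predicts ends up with a score of 0. It assumes the usual Dirichlet-smoothed formula and is not Lucene's code; the function name `lm_dirichlet_score` and the parameters `doc_len` and `collection_prob` are hypothetical names chosen for illustration, and real scoring may differ in details such as boosts.

```python
import math


def lm_dirichlet_score(freq, doc_len, collection_prob, mu=2000.0):
    """Illustrative Dirichlet-smoothed language-model score, clamped at 0.

    freq            -- occurrences of the term in the document
    doc_len         -- number of tokens in the document
    collection_prob -- probability of the term in the whole collection (assumed > 0)
    mu              -- smoothing parameter (the docs above give 2000 as the default)
    """
    raw = math.log(1.0 + freq / (mu * collection_prob)) + math.log(mu / (doc_len + mu))
    # The raw value is negative exactly when freq < doc_len * collection_prob,
    # i.e. when the term occurs less often than the model predicts.
    # Negative scores are not allowed, so they are replaced by 0.
    return max(raw, 0.0)


# Expected occurrences here are 1000 * 0.01 = 10.
below_expected = lm_dirichlet_score(freq=1, doc_len=1000, collection_prob=0.01)
above_expected = lm_dirichlet_score(freq=50, doc_len=1000, collection_prob=0.01)
```

With these inputs the first call is clamped to 0 because the term appears once where about ten occurrences are predicted, while the second call yields a positive score.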