Improve similarity docs. #29089

Merged: 2 commits, Mar 21, 2018
53 changes: 42 additions & 11 deletions docs/reference/index-modules/similarity.asciidoc
@@ -97,22 +97,38 @@ similarity has the following option:
Type name: `classic`

[float]
[[drf]]
[[dfr]]
==== DFR similarity

Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelBE.html[`be`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelD.html[`d`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelP.html[`p`].

`be`, `d` and `p` should be avoided in practice as they might return scores that
are equal to 0 or infinite with terms that do not meet the expected random
distribution.

`after_effect`::
Possible values: `no`, `b` and `l`.
Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffect.NoAfterEffect.html[`no`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].

`normalization`::
Possible values: `no`, `h1`, `h2`, `h3` and `z`.
Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].

All options but the first option need a normalization value.
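
As an illustration only (this snippet is not part of the change shown here), a
custom DFR similarity might be declared in the index settings roughly as
follows. The index name `my_index` and the similarity name `my_dfr` are
placeholders, and `normalization.h2.c` is assumed to be the key that carries
the normalization value for `h2`:

[source,js]
----
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_dfr": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
----

The sketch picks `g` rather than `be`, `d` or `p`, in line with the warning
above about scores of 0 or infinity.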

@@ -127,23 +143,34 @@ model.
This similarity has the following options:

[horizontal]
`independence_measure`:: Possible values `standardized`, `saturated`, `chisquared`.
`independence_measure`:: Possible values
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].

When using this similarity, it is highly recommended to remove stop words to get
good relevance. Also beware that terms whose frequency is less than the expected
frequency will get a score equal to 0.

Type name: `DFI`

[float]
[[ib]]
==== IB similarity.

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model] . The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`:: Possible values: `ll` and `spl`.
`lambda`:: Possible values: `df` and `ttf`.
`distribution`:: Possible values:
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
`lambda`:: Possible values:
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`
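
A minimal sketch of an IB similarity definition, assuming the same settings
layout as the other custom similarities. The names `my_index` and `my_ib` are
placeholders, and the `h2` normalization value is an arbitrary choice made for
illustration:

[source,js]
----
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_ib": {
          "type": "IB",
          "distribution": "ll",
          "lambda": "df",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
----
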
@@ -152,19 +179,23 @@ Type name: `IB`
[[lm_dirichlet]]
==== LM Dirichlet similarity.

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity] . This similarity has the following options:

[horizontal]
`mu`:: Defaults to `2000`.

The scoring formula in the paper assigns negative scores to terms that have
fewer occurrences than predicted by the language model. Lucene does not allow
negative scores, so such terms get a score of 0.

Type name: `LMDirichlet`
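
For illustration, an LM Dirichlet similarity with an explicit `mu` could be
configured along these lines. The names `my_index` and `my_lm_dirichlet` are
placeholders, and the value shown is simply the documented default:

[source,js]
----
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_dirichlet": {
          "type": "LMDirichlet",
          "mu": 2000
        }
      }
    }
  }
}
----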

[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity.

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

[horizontal]