Don't use stopwords in new ranking
I don't think that stopwords are helping us currently; they force us to
add workarounds for some cases (e.g., "form AN"), and standard weighting
measures should ensure that common words like stopwords aren't given
undue prominence.  If we find that keeping stopwords causes a problem
with ranking, we should change the weighting algorithm to one that has
better compensation for common words (such as BM25F).
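
As a rough illustration of why standard weighting already discounts common words, here is a minimal Ruby sketch of a BM25-style inverse document frequency over a made-up corpus. The corpus, the `idf` helper, and the formula variant are assumptions for illustration only; they are not taken from this repository or its Elasticsearch configuration.

```ruby
# Toy corpus (hypothetical): each document is a list of lowercased tokens.
DOCS = [
  %w[apply for the tax form],
  %w[renew the passport],
  %w[the driving licence form],
  %w[find the nearest office],
].freeze

# BM25-style inverse document frequency. The "1 +" inside the log keeps
# the weight positive even for terms that occur in every document.
def idf(term, docs)
  n  = docs.size
  df = docs.count { |doc| doc.include?(term) }
  Math.log(1 + (n - df + 0.5) / (df + 0.5))
end

# "the" occurs in every document, so it gets a much smaller weight than
# "form" (two documents) or "passport" (one document).
%w[the form passport].each do |term|
  puts format("%-10s %.3f", term, idf(term, DOCS))
end
```

A term that appears everywhere contributes little to the score, which is the property the paragraph above relies on instead of dropping stopwords outright.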

In order not to change the existing ranking, this also indexes
`searchable_text` fields into a `.no_stop` sub-field, which the new
ranking queries.  The `all_searchable_text` field isn't used by the
existing ranking, so this just removes stopwording from that field's
default analyzer.
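
To make the "form AN" case concrete, the following Ruby sketch mimics a lowercase-then-stop analysis chain with and without the stop filter. The stopword list and the `analyze` helper are hypothetical and heavily simplified; the real analyzers also apply stemming and character filters, which are omitted here.

```ruby
# Abbreviated, hypothetical stopword list for illustration only.
STOPWORDS = %w[a an and the of for to].freeze

# Simplified analysis: lowercase, split on non-alphanumerics, and
# optionally drop stopwords (roughly what the "stop" filter does).
def analyze(text, remove_stopwords:)
  tokens = text.downcase.scan(/[a-z0-9]+/)
  remove_stopwords ? tokens.reject { |t| STOPWORDS.include?(t) } : tokens
end

p analyze("form AN", remove_stopwords: true)   # => ["form"]
p analyze("form AN", remove_stopwords: false)  # => ["form", "an"]
```

With the stop filter in place, the form's name is lost at index and query time, so "form AN" cannot be distinguished from any other form; the `.no_stop` sub-field keeps the token.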
Richard Boulton committed May 28, 2015
1 parent f55006f commit 0fe6e52
Showing 3 changed files with 16 additions and 3 deletions.
6 changes: 6 additions & 0 deletions config/schema/elasticsearch_schema.yml
@@ -8,6 +8,12 @@ index:
filter: [standard, lowercase, stop, stemmer_override, stemmer_english]
char_filter: [normalize_quotes, strip_quotes]

+searchable_text:
+type: custom
+tokenizer: standard
+filter: [standard, lowercase, stemmer_override, stemmer_english]
+char_filter: [normalize_quotes, strip_quotes]
+
# Analyzer used at index time for the .synonym variants of searchable
# text fields.
with_index_synonyms:
7 changes: 7 additions & 0 deletions config/schema/field_types.json
@@ -39,6 +39,12 @@
"include_in_all": true,
"copy_to": ["spelling_text", "all_searchable_text"],
"fields": {
+"no_stop": {
+"type": "string",
+"index": "analyzed",
+"include_in_all": false,
+"analyzer": "searchable_text"
+},
"synonym": {
"type": "string",
"index": "analyzed",
@@ -55,6 +61,7 @@
"es_config": {
"type": "string",
"index": "analyzed",
+"analyzer": "searchable_text",
"include_in_all": false,
"fields": {
"synonym": {
6 changes: 3 additions & 3 deletions lib/query_components/text_query.rb
@@ -77,23 +77,23 @@ def field_boosts_words
# Return the highest weight found by looking for a word-based match in
# individual fields
MATCH_FIELDS.map { |field_name, boost|
-match_query(field_name, search_term, boost: boost)
+match_query("#{field_name}.no_stop", search_term, boost: boost)
}
end

def field_boosts_phrase
# Return the highest weight found by looking for a phrase match in
# individual fields
MATCH_FIELDS.map { |field_name, boost|
-match_query(field_name, search_term, type: :phrase, boost: boost)
+match_query("#{field_name}.no_stop", search_term, type: :phrase, boost: boost)
}
end

def field_boosts_all_terms
# Return the highest weight found by looking for a match of all terms in
# individual fields
MATCH_FIELDS.map { |field_name, boost|
-match_query(field_name, search_term, type: :boolean, operator: :and, boost: boost)
+match_query("#{field_name}.no_stop", search_term, type: :boolean, operator: :and, boost: boost)
}
end
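
For readers unfamiliar with how a sub-field is targeted, here is a hedged sketch of the kind of query clause the changed calls might produce. The definition of `match_query` is not shown in this diff, so `no_stop_match_clause` is a hypothetical stand-in, and the field names and boosts in `EXAMPLE_MATCH_FIELDS` are invented; only the `"#{field_name}.no_stop"` naming comes from the change itself.

```ruby
# Hypothetical field boosts; the real MATCH_FIELDS values are not shown here.
EXAMPLE_MATCH_FIELDS = { "title" => 5, "description" => 2 }.freeze

# Hypothetical stand-in for match_query: builds an Elasticsearch-style
# match clause against the .no_stop sub-field of the given field.
def no_stop_match_clause(field_name, query, boost:, **options)
  {
    match: {
      "#{field_name}.no_stop" => { query: query, boost: boost }.merge(options)
    }
  }
end

# One phrase clause per field, each pointing at the sub-field that was
# indexed without stopword removal.
clauses = EXAMPLE_MATCH_FIELDS.map { |field_name, boost|
  no_stop_match_clause(field_name, "form AN", type: :phrase, boost: boost)
}

clauses.each { |clause| p clause }
```

Because the sub-field lives alongside the original field, the existing ranking can keep querying `title` while the new ranking queries `title.no_stop`.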
