Expose duplicate removal in the completion suggester #26496

jimczi · 2017-09-04T15:19:53Z

This change exposes the duplicate removal option added in Lucene for the completion suggester
with a new option called skip_duplicates (defaults to false).
This commit also adapts the custom suggest collector to handle deduplication when multiple contexts match the input.

Closes #23364

This change exposes the duplicate removal option added in Lucene for the completion suggester with a new option called `skip_duplicates` (defaults to false). This commit also adapts the custom suggest collector to handle deduplication when multiple contexts match the input. Closes elastic#23364

cbuescher

@jimczi I took a look out of interest how this feature is implemented in Lucene. My knowledge of how the documents collector works is limited, but the rest looks good to me. Left a few nits and questions, those mostly because I'm not too familiar with the underlying feature.

cbuescher · 2017-09-05T09:20:03Z

core/src/main/java/org/elasticsearch/search/suggest/completion/CompletionSuggestion.java

    public CompletionSuggestion() {
    }

-    public CompletionSuggestion(String name, int size) {
+    public CompletionSuggestion(String name, int size, boolean skipDuplicates) {


I always get confused, is this a user-facing API? In that case, should we keep the original constructor with "false" as the default? If not all good.

cbuescher · 2017-09-05T09:26:32Z

core/src/main/java/org/elasticsearch/search/suggest/completion/CompletionSuggestionBuilder.java

+        return skipDuplicates;
+    }
+
+    public CompletionSuggestionBuilder skipDuplicates(boolean skipDuplicates) {


nit: maybe add some javadoc since this is what most users see that use the java api

cbuescher · 2017-09-05T09:26:41Z

core/src/main/java/org/elasticsearch/search/suggest/completion/CompletionSuggestionBuilder.java

@@ -210,6 +223,15 @@ private CompletionSuggestionBuilder contexts(XContentBuilder contextBuilder) {
        return this;
    }

+    public boolean skipDuplicates() {


nit: maybe add some javadoc since this is what most users see that use the java api

cbuescher · 2017-09-05T09:48:52Z

core/src/main/java/org/elasticsearch/search/suggest/completion/CompletionSuggester.java

+     * This collector is also able to filter duplicate suggestion coming from different documents.
+     * When different contexts match the same suggestion form only the best one (sorted by weight) is kept.
+     * In order to keep this feature fast, the de-duplication of suggestions with different contexts is done
+     * only on the top N*num_contexts suggestions per segment. This means that skip_duplicates will visit at most N*num_contexts suggestions


question for my own understanding: what is N here?

The completion suggester is document based so N here is the number of documents to return.

cbuescher · 2017-09-05T09:51:07Z

core/src/main/java/org/elasticsearch/search/suggest/completion/CompletionSuggester.java

@@ -150,93 +164,57 @@ void add(CharSequence key, CharSequence context, float score) {
            }
        }

-        private static final class SuggestDocPriorityQueue extends PriorityQueue<SuggestDoc> {


For my own understanding, can you briefly explain what the queue did here before and why this can now be done in TopSuggestDocsCollector (at least thats what it looks to me)

Prior to this change the custom collector was used to group the suggestions coming from different contexts per document.
The TopSuggestDocsCollector is not document based and can return suggestions coming from different documents so this custom collector was needed. What changed is that we record the document in a map during the collection and if a document was already visited we just hide it from the parent collector. When collection is finished we get the top N suggestions from the parent collector and re-group them per document. Because we ensured that a document can only be visited once, the N suggestions returned by the parent collector are all coming from different docs and we just need to link the skipped suggestions to the final top N documents.
Early termination is done the same except that the decision is taken by the parent collector.

…hes the input with different suggestions within the same context

jimczi · 2017-09-05T12:02:19Z

Thanks @cbuescher
I pushed some changes to address your comments and questions

cbuescher

Thanks, LGTM. Unless you want somebody with deeper Lucene knowledge to take another look at the collector changes, that is.

jimczi · 2017-09-05T13:37:31Z

Thanks @cbuescher !
I think it needs another pair of eyes, it's not a big change but the completion suggester is touchy.

jpountz

LGTM!

… tests. The completion suggester has a `shard_size` option that sets the size of the suggestions to retrieve per shard but it is ignored by the builder. This commit restores the handling of this option and fixes a test that can randomly fail without it.

…GeoBoosting This test should not rely on strict ordering for same score suggestions. The Lucene completion suggester uses the doc id in case of a tie and documents are indexed randomly.

This change exposes the duplicate removal option added in Lucene for the completion suggester with a new option called `skip_duplicates` (defaults to false). This commit also adapts the custom suggest collector to handle deduplication when multiple contexts match the input. Closes #23364

…rflow * origin/master: (59 commits) Fix Lucene version of 5.6.1. Remove azure deprecated settings (elastic#26099) Handle the 5.6.0 release Allow plugins to validate cluster-state on join (elastic#26595) Remove index mapper dynamic settings (elastic#25734) update AWS SDK for ECS Task IAM support in discovery-ec2 (elastic#26479) Azure repository: Accelerate the listing of files (used in delete snapshot) (elastic#25710) Build: Remove norelease from forbidden patterns (elastic#26592) Fix reference to painless inside expression engine (elastic#26528) Build: Move javadoc linking to root build.gradle (elastic#26529) Test: Remove leftover static bwc test case (elastic#26584) Docs: Remove remaining references to file and native scripts (elastic#26580) Snapshot fallback should consider build.snapshot elastic#26496: Set the correct bwc version after backport to 6.x Fix the MapperFieldType.rangeQuery API. (elastic#26552) Deduplicate `_field_names`. (elastic#26550) [Docs] Update method setSource(byte[] source) (elastic#26561) [Docs] Fix typo in javadocs (elastic#26556) Allow multiple digits in Vagrant 2.x minor versions Support Vagrant 2.x ...

jimczi added :Search Relevance/Suggesters "Did you mean" and suggestions as you type >feature v6.1.0 v7.0.0 labels Sep 4, 2017

skip bwc rest tests for now

450804f

cbuescher reviewed Sep 5, 2017

View reviewed changes

jimczi added 4 commits September 5, 2017 13:37

after review

f625085

javadoc

beedb08

fix grouping per doc in the completion suggester when a document matc…

114e078

…hes the input with different suggestions within the same context

javadoc

a53e904

cbuescher approved these changes Sep 5, 2017

View reviewed changes

jpountz approved these changes Sep 7, 2017

View reviewed changes

jimczi merged commit d68d8c9 into elastic:master Sep 7, 2017

jimczi deleted the feature/completion_suggester_skip_duplicates branch September 7, 2017 15:11

jimczi added the backport pending label Sep 7, 2017

jpountz mentioned this pull request Sep 11, 2017

Autocomplete suggestion no longer removes duplicate entries as in ES 2.3 !! #22912

Closed

jimczi removed the backport pending label Sep 11, 2017

jimczi added a commit that referenced this pull request Sep 11, 2017

#26496: Set the correct bwc version after backport to 6.x

c62b019

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose duplicate removal in the completion suggester #26496

Expose duplicate removal in the completion suggester #26496

jimczi commented Sep 4, 2017

cbuescher left a comment

cbuescher Sep 5, 2017

cbuescher Sep 5, 2017

cbuescher Sep 5, 2017

cbuescher Sep 5, 2017

jimczi Sep 5, 2017

cbuescher Sep 5, 2017

jimczi Sep 5, 2017

jimczi commented Sep 5, 2017

cbuescher left a comment

jimczi commented Sep 5, 2017

jpountz left a comment

Expose duplicate removal in the completion suggester #26496

Expose duplicate removal in the completion suggester #26496

Conversation

jimczi commented Sep 4, 2017

cbuescher left a comment

Choose a reason for hiding this comment

cbuescher Sep 5, 2017

Choose a reason for hiding this comment

cbuescher Sep 5, 2017

Choose a reason for hiding this comment

cbuescher Sep 5, 2017

Choose a reason for hiding this comment

cbuescher Sep 5, 2017

Choose a reason for hiding this comment

jimczi Sep 5, 2017

Choose a reason for hiding this comment

cbuescher Sep 5, 2017

Choose a reason for hiding this comment

jimczi Sep 5, 2017

Choose a reason for hiding this comment

jimczi commented Sep 5, 2017

cbuescher left a comment

Choose a reason for hiding this comment

jimczi commented Sep 5, 2017

jpountz left a comment

Choose a reason for hiding this comment