Expose proximity boosting #39385

mayya-sharipova · 2019-02-26T02:05:00Z

Expose DistanceFeatureQuery for geo, date_nanos and date types

elasticmachine · 2019-02-26T02:05:40Z

Pinging @elastic/es-search

Expose DistanceFeatureQuery for long, geo and date types Closes elastic#33382

cbuescher

@mayya-sharipova this looks like a really cool and useful feature. I left a couple of comments, most of them minor. My biggest suggestion would be to avoid untyped (Object) origin and pivot fields and instead create a very simple inner class that holds this information in one of the allowed types (long, String, Date). This way we can control which values the user can set and catch accidental misuse early.
I was also wondering whether it makes sense to add a few integration tests (something extending ESIntegTestCase) to check that the scoring works as advertised (e.g. boosts are applied correctly, pivot works as expected, etc.). Then again, I guess this is already covered by the underlying Lucene queries, so maybe it's not that useful, but I wanted to bring it up as a suggestion.
Other than that, great feature. Looking forward to using it in 7.1.

cbuescher · 2019-02-27T10:51:25Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+ways to modify the score, this query has the benefit of being able to
+efficiently skip non-competitive hits when
+<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be
+spectacular.


Nice, do we have experiments/numbers/blogs to back this up? No need to change if we haven't but I was wondering if we could add anything in case we have it.

@cbuescher I copied this phrase from the rank_feature query, which uses the same optimizations, and we also have a blog post on this. The blog post links to Lucene benchmarks, but adding a link to those benchmarks here looks excessive.

I'd maybe drop the last sentence with spectacular since it's going to be a bit less efficient than rank_feature due to the dynamic nature of the feature (it's computed on the fly).

cbuescher · 2019-02-27T10:52:03Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+
+[horizontal]
+`field`::
+    Required parameter. Defines a name of the field on which to calculate


nit: s/a/the

cbuescher · 2019-02-27T10:52:48Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+`field`::
+    Required parameter. Defines a name of the field on which to calculate
+    distances. Must be a field of type `long`, `date`, or `geo_point`,
+    and must be indexed and has <<doc-values, doc values>>.


maybe "must be indexed using <<doc-values, doc values>>" or something similar instead of another "and" clause.

Mayya refers to the fact that the field needs both index: true and doc_values: true. Maybe we could be more explicit by saying e.g. "[...] and must be indexed (index: true, which is the default) and have doc values (doc_values: true, which is the default too)".

cbuescher · 2019-02-27T10:55:37Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+--------------------------------------------------
+// CONSOLE
+
+We look for all chocolate items, but we also want chocolates


nit: Maybe in all three cases start with "We can look", or "We now want to look" to make it less repetitive?

cbuescher · 2019-02-27T10:56:29Z

docs/reference/query-dsl/special-queries.asciidoc

+<<query-dsl-distance-feature-query,`distance_feature` query>>::
+
+A query that computes scores based on the dynamically computed distances
+between the origin and documents' long numeric, geo or distance fields.


maybe "geo-point" instead of "geo" like above

should it mention dates?

cbuescher · 2019-02-27T13:06:57Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+    @Override
+    protected void doWriteTo(StreamOutput out) throws IOException {
+        out.writeString(field);
+        out.writeGenericValue(origin);


this could move to an inner class if we added them for Origin and Pivot.

cbuescher · 2019-02-27T13:10:03Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+            long pivotLong = (Long) pivot;
+            return LongPoint.newDistanceFeatureQuery(field, boost, originLong, pivotLong);
+        } else if (fieldType instanceof GeoPointFieldType) {
+            GeoPoint originGeoPoint = (origin instanceof GeoPoint)? (GeoPoint) origin : GeoUtils.parseFromString((String) origin);


nit: space between ")?"

cbuescher · 2019-02-27T13:10:22Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+        if (fieldType instanceof DateFieldType) {
+            long originLong = (origin instanceof Long) ? (Long) origin :
+                ((DateFieldType) fieldType).parseToLong(origin, true, null, null, context);
+            TimeValue val = TimeValue.parseTimeValue((String)pivot, TimeValue.timeValueHours(24),


nit: space between "(String)pivot"

cbuescher · 2019-02-27T13:14:53Z

server/src/test/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilderTests.java

+        switch (field) {
+            case GEO_POINT_FIELD_NAME:
+                origin = new GeoPoint(randomDouble(), randomDouble());
+                origin = randomBoolean()? origin : ((GeoPoint) origin).geohash();


nit: spaces

cbuescher · 2019-02-27T13:15:06Z

server/src/test/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilderTests.java

+        float boost = queryBuilder.boost;
+        final Query expectedQuery;
+        if (fieldName.equals(GEO_POINT_FIELD_NAME)) {
+            GeoPoint originGeoPoint = (origin instanceof GeoPoint)? (GeoPoint) origin : GeoUtils.parseFromString((String) origin);


nit: spaces

jpountz

This looks good overall. Two thoughts on the PR:

I'm wondering whether we should only expose it on date(_nanos) and geo_point fields for now. Lucene exposes it on longs mostly because it doesn't have dedicated fields for dates. As-is, I think it feels a bit inconsistent to support this feature on longs but not on integers or doubles. Since it's also likely less useful than recency boosting, I'm wondering whether we should leave it out for now?
When I opened the issue I wondered whether it would make more sense to have one query for all data types or one query per data type. Looking at the PR, I'm wondering whether having one query per data type might make things cleaner. For instance, it's a pity that some validation only occurs at the shard level while we could do it at the coordinating node level if we knew whether the query would apply to a geo point or a date field.

jpountz · 2019-02-28T16:50:46Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+ways to modify the score, this query has the benefit of being able to
+efficiently skip non-competitive hits when
+<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be
+spectacular.


I'd maybe drop the last sentence with spectacular since it's going to be a bit less efficient than rank_feature due to the dynamic nature of the feature (it's computed on the fly).

jpountz · 2019-02-28T16:53:09Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+`field`::
+    Required parameter. Defines a name of the field on which to calculate
+    distances. Must be a field of type `long`, `date`, or `geo_point`,
+    and must be indexed and has <<doc-values, doc values>>.


Mayya refers to the fact that the field needs both index: true and doc_values: true. Maybe we could be more explicit by saying e.g. "[...] and must be indexed (index: true, which is the default) and have doc values (doc_values: true, which is the default too)".

jpountz · 2019-02-28T16:55:17Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+
+where `distance` is the absolute difference between the origin and
+a document's field value. For date field the distance will be in
+milliseconds; for geo fields the distance is a haversine distance in meters.


Units don't matter, do they?
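
For context, the underlying Lucene distance feature queries compute the score as boost * pivot / (pivot + distance), so a document exactly one pivot away from the origin gets half of the boost. The following is a minimal standalone sketch of that formula, not code from this PR: the class and method names are made up for illustration, and it uses double arithmetic for simplicity. It also shows why the unit cancels out as long as pivot and distance are expressed in the same one.

```java
// Illustrative only: DistanceFeatureScoreSketch is a hypothetical name, not a class from this PR.
final class DistanceFeatureScoreSketch {

    // score = boost * pivot / (pivot + distance); pivot and distance must share a unit.
    static double score(double boost, double pivot, double distance) {
        if (pivot <= 0 || distance < 0) {
            throw new IllegalArgumentException("pivot must be > 0 and distance must be >= 0");
        }
        return boost * pivot / (pivot + distance);
    }

    public static void main(String[] args) {
        // A document exactly one pivot away from the origin gets half of the boost.
        System.out.println(score(1.0, 10.0, 10.0));
        // Same inputs expressed in a unit 1000x smaller: the ratio is unchanged.
        System.out.println(score(1.0, 10_000.0, 10_000.0));
    }
}
```

Because only the ratio between pivot and distance enters the formula, documenting the unit mainly matters for telling users how to express the pivot.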

jpountz · 2019-02-28T16:55:41Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+=== Distance Feature Query
+
+The `distance_feature` query is a specialized query that only works
+on <<number,`long`>>, <<date, `date`>> or <<geo-point,`geo_point`>>


does it work for date_nanos too?

jpountz · 2019-02-28T16:57:33Z

docs/reference/query-dsl/distance-feature-query.asciidoc

+calculates distances between the given origin and documents' field values,
+and use these distances as features to boost the documents' scores.
+
+`distance_feature` query is typically put in a `should` clause of a


suggestion: maybe also mention the nearest neighbors use-case, e.g. "distance_feature query is typically used on its own to find the nearest neighbors to a given point, or put in a should clause [...]"

jpountz · 2019-02-28T17:01:41Z

docs/reference/query-dsl/special-queries.asciidoc

+<<query-dsl-distance-feature-query,`distance_feature` query>>::
+
+A query that computes scores based on the dynamically computed distances
+between the origin and documents' long numeric, geo or distance fields.


should it mention dates?

jpountz · 2019-02-28T17:09:41Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+                ((DateFieldType) fieldType).parseToLong(origin, true, null, null, context);
+            TimeValue val = TimeValue.parseTimeValue((String)pivot, TimeValue.timeValueHours(24),
+                DistanceFeatureQueryBuilder.class.getSimpleName() + ".pivot");
+            long pivotLong = val.getMillis();


this isn't correct with date_nanos.
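
The problem is one of resolution: a date field indexes milliseconds since the epoch while a date_nanos field indexes nanoseconds, so a pivot converted with getMillis() is too small by a factor of one million on date_nanos. Below is a standalone sketch of the conversion, assuming java.time.Duration in place of TimeValue. The Resolution enum and the method name are hypothetical stand-ins, not the Elasticsearch classes.

```java
import java.time.Duration;

// Illustrative only: PivotResolutionSketch and Resolution are hypothetical stand-ins.
final class PivotResolutionSketch {

    enum Resolution { MILLISECONDS, NANOSECONDS }

    // Convert the pivot to the same unit as the long values indexed for the field.
    static long pivotAsLong(Duration pivot, Resolution resolution) {
        return resolution == Resolution.NANOSECONDS ? pivot.toNanos() : pivot.toMillis();
    }

    public static void main(String[] args) {
        Duration pivot = Duration.ofDays(7);
        System.out.println(pivotAsLong(pivot, Resolution.MILLISECONDS));
        System.out.println(pivotAsLong(pivot, Resolution.NANOSECONDS));
    }
}
```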

jpountz · 2019-02-28T17:20:09Z

I'm adding the release highlight label as I know many people including @qhoxie expressed interest for efficient recency and geo-distance boosting in the past.

mayya-sharipova · 2019-02-28T20:50:21Z

@cbuescher thanks for such an extensive review - good feedback on the class design
@jpountz thanks for the review, Adrien. What you suggested makes sense; I will redesign this PR to have separate queries, and only for date and geo fields.

@jpountz @cbuescher I am wondering what should be the name for these new queries. I have two options:

distance_feature_date and distance_feature_geo - option in line with another similar query - rank_feature query
recency_boosting (for date) and proximity_boosting (for geo) - option easier for users to understand what these queries are doing

cbuescher · 2019-03-01T09:46:46Z

I wondered whether it would make more sense to have one query for all data types or one query per data type.
Having one query per data type might make things cleaner. For instance it's a pity that some validation only occurs at the shard level while we could do it at the coordinating node level

I agree that being able to validate at the coordinating node would be great. Separating the query, however, comes with some burden (two query builders, tests, documentation, etc.) that might get even bigger if we choose to add different proximity types later. Maybe it would be possible to stay with one distance_feature query but introduce a mandatory type argument (e.g. date, geo, etc.) that we could then use for validation on the coordinating node? @jpountz wdyt?

jpountz · 2019-03-01T18:14:53Z

@cbuescher What would the API look like? Are you thinking of something like the below?

"distance_feature": {
  "type": "geo",
  "field": "location",
  "origin": <origin>,
  "pivot": <pivot>,
  "boost" : <boost>
}

That said, we have a single query for the range query even though it suffers from the same issue, so maybe we should keep the current approach for consistency.

cbuescher · 2019-03-01T19:48:13Z

@jpountz yes, that's what I had in mind. This way I think we can already check the input parameters for consistency on the coordinating node when parsing the query, e.g. check that dates can be parsed, etc.

mayya-sharipova · 2019-03-04T17:15:48Z

@jpountz @cbuescher thanks for suggestions. I will go then with the format:

"distance_feature": {
  "type": "geo",
  "field": "location",
  "origin": <origin>,
  "pivot": <pivot>,
  "boost" : <boost>
}

jimczi · 2019-03-06T10:01:17Z

That said, we have a single query for the range query even though it suffers from the same issue, so maybe we should keep the current approach for consistency.

And for simplicity ;), exposing the type argument just to allow parsing on the coordinating node seems overkill to me and I am not sure that type would be enough to handle date format, ...
I am not too concerned by the fact that parsing errors will be detected at the shard level, most of our queries that require parsing have the same limitations so IMO we shouldn't add extra arguments that are useful only internally.

cbuescher · 2019-03-06T10:22:54Z

I am not too concerned by the fact that parsing errors will be detected at the shard level, most of our queries that require parsing have the same limitations so IMO we shouldn't add extra arguments that are useful only internally.

I'm okay with either way that doesn't involve adding several different query builders etc...

jpountz · 2019-03-06T12:32:03Z

Thanks @jimczi for chiming in, agreed.

mayya-sharipova · 2019-03-07T20:57:18Z

@cbuescher @jpountz Thanks for your review. This is ready for another round of review. Changes made:

Modify distance_feature query to be on date, date_nanos and geo fields only. Exclude long field
Make origin an inner class


mayya-sharipova · 2019-03-08T10:43:24Z

@elasticmachine run elasticsearch-ci/1

mayya-sharipova · 2019-03-08T10:44:43Z

@elasticmachine test this please

jpountz

I left some comments.

jpountz · 2019-03-08T17:32:04Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+                return LongPoint.newDistanceFeatureQuery(field, boost, originLong, pivotVal.getNanos());
+            }
+        } else if (fieldType instanceof GeoPointFieldType) {
+            GeoPoint originGeoPoint = (originObj instanceof GeoPoint)? (GeoPoint) originObj : GeoUtils.parseFromString((String) originObj);


the cast from String might raise a ClassCastException?

jpountz · 2019-03-08T17:33:23Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+    protected Query doToQuery(QueryShardContext context) throws IOException {
+        MappedFieldType fieldType = context.fieldMapper(field);
+        if (fieldType == null) {
+            throw new IllegalArgumentException("Can't run [" + NAME + "] query on unmapped fields!");


our general policy is to be lenient when it comes to unmapped fields, to make cross-index search easier. I'd return a MatchNoDocsQuery instead.

jpountz · 2019-03-08T17:34:34Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+        Object originObj = origin.origin();
+        if (fieldType instanceof DateFieldType) {
+            long originLong = (originObj instanceof Long) ? (Long) originObj :
+                ((DateFieldType) fieldType).parseToLong(originObj, true, null, null, context);


I don't like taking longs as-is, we should always go through parseToLong imo. I'd like to avoid that users need to be aware of the internal resolution of the field.

@jpountz parseToLong does NOT work for date_nanos field when parsing a huge long value (an expected value for date_nanos).
DateFieldType::parseToLong uses JavaDateMathParser (instead of desired JodaDateMathParser) even if a date type is date_nanos. Or should I pass a forcedDateParser to parseToLong depending on the fieldType's resolution?

@jpountz Please disregard my previous comment here. I assumed that date_nanos expects a long value as nanoseconds-since-the-epoch, but actually the long value should be milliseconds-since-the-epoch.
I have made the changes as you suggested.

jpountz · 2019-03-08T17:34:51Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+            return LatLonPoint.newDistanceFeatureQuery(field, boost, originGeoPoint.lat(), originGeoPoint.lon(), pivotDouble);
+        }
+        throw new IllegalArgumentException(
+            "Illegal data type! ["+ NAME + "] query can only be run on a date, date_nanos or geo_point field type!");


can you add the type of the field to the error message?

jpountz · 2019-03-08T17:35:52Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+                this.origin = origin;
+            } else {
+                throw new IllegalArgumentException("Illegal type for [origin]! Must be of type [long] or [string] for " +
+                    "date and date_nanos origins," + "[geo_point] or [string] for geo_point origins!");


can you add the class of the object to the error message?
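
One way to do this, sketched outside of Elasticsearch: check the runtime type up front and report the offending class name in the exception. GeoPointStub is a hypothetical stand-in for GeoPoint, and the class and method names are made up for illustration. Checking with instanceof before any cast also avoids surfacing a ClassCastException to the user.

```java
// Illustrative only: OriginValidationSketch and GeoPointStub are hypothetical stand-ins.
final class OriginValidationSketch {

    static final class GeoPointStub {
        final double lat;
        final double lon;

        GeoPointStub(double lat, double lon) {
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Accepts only the supported origin types; anything else fails with the actual class in the message.
    static Object validateOrigin(Object origin) {
        if (origin instanceof Long || origin instanceof String || origin instanceof GeoPointStub) {
            return origin;
        }
        String actual = origin == null ? "null" : origin.getClass().getName();
        throw new IllegalArgumentException("Illegal type for [origin]! Must be of type [long] or [string] for "
            + "date and date_nanos origins, [geo_point] or [string] for geo_point origins, but got [" + actual + "]");
    }

    public static void main(String[] args) {
        System.out.println(validateOrigin("2019-02-26"));
        try {
            validateOrigin(3.14);
        } catch (IllegalArgumentException e) {
            // The message names java.lang.Double as the rejected class.
            System.out.println(e.getMessage());
        }
    }
}
```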

cbuescher

Left some more minor comments, but nothing huge that needs another review from my side I think.

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

mayya-sharipova · 2019-03-17T23:37:19Z

@cbuescher @jpountz Thanks for another round of the review. I have addressed all your comments in the last commits.

mayya-sharipova · 2019-03-18T01:30:39Z

@elasticmachine run elasticsearch-ci/packaging-sample

jpountz

I left some minor comments. Other than that LGTM.

jpountz · 2019-03-18T19:03:30Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+        Object originObj = origin.origin();
+        if (fieldType instanceof DateFieldType) {
+            long originLong = ((DateFieldType) fieldType).parseToLong(originObj, true, null, null, context);
+            TimeValue pivotVal = TimeValue.parseTimeValue(pivot, TimeValue.timeValueHours(24),


It's a bit confusing to pass a default value here given that the origin can't be null.

jpountz · 2019-03-18T19:05:39Z

server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java

+
+        public Origin(GeoPoint origin) {
+            this.origin = origin;
+        }


let's reject nulls in above constructors
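
A minimal sketch of what rejecting nulls could look like, assuming Objects.requireNonNull is acceptable here. OriginSketch and GeoPointStub are hypothetical stand-ins for the PR's Origin class and Elasticsearch's GeoPoint, and the error message text is invented.

```java
import java.util.Objects;

// Illustrative only: OriginSketch and GeoPointStub are hypothetical stand-ins.
final class OriginSketch {

    static final class GeoPointStub {
        final double lat;
        final double lon;

        GeoPointStub(double lat, double lon) {
            this.lat = lat;
            this.lon = lon;
        }
    }

    private final Object origin;

    // Each typed constructor fails fast on null instead of deferring the failure to query time.
    OriginSketch(Long origin) {
        this.origin = Objects.requireNonNull(origin, "[origin] must not be null");
    }

    OriginSketch(String origin) {
        this.origin = Objects.requireNonNull(origin, "[origin] must not be null");
    }

    OriginSketch(GeoPointStub origin) {
        this.origin = Objects.requireNonNull(origin, "[origin] must not be null");
    }

    Object origin() {
        return origin;
    }

    public static void main(String[] args) {
        System.out.println(new OriginSketch("drm3btev3e86").origin());
        try {
            // The cast picks the String overload; a bare null would be ambiguous.
            new OriginSketch((String) null);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```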

jpountz · 2019-03-18T19:11:39Z

server/src/test/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilderTests.java

+                pivot = randomTimeValue(1, 1000, "d", "h", "ms", "s", "m");
+                break;
+            default: // DATE_NANOS_FIELD_NAME
+                Instant randomDateNanos = Instant.now().minus(Duration.ofNanos(randomLongBetween(0, 100_000_000)));


Let's avoid using now to make failures reproducible?

jpountz · 2019-03-18T19:11:50Z

server/src/test/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilderTests.java

+                pivot = randomFrom(DistanceUnit.values()).toString(randomDouble());
+                break;
+            case DATE_FIELD_NAME:
+                long randomDateMills = System.currentTimeMillis() - randomLongBetween(0, 1_000_000);


let's avoid using currentTimeMillis to make things reproducible?
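
A sketch of the idea outside of the test framework: derive the date from a seeded random generator and a fixed reference point rather than from the wall clock, so the same seed always yields the same value. The class, method, and reference date below are hypothetical. In the actual test, the framework's own seeded helpers such as randomLongBetween would presumably play the role of java.util.Random.

```java
import java.time.Instant;
import java.util.Random;

// Illustrative only: ReproducibleDateSketch is a hypothetical name, not part of the test class.
final class ReproducibleDateSketch {

    // Fixed reference point used instead of System.currentTimeMillis() or Instant.now().
    static final long BASE_EPOCH_MILLIS = Instant.parse("2019-03-18T00:00:00Z").toEpochMilli();

    // Picks a date up to 1,000,000 ms before the fixed base; the result depends only on the seed.
    static long randomDateMillis(long seed) {
        Random random = new Random(seed);
        long offset = random.nextInt(1_000_001);
        return BASE_EPOCH_MILLIS - offset;
    }

    public static void main(String[] args) {
        // Two runs with the same seed produce the same date, so a failure can be replayed.
        System.out.println(randomDateMillis(42L) == randomDateMillis(42L));
    }
}
```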

Expose DistanceFeatureQuery for geo, date and date_nanos types Closes elastic#33382


mayya-sharipova added >feature :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Feb 26, 2019

mayya-sharipova added v8.0.0 v7.2.0 labels Feb 26, 2019

Expose proximity boosting

7dd4447

Expose DistanceFeatureQuery for long, geo and date types Closes elastic#33382

mayya-sharipova force-pushed the expose-distance-feature-query branch from 71a5863 to 7dd4447 Compare February 26, 2019 15:56

cbuescher self-assigned this Feb 27, 2019

cbuescher requested changes Feb 27, 2019

jpountz reviewed Feb 28, 2019

jpountz added the release highlight label Feb 28, 2019

mayya-sharipova added 2 commits March 7, 2019 16:50

Address feedback:

0b2ae25

- Modify distance_feature query to be on date, date_nanos and geo fields only. Exclude long field - Make `origin` as an inner class

Merge branch 'master' into expose-distance-feature-query

de1263b

mayya-sharipova force-pushed the expose-distance-feature-query branch from df5220b to de1263b Compare March 7, 2019 22:41

jpountz requested changes Mar 8, 2019

cbuescher approved these changes Mar 12, 2019

mayya-sharipova added 2 commits March 15, 2019 18:25

Address Adrien and Christophs' comments

a591bb4

Merge branch 'master' into expose-distance-feature-query

c934435

mayya-sharipova added 2 commits March 15, 2019 18:31

Revert back origin

e3366da

Change to milliseconds for date_nanos

a4bd771

Merge branch 'master' into expose-distance-feature-query

2441174

jpountz approved these changes Mar 18, 2019

mayya-sharipova added 2 commits March 18, 2019 16:32

Address Adrien's comments

83f0c84

Correct checkstyle

be7a4f9

mayya-sharipova merged commit a87b139 into elastic:master Mar 19, 2019

mayya-sharipova deleted the expose-distance-feature-query branch March 19, 2019 11:04

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Mar 20, 2019

Expose proximity boosting (elastic#39385)

d6b3f91

Expose DistanceFeatureQuery for geo, date and date_nanos types Closes elastic#33382

mayya-sharipova mentioned this pull request Mar 20, 2019

Expose proximity boosting (#39385) #40251

Merged

mayya-sharipova added a commit that referenced this pull request Mar 20, 2019

Expose proximity boosting (#39385) (#40251)

49a7c6e

Expose DistanceFeatureQuery for geo, date and date_nanos types Closes #33382

mayya-sharipova added a commit that referenced this pull request Mar 20, 2019

Adjust the version for 250_distance_feature test

2945b8f

Relates to #39385

codebrain mentioned this pull request Aug 2, 2019

[meta] 7.2 Release elastic/elasticsearch-net#3980

Closed

37 tasks

consulthys mentioned this pull request Oct 9, 2019

QueryBuilders is missing builder for distance_feature #47767

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
