Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose proximity boosting #39385

Merged

Conversation

mayya-sharipova
Copy link
Contributor

@mayya-sharipova mayya-sharipova commented Feb 26, 2019

Expose DistanceFeatureQuery for geo, date_nanos an date types

Closes #33382

@mayya-sharipova mayya-sharipova added >feature :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Feb 26, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Expose DistanceFeatureQuery for long, geo and date types

Closes elastic#33382
@mayya-sharipova mayya-sharipova force-pushed the expose-distance-feature-query branch from 71a5863 to 7dd4447 Compare February 26, 2019 15:56
@cbuescher cbuescher self-assigned this Feb 27, 2019
Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayya-sharipova this looks like a really cool and useful feature. I left a couple of comments, most of them minor. I think the biggest suggestion I'd have would be trying to avoid to have untyped (Object) origin and pivot fields and instead create some very simple inner class for holding this information in either one of the allowed types (long, String, Date). This way we should be able to control which values the user can set and prevent accidental misuses early.
I was also wondering if it makes sense to add a few integration tests (something extending ESIntegTestCase) to check if the scoring works as advertised (e.g. boosts are applied correctly, pivot works as expected etc...) but then again I guess this is already covered in the underlying Lucene queries so maybe its not that useful, but wanted to bring it up as a suggestion.
Other than that great feature, looking forward to using it in 7.1.

ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be
spectacular.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, do we have experiments/numbers/blogs to back this up? No need to change if we haven't but I was wondering if we could add anything in case we have it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbuescher I have copied this phrase from rank_feature query which is using the same optimizations, and we also have a blog post on this. This blog post has a link to Lucene benchmarks, but looks like adding this link to these benchmarks would be excessive here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd maybe drop the last sentence with spectacular since it's going to be a bit less efficient than rank_feature due to the dynamic nature of the feature (it's computed on the fly).


[horizontal]
`field`::
Required parameter. Defines a name of the field on which to calculate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/a/the

`field`::
Required parameter. Defines a name of the field on which to calculate
distances. Must be a field of type `long`, `date`, or `geo_point`,
and must be indexed and has <<doc-values, doc values>>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "must be indexed using <<doc-values, doc values>>" or something similar instead of another "and" clause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mayya refers to the fact that the field needs both indexed:true and doc_values:true. Maybe we could be more explicit by saying eg "[...] and must be indexed (index: true, which is the default) and have doc values (doc_values: true, which is the default too).

--------------------------------------------------
// CONSOLE

We look for all chocolate items, but we also want chocolates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe in all three cases start with "We can look", or "We now want to look" to make it less repetitive?

<<query-dsl-distance-feature-query,`distance_feature` query>>::

A query that computes scores based on the dynamically computed distances
between the origin and documents' long numeric, geo or distance fields.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "geo-point" instead of "geo" like above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it mention dates?

@Override
protected void doWriteTo(StreamOutput out) throws IOException {
out.writeString(field);
out.writeGenericValue(origin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could move to an inner class if we added them for Origin and Pivot.

long pivotLong = (Long) pivot;
return LongPoint.newDistanceFeatureQuery(field, boost, originLong, pivotLong);
} else if (fieldType instanceof GeoPointFieldType) {
GeoPoint originGeoPoint = (origin instanceof GeoPoint)? (GeoPoint) origin : GeoUtils.parseFromString((String) origin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space between ")?"

if (fieldType instanceof DateFieldType) {
long originLong = (origin instanceof Long) ? (Long) origin :
((DateFieldType) fieldType).parseToLong(origin, true, null, null, context);
TimeValue val = TimeValue.parseTimeValue((String)pivot, TimeValue.timeValueHours(24),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space between "(String)pivot"

switch (field) {
case GEO_POINT_FIELD_NAME:
origin = new GeoPoint(randomDouble(), randomDouble());
origin = randomBoolean()? origin : ((GeoPoint) origin).geohash();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spaces

float boost = queryBuilder.boost;
final Query expectedQuery;
if (fieldName.equals(GEO_POINT_FIELD_NAME)) {
GeoPoint originGeoPoint = (origin instanceof GeoPoint)? (GeoPoint) origin : GeoUtils.parseFromString((String) origin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spaces

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall. Two thoughts on the PR:

  • I'm wondering that we should only expose it on date(_nanos) and geo_point fields for now. Lucene exposes it only longs mostly because it doesn't have dedicated fields for dates. As-is I think it feels a bit inconsistent to support this feature on longs but not integers or doubles. Since it's also likely less useful than recency boosting I'm wondering that we should leave it out for now?
  • When I opened the issue I wondered whether it would make more sense to have one query for all data types or one query per data type. Looking at the PR I'm wondering that having one query per data type might make things cleaner. For instance it's a pity that some validation only occurs at the shard level while we could do it at the coordinating node level if we knew whether the query would apply to a geo point or a date field.

ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be
spectacular.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd maybe drop the last sentence with spectacular since it's going to be a bit less efficient than rank_feature due to the dynamic nature of the feature (it's computed on the fly).

`field`::
Required parameter. Defines a name of the field on which to calculate
distances. Must be a field of type `long`, `date`, or `geo_point`,
and must be indexed and has <<doc-values, doc values>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mayya refers to the fact that the field needs both indexed:true and doc_values:true. Maybe we could be more explicit by saying eg "[...] and must be indexed (index: true, which is the default) and have doc values (doc_values: true, which is the default too).


where `distance` is the absolute difference between the origin and
a document's field value. For date field the distance will be in
milliseconds; for geo fields the distance is a haversine distance in meters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Units don't matter, do they?

=== Distance Feature Query

The `distance_feature` query is a specialized query that only works
on <<number,`long`>>, <<date, `date`>> or <<geo-point,`geo_point`>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it work for date_nanos too?

calculates distances between the given origin and documents' field values,
and use these distances as features to boost the documents' scores.

`distance_feature` query is typically put in a `should` clause of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: maybe also mention the nearest neighbors use-case, eg. "distance_feature query is typically used on its own to find the nearest neighbors to a given point, or put in ashould clause [...]"

<<query-dsl-distance-feature-query,`distance_feature` query>>::

A query that computes scores based on the dynamically computed distances
between the origin and documents' long numeric, geo or distance fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it mention dates?

((DateFieldType) fieldType).parseToLong(origin, true, null, null, context);
TimeValue val = TimeValue.parseTimeValue((String)pivot, TimeValue.timeValueHours(24),
DistanceFeatureQueryBuilder.class.getSimpleName() + ".pivot");
long pivotLong = val.getMillis();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't correct with date_nanos.

@jpountz
Copy link
Contributor

jpountz commented Feb 28, 2019

I'm adding the release highlight label as I know many people including @qhoxie expressed interest for efficient recency and geo-distance boosting in the past.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Feb 28, 2019

@cbuescher thanks for such an extensive review - good feedback on the class design
@jpountz thank for the review, Adrien. What you suggested makes sense, I will redesign this PR to have separate queries and only for date and geo fields.

@jpountz @cbuescher I am wondering what should be the name for these new queries. I have two options:

  1. distance_feature_date and distance_feature_geo - option in line with another similar query - rank_feature query
  2. recency_boosting (for date) and proximity_boosting (for geo) - option easier for users to understand what these queries are doing

@cbuescher
Copy link
Member

I wondered whether it would make more sense to have one query for all data types or one query per data type.
Having one query per data type might make things cleaner. For instance it's a pity that some validation only occurs at the shard level while we could do it at the coordinating node level

I agree that being able to validate at the coordinating node would be great. Separating the query however comes with some burden (two query builders, test, documentations etc...) that might be getting even bigger if we choose to add different proximity types later. Maybe it would be possible to stay with one distance_feature query but introduce a mandatory type argument (e.g. date, geo etc...) that we then could use for validation on the coordinating node? @jpountz wdyt?

@jpountz
Copy link
Contributor

jpountz commented Mar 1, 2019

@cbuescher What would the API look like, are you thinking of something like below

"distance_feature": {
  "type": "geo",
  "field": "location",
  "origin": <origin>,
  "pivot": <pivot>,
  "boost" : <boost>
}

That said, we have a single query for the range query even though it suffers for the same issue, so maybe we should keep the current approach for consistency.

@cbuescher
Copy link
Member

@jpountz yes, thats what I had in mind. This way I think we can check the input parameters for consistency already on the coordinating node when parsing the query, e.g. check that dates can be parsed etc...

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Mar 4, 2019

@jpountz @cbuescher thanks for suggestions. I will go then with the format:

"distance_feature": {
  "type": "geo",
  "field": "location",
  "origin": <origin>,
  "pivot": <pivot>,
  "boost" : <boost>
}

@jimczi
Copy link
Contributor

jimczi commented Mar 6, 2019

That said, we have a single query for the range query even though it suffers for the same issue, so maybe we should keep the current approach for consistency.

And for simplicity ;), exposing the type argument just to allow parsing on the coordinating node seems overkill to me and I am not sure that type would be enough to handle date format, ...
I am not too concerned by the fact that parsing errors will be detected at the shard level, most of our queries that require parsing have the same limitations so IMO we shouldn't add extra arguments that are useful only internally.

@cbuescher
Copy link
Member

cbuescher commented Mar 6, 2019

I am not too concerned by the fact that parsing errors will be detected at the shard level, most of our queries that require parsing have the same limitations so IMO we shouldn't add extra arguments that are useful only internally.

I'm okay with either way that doesn't involve adding several different query builders etc...

@jpountz
Copy link
Contributor

jpountz commented Mar 6, 2019

Thanks @jimczi for chiming in, agreed.

@mayya-sharipova
Copy link
Contributor Author

@cbuescher @jpountz Thanks for your review. This is ready for another round of review. Changes made:

  • Modify distance_feature query to be on date, date_nanos and geo ields only. Exclude long field
  • Make origin as an inner class

- Modify distance_feature query to be on date, date_nanos and geo
    fields only. Exclude long field
- Make `origin` as an inner class
@mayya-sharipova mayya-sharipova force-pushed the expose-distance-feature-query branch from df5220b to de1263b Compare March 7, 2019 22:41
@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Mar 8, 2019

@elasticmachine run elasticsearch-ci/1

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine test this please

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.

return LongPoint.newDistanceFeatureQuery(field, boost, originLong, pivotVal.getNanos());
}
} else if (fieldType instanceof GeoPointFieldType) {
GeoPoint originGeoPoint = (originObj instanceof GeoPoint)? (GeoPoint) originObj : GeoUtils.parseFromString((String) originObj);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cast from String might raise a ClassCastException?

protected Query doToQuery(QueryShardContext context) throws IOException {
MappedFieldType fieldType = context.fieldMapper(field);
if (fieldType == null) {
throw new IllegalArgumentException("Can't run [" + NAME + "] query on unmapped fields!");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our general policy is to be lenient when it comes to unmapped fields, to make cross-index search easier. I'd return a MatchNoDocsQuery instead.

Object originObj = origin.origin();
if (fieldType instanceof DateFieldType) {
long originLong = (originObj instanceof Long) ? (Long) originObj :
((DateFieldType) fieldType).parseToLong(originObj, true, null, null, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like taking longs as-is, we should always go through parseToLong imo. I'd like to avoid that users need to be aware of the internal resolution of the field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz parseToLong does NOT work for date_nanos field when parsing a huge long value (an expected value for date_nanos).
DateFieldType::parseToLong uses JavaDateMathParser (instead of desired JodaDateMathParser) even if a date type is date_nanos. Or should I pass a forcedDateParser to parseToLong depending on the fieldType's resolution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz Please disregard my previous comment here. I assumed that date_nanos except a long value as nanoseconds-since-the-epoch, but actually the long value should be milliseconds-since-the-epoch.
I have made the changes as you suggested.

return LatLonPoint.newDistanceFeatureQuery(field, boost, originGeoPoint.lat(), originGeoPoint.lon(), pivotDouble);
}
throw new IllegalArgumentException(
"Illegal data type! ["+ NAME + "] query can only be run on a date, date_nanos or geo_point field type!");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the type of the field to the error message?

this.origin = origin;
} else {
throw new IllegalArgumentException("Illegal type for [origin]! Must be of type [long] or [string] for " +
"date and date_nanos origins," + "[geo_point] or [string] for geo_point origins!");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the class of the object to the error message?

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some more minor comments, but nothing huge that needs another review from my side I think.

@mayya-sharipova
Copy link
Contributor Author

@cbuescher @jpountz Thanks for another round of the review. I have addressed all your comments in the last commits.

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments. Other than that LGTM.

Object originObj = origin.origin();
if (fieldType instanceof DateFieldType) {
long originLong = ((DateFieldType) fieldType).parseToLong(originObj, true, null, null, context);
TimeValue pivotVal = TimeValue.parseTimeValue(pivot, TimeValue.timeValueHours(24),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing to pass a default value here given that the origin can't be null.


public Origin(GeoPoint origin) {
this.origin = origin;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's reject nulls in above constructors

pivot = randomTimeValue(1, 1000, "d", "h", "ms", "s", "m");
break;
default: // DATE_NANOS_FIELD_NAME
Instant randomDateNanos = Instant.now().minus(Duration.ofNanos(randomLongBetween(0, 100_000_000)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid using now to make failures reproducible?

pivot = randomFrom(DistanceUnit.values()).toString(randomDouble());
break;
case DATE_FIELD_NAME:
long randomDateMills = System.currentTimeMillis() - randomLongBetween(0, 1_000_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid using currentTimeMillis to make things reproducible?

@mayya-sharipova mayya-sharipova merged commit a87b139 into elastic:master Mar 19, 2019
@mayya-sharipova mayya-sharipova deleted the expose-distance-feature-query branch March 19, 2019 11:04
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Mar 20, 2019
Expose DistanceFeatureQuery for geo, date and date_nanos types

Closes elastic#33382
mayya-sharipova added a commit that referenced this pull request Mar 20, 2019
Expose DistanceFeatureQuery for geo, date and date_nanos types

Closes #33382
mayya-sharipova added a commit that referenced this pull request Mar 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose proximity boosting
6 participants