-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose proximity boosting #39385
Expose proximity boosting #39385
Conversation
Pinging @elastic/es-search |
Expose DistanceFeatureQuery for long, geo and date types Closes elastic#33382
71a5863
to
7dd4447
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova this looks like a really cool and useful feature. I left a couple of comments, most of them minor. I think the biggest suggestion I'd have would be trying to avoid to have untyped (Object) origin and pivot fields and instead create some very simple inner class for holding this information in either one of the allowed types (long, String, Date). This way we should be able to control which values the user can set and prevent accidental misuses early.
I was also wondering if it makes sense to add a few integration tests (something extending ESIntegTestCase) to check if the scoring works as advertised (e.g. boosts are applied correctly, pivot works as expected etc...) but then again I guess this is already covered in the underlying Lucene queries so maybe its not that useful, but wanted to bring it up as a suggestion.
Other than that great feature, looking forward to using it in 7.1.
ways to modify the score, this query has the benefit of being able to | ||
efficiently skip non-competitive hits when | ||
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be | ||
spectacular. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, do we have experiments/numbers/blogs to back this up? No need to change if we haven't but I was wondering if we could add anything in case we have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher I have copied this phrase from rank_feature query which is using the same optimizations, and we also have a blog post on this. This blog post has a link to Lucene benchmarks, but looks like adding this link to these benchmarks would be excessive here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd maybe drop the last sentence with spectacular
since it's going to be a bit less efficient than rank_feature
due to the dynamic nature of the feature (it's computed on the fly).
|
||
[horizontal] | ||
`field`:: | ||
Required parameter. Defines a name of the field on which to calculate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/a/the
`field`:: | ||
Required parameter. Defines a name of the field on which to calculate | ||
distances. Must be a field of type `long`, `date`, or `geo_point`, | ||
and must be indexed and has <<doc-values, doc values>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "must be indexed using <<doc-values, doc values>>" or something similar instead of another "and" clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mayya refers to the fact that the field needs both indexed:true
and doc_values:true
. Maybe we could be more explicit by saying eg "[...] and must be indexed (index: true
, which is the default) and have doc values (doc_values: true
, which is the default too).
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
We look for all chocolate items, but we also want chocolates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe in all three cases start with "We can look", or "We now want to look" to make it less repetitive?
<<query-dsl-distance-feature-query,`distance_feature` query>>:: | ||
|
||
A query that computes scores based on the dynamically computed distances | ||
between the origin and documents' long numeric, geo or distance fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "geo-point" instead of "geo" like above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it mention dates?
@Override | ||
protected void doWriteTo(StreamOutput out) throws IOException { | ||
out.writeString(field); | ||
out.writeGenericValue(origin); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could move to an inner class if we added them for Origin and Pivot.
long pivotLong = (Long) pivot; | ||
return LongPoint.newDistanceFeatureQuery(field, boost, originLong, pivotLong); | ||
} else if (fieldType instanceof GeoPointFieldType) { | ||
GeoPoint originGeoPoint = (origin instanceof GeoPoint)? (GeoPoint) origin : GeoUtils.parseFromString((String) origin); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space between ")?"
if (fieldType instanceof DateFieldType) { | ||
long originLong = (origin instanceof Long) ? (Long) origin : | ||
((DateFieldType) fieldType).parseToLong(origin, true, null, null, context); | ||
TimeValue val = TimeValue.parseTimeValue((String)pivot, TimeValue.timeValueHours(24), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space between "(String)pivot"
switch (field) { | ||
case GEO_POINT_FIELD_NAME: | ||
origin = new GeoPoint(randomDouble(), randomDouble()); | ||
origin = randomBoolean()? origin : ((GeoPoint) origin).geohash(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spaces
float boost = queryBuilder.boost; | ||
final Query expectedQuery; | ||
if (fieldName.equals(GEO_POINT_FIELD_NAME)) { | ||
GeoPoint originGeoPoint = (origin instanceof GeoPoint)? (GeoPoint) origin : GeoUtils.parseFromString((String) origin); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good overall. Two thoughts on the PR:
- I'm wondering that we should only expose it on
date(_nanos)
andgeo_point
fields for now. Lucene exposes it only longs mostly because it doesn't have dedicated fields for dates. As-is I think it feels a bit inconsistent to support this feature on longs but not integers or doubles. Since it's also likely less useful than recency boosting I'm wondering that we should leave it out for now? - When I opened the issue I wondered whether it would make more sense to have one query for all data types or one query per data type. Looking at the PR I'm wondering that having one query per data type might make things cleaner. For instance it's a pity that some validation only occurs at the shard level while we could do it at the coordinating node level if we knew whether the query would apply to a geo point or a date field.
ways to modify the score, this query has the benefit of being able to | ||
efficiently skip non-competitive hits when | ||
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be | ||
spectacular. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd maybe drop the last sentence with spectacular
since it's going to be a bit less efficient than rank_feature
due to the dynamic nature of the feature (it's computed on the fly).
`field`:: | ||
Required parameter. Defines a name of the field on which to calculate | ||
distances. Must be a field of type `long`, `date`, or `geo_point`, | ||
and must be indexed and has <<doc-values, doc values>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mayya refers to the fact that the field needs both indexed:true
and doc_values:true
. Maybe we could be more explicit by saying eg "[...] and must be indexed (index: true
, which is the default) and have doc values (doc_values: true
, which is the default too).
|
||
where `distance` is the absolute difference between the origin and | ||
a document's field value. For date field the distance will be in | ||
milliseconds; for geo fields the distance is a haversine distance in meters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Units don't matter, do they?
=== Distance Feature Query | ||
|
||
The `distance_feature` query is a specialized query that only works | ||
on <<number,`long`>>, <<date, `date`>> or <<geo-point,`geo_point`>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it work for date_nanos
too?
calculates distances between the given origin and documents' field values, | ||
and use these distances as features to boost the documents' scores. | ||
|
||
`distance_feature` query is typically put in a `should` clause of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: maybe also mention the nearest neighbors use-case, eg. "distance_feature
query is typically used on its own to find the nearest neighbors to a given point, or put in ashould
clause [...]"
<<query-dsl-distance-feature-query,`distance_feature` query>>:: | ||
|
||
A query that computes scores based on the dynamically computed distances | ||
between the origin and documents' long numeric, geo or distance fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it mention dates?
((DateFieldType) fieldType).parseToLong(origin, true, null, null, context); | ||
TimeValue val = TimeValue.parseTimeValue((String)pivot, TimeValue.timeValueHours(24), | ||
DistanceFeatureQueryBuilder.class.getSimpleName() + ".pivot"); | ||
long pivotLong = val.getMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't correct with date_nanos
.
I'm adding the |
@cbuescher thanks for such an extensive review - good feedback on the class design @jpountz @cbuescher I am wondering what should be the name for these new queries. I have two options:
|
I agree that being able to validate at the coordinating node would be great. Separating the query however comes with some burden (two query builders, test, documentations etc...) that might be getting even bigger if we choose to add different proximity types later. Maybe it would be possible to stay with one |
@cbuescher What would the API look like, are you thinking of something like below
That said, we have a single query for the |
@jpountz yes, thats what I had in mind. This way I think we can check the input parameters for consistency already on the coordinating node when parsing the query, e.g. check that dates can be parsed etc... |
@jpountz @cbuescher thanks for suggestions. I will go then with the format:
|
And for simplicity ;), exposing the |
I'm okay with either way that doesn't involve adding several different query builders etc... |
Thanks @jimczi for chiming in, agreed. |
@cbuescher @jpountz Thanks for your review. This is ready for another round of review. Changes made:
|
- Modify distance_feature query to be on date, date_nanos and geo fields only. Exclude long field - Make `origin` as an inner class
df5220b
to
de1263b
Compare
@elasticmachine run elasticsearch-ci/1 |
@elasticmachine test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments.
return LongPoint.newDistanceFeatureQuery(field, boost, originLong, pivotVal.getNanos()); | ||
} | ||
} else if (fieldType instanceof GeoPointFieldType) { | ||
GeoPoint originGeoPoint = (originObj instanceof GeoPoint)? (GeoPoint) originObj : GeoUtils.parseFromString((String) originObj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cast from String might raise a ClassCastException?
protected Query doToQuery(QueryShardContext context) throws IOException { | ||
MappedFieldType fieldType = context.fieldMapper(field); | ||
if (fieldType == null) { | ||
throw new IllegalArgumentException("Can't run [" + NAME + "] query on unmapped fields!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our general policy is to be lenient when it comes to unmapped fields, to make cross-index search easier. I'd return a MatchNoDocsQuery
instead.
Object originObj = origin.origin(); | ||
if (fieldType instanceof DateFieldType) { | ||
long originLong = (originObj instanceof Long) ? (Long) originObj : | ||
((DateFieldType) fieldType).parseToLong(originObj, true, null, null, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like taking longs as-is, we should always go through parseToLong
imo. I'd like to avoid that users need to be aware of the internal resolution of the field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz parseToLong
does NOT work for date_nanos field when parsing a huge long value (an expected value for date_nanos
).
DateFieldType::parseToLong
uses JavaDateMathParser
(instead of desired JodaDateMathParser
) even if a date type is date_nanos
. Or should I pass a forcedDateParser
to parseToLong
depending on the fieldType's resolution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz Please disregard my previous comment here. I assumed that date_nanos
except a long value as nanoseconds-since-the-epoch, but actually the long value should be milliseconds-since-the-epoch.
I have made the changes as you suggested.
return LatLonPoint.newDistanceFeatureQuery(field, boost, originGeoPoint.lat(), originGeoPoint.lon(), pivotDouble); | ||
} | ||
throw new IllegalArgumentException( | ||
"Illegal data type! ["+ NAME + "] query can only be run on a date, date_nanos or geo_point field type!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the type of the field to the error message?
this.origin = origin; | ||
} else { | ||
throw new IllegalArgumentException("Illegal type for [origin]! Must be of type [long] or [string] for " + | ||
"date and date_nanos origins," + "[geo_point] or [string] for geo_point origins!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the class of the object to the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some more minor comments, but nothing huge that needs another review from my side I think.
server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/DistanceFeatureQueryBuilder.java
Outdated
Show resolved
Hide resolved
@cbuescher @jpountz Thanks for another round of the review. I have addressed all your comments in the last commits. |
@elasticmachine run elasticsearch-ci/packaging-sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments. Other than that LGTM.
Object originObj = origin.origin(); | ||
if (fieldType instanceof DateFieldType) { | ||
long originLong = ((DateFieldType) fieldType).parseToLong(originObj, true, null, null, context); | ||
TimeValue pivotVal = TimeValue.parseTimeValue(pivot, TimeValue.timeValueHours(24), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit confusing to pass a default value here given that the origin can't be null.
|
||
public Origin(GeoPoint origin) { | ||
this.origin = origin; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's reject nulls in above constructors
pivot = randomTimeValue(1, 1000, "d", "h", "ms", "s", "m"); | ||
break; | ||
default: // DATE_NANOS_FIELD_NAME | ||
Instant randomDateNanos = Instant.now().minus(Duration.ofNanos(randomLongBetween(0, 100_000_000))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid using now to make failures reproducible?
pivot = randomFrom(DistanceUnit.values()).toString(randomDouble()); | ||
break; | ||
case DATE_FIELD_NAME: | ||
long randomDateMills = System.currentTimeMillis() - randomLongBetween(0, 1_000_000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid using currentTimeMillis to make things reproducible?
Expose DistanceFeatureQuery for geo, date and date_nanos types Closes elastic#33382
Expose DistanceFeatureQuery for geo, date_nanos an date types
Closes #33382