improved boosting for exact matches where the last token is in fact complete #1202

Open
wants to merge 2 commits into master
Conversation

@missinglink (Member) commented Oct 9, 2018

Today I was looking into how autocomplete queries for transit stops are scored, and managed to fix incorrect scoring in some cases:

before:

/v1/autocomplete?text=stop 2

 1)	SW 2nd & Alder (TriMet Stop ID 7095), Portland, OR, USA
 2)	W Burnside & SW 2nd (TriMet Stop ID 9526), Portland, OR, USA
 3)	SW 2nd & Taylor (TriMet Stop ID 7098), Portland, OR, USA
...

/v1/autocomplete?text=2905

 1)	2905 SW 1st Ave, Portland, OR, USA
 2)	2905 SW 4th Ave, Portland, OR, USA
 3)	2905 SE 17th Ave, Portland, OR, USA

after:

/v1/autocomplete?text=stop 2

 1)	A Ave & Chandler (TriMet Stop ID 2), Lake Oswego, OR, USA
 2)	W Burnside & SW 2nd (TriMet Stop ID 9526), Portland, OR, USA
 3)	NW Everett & 2nd (TriMet Stop ID 1612), Portland, OR, USA

/v1/autocomplete?text=2905

 1)	SE Hwy 212 & 122nd (TriMet Stop ID 2905), Clackamas County, OR, USA
 2)	2905 SW 1st Ave, Portland, OR, USA
 3)	2905 SW 4th Ave, Portland, OR, USA

So... explaining this one is a bit complex, but I'll give it a shot!

For autocomplete we split the input text into 'tokens' and mark each token as complete or incomplete. The gist is that every token except the last one is 'complete'; for the last one we don't know whether the user has finished typing that word or not.

An example query:

input: "stop 2"
tokens: [ "stop", "2" ]
complete: [ "stop" ]
incomplete: [ "2" ]
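This classification can be sketched in a few lines of JavaScript (a simplified illustration; `classifyTokens` is a hypothetical helper, not the actual Pelias tokenizer):

```javascript
// Split input text into whitespace-separated tokens and classify them.
// Every token except the last is considered complete; the final token may
// still be being typed, so it is treated as incomplete.
function classifyTokens(text) {
  const tokens = text.trim().split(/\s+/);
  return {
    tokens,
    complete: tokens.slice(0, -1),
    incomplete: tokens.slice(-1)
  };
}

classifyTokens('stop 2');
// → complete: ['stop'], incomplete: ['2']
```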

So before this PR we would already boost documents matching "stop" in the phrase index. This is important because it scored exactly matching phrases higher and didn't rely on the ngram index.

So this prevented "Clayton Avenue" and "Clay Ave" from scoring the same despite sharing common ngrams.

This works great when the first token is complete and the last one isn't.

This PR now covers the additional case where the last token is in fact also complete (something we cannot tell just from looking at the text).

So in the example above, we will generate one more subquery which also tries to match "stop 2" as a whole, and in this case it would find an exact match, giving that document an additional boost.
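In Elasticsearch terms, the structure looks roughly like this (a hand-written sketch of the query body with a hypothetical `buildAutocompleteQuery` helper; the real Pelias query builder differs in detail):

```javascript
// Sketch: ngram matching on name.default does the recall work, while
// phrase clauses boost exact matches. The second should-clause is the one
// this PR adds: it tries the *whole* input against the phrase index, in
// case the last token happens to be complete.
function buildAutocompleteQuery(completeTokens, incompleteTokens) {
  const input = completeTokens.concat(incompleteTokens).join(' ');
  return {
    bool: {
      must: [
        { match: { 'name.default': { query: input, analyzer: 'peliasQueryPartialToken' } } }
      ],
      should: [
        // existing boost: prior (complete) tokens against the phrase index
        { match_phrase: { 'phrase.default': completeTokens.join(' ') } },
        // new clause from this PR: the full input as an exact phrase
        { match_phrase: { 'phrase.default': input } }
      ]
    }
  };
}
```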

@missinglink (Member Author) commented Oct 9, 2018

I also looked into getting numerically ascending scoring in this case, such as:

?text=stop 2

Stop 2
Stop 20
Stop 21
...

... however, this is not currently possible due to how the peliasIndexOneEdgeGram treats numbers as a single token:

GET /pelias/_analyze?analyzer=peliasIndexOneEdgeGram&text=222

{
   "tokens": [
      {
         "token": "222",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 0
      }
   ]
}

This means we can't currently do prefix matching on numbers.
I suspect we did this because storing prefix ngrams for numbers would get pretty messy pretty fast :)
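The effect can be modelled with a toy edge-gram function (purely illustrative; the real analyzer is an Elasticsearch filter chain, not JavaScript):

```javascript
// Simplified model of the behaviour discussed above: alphabetic tokens are
// expanded into edge ngrams (min length 2), while purely numeric tokens
// are stored whole, so a numeric prefix like '29' has nothing to match.
function edgeGrams(token, minGram = 2) {
  if (/^\d+$/.test(token)) return [token]; // numbers kept as a single token
  const grams = [];
  for (let i = minGram; i <= token.length; i++) {
    grams.push(token.slice(0, i));
  }
  return grams;
}

edgeGrams('stop'); // → ['st', 'sto', 'stop']
edgeGrams('222');  // → ['222'] — no prefixes, so '22' cannot match it
```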

@missinglink (Member Author)

cc/ @fpurcell

@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from 4140535 to 20764b8 Compare October 15, 2018 20:19
@orangejulius (Member) commented Oct 16, 2018

Well, I can confirm this does exactly what it says on the tin. It boosts exact matches for autocomplete. Quite a bit. This does actually negatively affect the autocomplete acceptance test scores.

                                   master  autocomplete_boost_exact_matches   Change
Pass                                 3551                              3491   -1.69%
Fail                                 1683                              1685   +0.12%
Regressions                          4556                              4616   +1.32%
Percent                             53.46                             52.86   -1.12%
Search passes, autocomplete fails     233                               213   -8.58%
Jitter                                 73                               105  +43.84%
Pass only on last char                 23                                33  +43.48%

Overall though, the code looks great, and I believe the principle is sound. Perhaps we should look at lowering the boost for this query clause, so it's a little less powerful.

Notable new failures

Most of the failures seem to be while autocompleting admin area names

"Pari" (screenshots omitted)

"brook" (screenshot omitted)

Notable improvements

/v1/autocomplete?text=100 20th
(screenshot omitted)
This used to return streets that didn't match exactly, but now it returns exact matches.

@missinglink (Member Author) commented Oct 25, 2018

I'm still interested in merging this; the 'Failures' represent a change, but I'm not convinced it's a negative one.

With our current implementation, it's impossible to find some of those records, for example:

/v1/autocomplete?text=pari, italy

 1) Métropole du Grand Paris, Villejuif, France
 2) Paris, France
 3) Paris, France
 4) Paris, France
 5) Paris, France
 6) East Baton Rouge Parish, Central, LA, USA
 7) East Baton Rouge Parish, LA, USA
 8) Jefferson Parish, LA, USA
 9) Jefferson Parish, Lafitte, LA, USA
10) Pariaman, Indonesia

So this means that there are small towns with exact linguistic matches being drowned out by larger cities to the extent that they are no longer discoverable via autocomplete.

This change would allow those exact matches to be discovered, while still allowing the user to continue specifying more of the input until they receive the required result.

[edit] in fact the query /v1/autocomplete?text=pari, estonia returns exactly the same results as the query above.

@Joxit (Member) commented Oct 26, 2018

This also fixes a use case for one of my clients.
Here is his use case:

/v1/autocomplete?text=Schirrhein&sources=wof,oa,osm

 1) 9 Rue De Schirrhein, Soufflenheim, France
 2) 6 Rue De Schirrhein, Soufflenheim, France
 3) 18 Rue De Schirrhein, Soufflenheim, France
 4) 33 Rue De Schirrhein, Soufflenheim, France
 5) 10 Rue De Schirrhein, Soufflenheim, France
 6) 28 Rue De Schirrhein, Soufflenheim, France
 7) 11 Rue De Schirrhein, Soufflenheim, France
 8) 25 Rue De Schirrhein, Soufflenheim, France
 9) 1 Rue De Schirrhein, Soufflenheim, France
10) 24 Rue De Schirrhein, Soufflenheim, France

As a result he only gets full addresses, even though he is searching for the city.
With this PR, here is what he will get:

/v1/autocomplete?text=Schirrhein&sources=wof,oa,osm

 1) Schirrhein, France
 2) Schirrhein, France
 3) Schirrhein, France
 4) FC Schirrhein, Schirrhein, France
 5) 9 Rue De Schirrhein, Soufflenheim, France
 6) 6 Rue De Schirrhein, Soufflenheim, France
 7) 18 Rue De Schirrhein, Soufflenheim, France
 8) 33 Rue De Schirrhein, Soufflenheim, France
 9) 10 Rue De Schirrhein, Soufflenheim, France
10) 28 Rue De Schirrhein, Soufflenheim, France

@orangejulius (Member)

I'm coming around to this PR.

My main objection at this point is that it adds another potential query clause to our already complex autocomplete queries (as I've been digging through them in the last few days, the number of possible query layouts is actually quite staggering!).

I still think the boosting is a little too strong, but especially for the single token input case, it's clear we are currently under-scoring exactly matching results.

I'm open to ideas to tune it a bit, but if there aren't any that are immediately obvious, we can look to simplify our autocomplete queries separately, and it seems like the good far outweighs the bad already here!

@orangejulius (Member) commented Oct 29, 2018

The Brooklyn results concern me as well; it looks like our queries would now be predisposed to match fairly low-importance records that have multiple words and don't even start with the word from the input text. Any thoughts on how we could fix this? That would be a huge improvement.

@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from 20764b8 to 910a807 Compare November 1, 2018 17:53
@missinglink missinglink force-pushed the autocomplete_boost_exact_matches branch from 910a807 to 8acbbaf Compare November 8, 2018 11:36
@missinglink (Member Author)

I wrote up a wiki article explaining the pros/cons of different autocomplete query styles https://github.com/pelias/api/wiki/Autocomplete-search-logic

@missinglink (Member Author)

@orangejulius I had a look at the autocomplete queries today following on from writing the wiki article linked above ^

I think that it's appropriate to generate 4 different queries for the hybrid approach, depending on the number of tokens provided in the input and their state of completion.

single token inputs

For a single-token input there are two options:

  • The token is complete (it was output by a parser)
  • The token is incomplete (it was not able to be parsed)

In the first situation (which is fairly rare), we produce one gram per token using the peliasQueryFullToken analyser and compare it against the phrase.default index; only one query clause is generated.

In the second situation we also produce one gram per token, this time using the peliasQueryPartialToken analyser, and query the name.default index; again only one query clause is generated.

This PR would add a second clause in this situation which additionally queries the phrase.default index, effectively boosting exact matches, so it would generate two clauses.

note: the difference between peliasQueryFullToken and peliasQueryPartialToken is simply how confidently synonym substitution is applied

multiple token inputs

For a multi-token input there are also two options:

  • The last token is complete (it was output by a parser)
  • The last token is incomplete (it was not able to be parsed)

I won't repeat myself, but basically the same thing happens for the last token as explained above.
In both cases we generate a query against phrase.default for the prior tokens (all but the last one).

So it generates two clauses for each of these options, and would produce a third clause with this PR.
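Putting the four permutations together, the clause selection could be summarised like this (a descriptive sketch; `queryClauses` and its return shapes are hypothetical, with field and analyser names taken from the discussion above):

```javascript
// Sketch of the four autocomplete query permutations: single vs multi
// token input, crossed with the last token being complete or incomplete.
function queryClauses(tokens, lastTokenComplete) {
  const clauses = [];
  const last = tokens[tokens.length - 1];
  const prior = tokens.slice(0, -1);

  if (prior.length > 0) {
    // multi-token input: prior tokens always target the phrase index
    clauses.push({ field: 'phrase.default', tokens: prior });
  }
  if (lastTokenComplete) {
    clauses.push({ field: 'phrase.default', analyzer: 'peliasQueryFullToken', token: last });
  } else {
    clauses.push({ field: 'name.default', analyzer: 'peliasQueryPartialToken', token: last });
    // clause added by this PR: also try the whole input against the phrase
    // index, in case the last token is in fact complete
    clauses.push({ field: 'phrase.default', tokens });
  }
  return clauses;
}
```

So a single incomplete token yields two clauses, and a multi-token input with an incomplete last token yields three, matching the counts described above.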

thoughts / conclusions

I think the "hybrid" approach is superior to both the "exact" and "loose" methods; it looks impossible to avoid two clauses for a multi-token input using "hybrid".

We might be able to reduce complexity by dropping support for the "rare cases" where all tokens have been classified as complete, or at least we can just use normal search at this point. This would reduce the complexity to two possible query permutations.

I'm not 100% sure on the reasoning behind the constant_score clause, this will need more investigation.

I still think adding a third clause will 'tip the balance' of "hybrid" more towards "strict" than "loose" but I feel that while it's a change, it's a marked improvement in user experience.

We should spend a little more time tuning this balance between "strict" and "loose" to a point we feel is appropriate; I agree it might be a little too far in the direction of "strict" right now.

@missinglink (Member Author) commented Nov 9, 2018

Also worth noting that our min_ngram size is currently set at 2, so if a clause would be generated against the name.default index testing a single-character token, that clause is discarded.

We might be able to get rid of this too, but we'd need to check the queries to ensure we're not doing something like a MUST on that clause, which would return zero results.
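The discard rule can be sketched like this (illustrative only; `ngramClauseFor` is a hypothetical helper):

```javascript
// Sketch: discard ngram clauses for tokens shorter than min_ngram (2),
// since the name.default index stores no single-character grams and a
// mandatory clause against it would match nothing.
const MIN_NGRAM = 2;

function ngramClauseFor(token) {
  if (token.length < MIN_NGRAM) return null; // clause discarded
  return { match: { 'name.default': token } };
}

ngramClauseFor('a');  // → null
ngramClauseFor('av'); // → { match: { 'name.default': 'av' } }
```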

@missinglink (Member Author)

I just wrote some unit tests to cover all the different autocomplete query permutations in #1240

@missinglink missinglink force-pushed the autocomplete_boost_exact_matches branch from 33f082a to c0cd6b8 Compare November 11, 2018 17:00
@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from c0cd6b8 to 9206a23 Compare December 20, 2018 16:52
@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from 9206a23 to a92bd8e Compare January 11, 2019 17:13
@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from a92bd8e to f2847af Compare January 14, 2019 22:17
@orangejulius (Member)

I took a look at modifying this PR to lower the boost for exact matches a little. It didn't seem to have any effect, but might be worth a little more experimentation.

@missinglink (Member Author) commented Aug 9, 2021

This PR is so old now it's unlikely it can ever be merged, mainly because it uses the addressit parser which has long since been retired.

It's a bit of a pity though; I was investigating a report similar to the issues described here, and I wonder if we can update the code and still get the benefits shown above with the more modern parser.
