improved boosting for exact matches where the last token is in fact complete #1202

Open
wants to merge 2 commits into master
Conversation

@missinglink (Member) commented Oct 9, 2018

Today I was looking into how autocomplete queries for transit stops are scored, and managed to fix incorrect scoring in some cases:

before:

/v1/autocomplete?text=stop 2

 1)	SW 2nd & Alder (TriMet Stop ID 7095), Portland, OR, USA
 2)	W Burnside & SW 2nd (TriMet Stop ID 9526), Portland, OR, USA
 3)	SW 2nd & Taylor (TriMet Stop ID 7098), Portland, OR, USA
...

/v1/autocomplete?text=2905

 1)	2905 SW 1st Ave, Portland, OR, USA
 2)	2905 SW 4th Ave, Portland, OR, USA
 3)	2905 SE 17th Ave, Portland, OR, USA

after:

/v1/autocomplete?text=stop 2

 1)	A Ave & Chandler (TriMet Stop ID 2), Lake Oswego, OR, USA
 2)	W Burnside & SW 2nd (TriMet Stop ID 9526), Portland, OR, USA
 3)	NW Everett & 2nd (TriMet Stop ID 1612), Portland, OR, USA

/v1/autocomplete?text=2905

 1)	SE Hwy 212 & 122nd (TriMet Stop ID 2905), Clackamas County, OR, USA
 2)	2905 SW 1st Ave, Portland, OR, USA
 3)	2905 SW 4th Ave, Portland, OR, USA

So... explaining this one is a bit complex, but I'll give it a shot!

For autocomplete we split the input text into 'tokens' and mark each token as complete or incomplete. The gist is that every token except the last one is 'complete'; for the last one we don't know whether the user has finished typing that word or not.

An example query:

input: "stop 2"
tokens: [ "stop", "2" ]
complete: [ "stop" ]
incomplete: [ "2" ]
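This classification can be sketched in a few lines of JavaScript (a simplified illustration; `classifyTokens` is a hypothetical helper, not the actual Pelias tokenizer):

```javascript
// Split input text into whitespace-separated tokens and classify them.
// Every token except the last is considered complete; the final token may
// still be being typed, so it is treated as incomplete.
function classifyTokens(text) {
  const tokens = text.trim().split(/\s+/);
  return {
    tokens,
    complete: tokens.slice(0, -1),
    incomplete: tokens.slice(-1)
  };
}

classifyTokens('stop 2');
// → complete: ['stop'], incomplete: ['2']
```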

So before this PR we would already boost documents matching "stop" in the phrase index. This is important because it scored exactly matching phrases higher and didn't rely on the ngram index.

So this prevented "Clayton Avenue" and "Clay Ave" from scoring the same despite sharing common ngrams.

This works great when the first token is complete and the last one isn't.

This PR now covers the additional case where the last token is in fact also complete (something we cannot tell just from looking at the text).

So in the example above, we will generate one more subquery which also tries to match "stop 2" as a whole, and in this case it would find an exact match, giving that document an additional boost.
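In Elasticsearch terms, the structure looks roughly like this (a hand-written sketch of the query body with a hypothetical `buildAutocompleteQuery` helper; the real Pelias query builder differs in detail):

```javascript
// Sketch: ngram matching on name.default does the recall work, while
// phrase clauses boost exact matches. The second should-clause is the one
// this PR adds: it tries the *whole* input against the phrase index, in
// case the last token happens to be complete.
function buildAutocompleteQuery(completeTokens, incompleteTokens) {
  const input = completeTokens.concat(incompleteTokens).join(' ');
  return {
    bool: {
      must: [
        { match: { 'name.default': { query: input, analyzer: 'peliasQueryPartialToken' } } }
      ],
      should: [
        // existing boost: prior (complete) tokens against the phrase index
        { match_phrase: { 'phrase.default': completeTokens.join(' ') } },
        // new clause from this PR: the full input as an exact phrase
        { match_phrase: { 'phrase.default': input } }
      ]
    }
  };
}
```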

@missinglink (Member Author) commented Oct 9, 2018

I also looked into getting numerically ascending scoring in this case, such as:

?text=stop 2

Stop 2
Stop 20
Stop 21
...

... however, this is not currently possible due to how the peliasIndexOneEdgeGram treats numbers as a single token:

GET /pelias/_analyze?analyzer=peliasIndexOneEdgeGram&text=222

{
   "tokens": [
      {
         "token": "222",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 0
      }
   ]
}

This means we can't currently do prefix matching on numbers.
I suspect we did this because storing prefix ngrams for numbers would get pretty messy pretty fast :)
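The effect can be modelled with a toy edge-gram function (purely illustrative; the real analyzer is an Elasticsearch filter chain, not JavaScript):

```javascript
// Simplified model of the behaviour discussed above: alphabetic tokens are
// expanded into edge ngrams (min length 2), while purely numeric tokens
// are stored whole, so a numeric prefix like '29' has nothing to match.
function edgeGrams(token, minGram = 2) {
  if (/^\d+$/.test(token)) return [token]; // numbers kept as a single token
  const grams = [];
  for (let i = minGram; i <= token.length; i++) {
    grams.push(token.slice(0, i));
  }
  return grams;
}

edgeGrams('stop'); // → ['st', 'sto', 'stop']
edgeGrams('222');  // → ['222'] — no prefixes, so '22' cannot match it
```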

@missinglink (Member Author)

cc/ @fpurcell

@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from 4140535 to 20764b8 Compare October 15, 2018 20:19
@orangejulius (Member) commented Oct 16, 2018

Well, I can confirm this does exactly what it says on the tin. It boosts exact matches for autocomplete. Quite a bit. This does actually negatively affect the autocomplete acceptance test scores.

                                   master  autocomplete_boost_exact_matches   Change
Pass                                 3551                              3491   -1.69%
Fail                                 1683                              1685   +0.12%
Regressions                          4556                              4616   +1.32%
Percent                             53.46                             52.86   -1.12%
Search passes, autocomplete fails     233                               213   -8.58%
Jitter                                 73                               105  +43.84%
Pass only on last char                 23                                33  +43.48%

Overall though, the code looks great, and I believe the principle is sound. Perhaps we should look at lowering the boost for this query clause, so it's a little less powerful.

Notable new failures

Most of the failures seem to be while autocompleting admin area names

"Pari" (screenshots omitted)

"brook" (screenshot omitted)

Notable improvements

/v1/autocomplete?text=100 20th
(screenshot omitted)
This used to return streets that didn't match exactly, but now it returns exact matches.

@missinglink (Member Author) commented Oct 25, 2018

I'm still interested in merging this; the 'Failures' represent a change, but I'm not convinced it's a negative one.

With our current implementation, it's impossible to find some of those records, for example:

/v1/autocomplete?text=pari, italy

 1) Métropole du Grand Paris, Villejuif, France
 2) Paris, France
 3) Paris, France
 4) Paris, France
 5) Paris, France
 6) East Baton Rouge Parish, Central, LA, USA
 7) East Baton Rouge Parish, LA, USA
 8) Jefferson Parish, LA, USA
 9) Jefferson Parish, Lafitte, LA, USA
10) Pariaman, Indonesia

So this means that there are small towns with exact linguistic matches being drowned out by larger cities to the extent that they are no longer discoverable via autocomplete.

This change would allow those exact matches to be discovered, while still allowing the user to continue specifying more of the input until they receive the required result.

[edit] in fact the query /v1/autocomplete?text=pari, estonia returns exactly the same results as the query above.

@Joxit (Member) commented Oct 26, 2018

This also fixes a use case for one of my clients.
Here is his use case:

/v1/autocomplete?text=Schirrhein&sources=wof,oa,osm

 1) 9 Rue De Schirrhein, Soufflenheim, France
 2) 6 Rue De Schirrhein, Soufflenheim, France
 3) 18 Rue De Schirrhein, Soufflenheim, France
 4) 33 Rue De Schirrhein, Soufflenheim, France
 5) 10 Rue De Schirrhein, Soufflenheim, France
 6) 28 Rue De Schirrhein, Soufflenheim, France
 7) 11 Rue De Schirrhein, Soufflenheim, France
 8) 25 Rue De Schirrhein, Soufflenheim, France
 9) 1 Rue De Schirrhein, Soufflenheim, France
10) 24 Rue De Schirrhein, Soufflenheim, France

As a result he only gets full addresses, even though he is searching for the city.
With this PR, here is what he will get:

/v1/autocomplete?text=Schirrhein&sources=wof,oa,osm

 1) Schirrhein, France
 2) Schirrhein, France
 3) Schirrhein, France
 4) FC Schirrhein, Schirrhein, France
 5) 9 Rue De Schirrhein, Soufflenheim, France
 6) 6 Rue De Schirrhein, Soufflenheim, France
 7) 18 Rue De Schirrhein, Soufflenheim, France
 8) 33 Rue De Schirrhein, Soufflenheim, France
 9) 10 Rue De Schirrhein, Soufflenheim, France
10) 28 Rue De Schirrhein, Soufflenheim, France

@orangejulius (Member)

I'm coming around to this PR.

My main objection at this point is that it adds another potential query clause to our already complex autocomplete queries (as I've been digging through them in the last few days, the number of possible query layouts is actually quite staggering!).

I still think the boosting is a little too strong, but especially for the single token input case, it's clear we are currently under-scoring exactly matching results.

I'm open to ideas to tune it a bit, but if there aren't any that are immediately obvious, we can look to simplify our autocomplete queries separately, and it seems like the good far outweighs the bad already here!

@orangejulius (Member) commented Oct 29, 2018

The Brooklyn results concern me as well; it looks like our queries would now be predisposed to match fairly low-importance records that have multiple words and don't even start with the word from the input text. Any thoughts on how we could fix this? That would be a huge improvement.

@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from 20764b8 to 910a807 Compare November 1, 2018 17:53
@missinglink missinglink force-pushed the autocomplete_boost_exact_matches branch from 910a807 to 8acbbaf Compare November 8, 2018 11:36
@missinglink (Member Author)

I wrote up a wiki article explaining the pros/cons of different autocomplete query styles https://github.com/pelias/api/wiki/Autocomplete-search-logic

@missinglink (Member Author)

@orangejulius I had a look at the autocomplete queries today following on from writing the wiki article linked above ^

I think that it's appropriate to generate 4 different queries for the hybrid approach, depending on the number of tokens provided in the input and their state of completion.

single token inputs

For a single-token input there are two options:

  • The token is complete (it was output by a parser)
  • The token is incomplete (it was not able to be parsed)

In the first situation (which is fairly rare), we produce one gram per token using the peliasQueryFullToken analyser and compare it against the phrase.default index; only one query clause is generated.

In the second situation we also produce one gram per token, this time using the peliasQueryPartialToken analyser, and query the name.default index; again only one query clause is generated.

This PR would add a second clause in this situation which additionally queries the phrase.default index, effectively boosting exact matches, so it would generate two clauses.

note: the difference between peliasQueryFullToken and peliasQueryPartialToken is simply how confidently synonym substitution is applied

multiple token inputs

For a multi-token input there are also two options:

  • The last token is complete (it was output by a parser)
  • The last token is incomplete (it was not able to be parsed)

I won't repeat myself, but basically the same thing happens for the last token as explained above.
In both cases we generate a query against phrase.default for the prior tokens (all but the last one).

So it generates two clauses for each of these options, and would produce a third clause with this PR.
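Putting the four permutations together, the clause selection could be summarised like this (a descriptive sketch; `queryClauses` and its return shapes are hypothetical, with field and analyser names taken from the discussion above):

```javascript
// Sketch of the four autocomplete query permutations: single vs multi
// token input, crossed with the last token being complete or incomplete.
function queryClauses(tokens, lastTokenComplete) {
  const clauses = [];
  const last = tokens[tokens.length - 1];
  const prior = tokens.slice(0, -1);

  if (prior.length > 0) {
    // multi-token input: prior tokens always target the phrase index
    clauses.push({ field: 'phrase.default', tokens: prior });
  }
  if (lastTokenComplete) {
    clauses.push({ field: 'phrase.default', analyzer: 'peliasQueryFullToken', token: last });
  } else {
    clauses.push({ field: 'name.default', analyzer: 'peliasQueryPartialToken', token: last });
    // clause added by this PR: also try the whole input against the phrase
    // index, in case the last token is in fact complete
    clauses.push({ field: 'phrase.default', tokens });
  }
  return clauses;
}
```

So a single incomplete token yields two clauses, and a multi-token input with an incomplete last token yields three, matching the counts described above.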

thoughts / conclusions

I think the "hybrid" approach is superior to both the "exact" and "loose" methods; it looks impossible to avoid two clauses for a multi-token input using "hybrid".

We might be able to reduce complexity by dropping support for the "rare cases" where all tokens have been classified as complete, or at least we can just use normal search at this point. This would reduce the complexity to two possible query permutations.

I'm not 100% sure on the reasoning behind the constant_score clause, this will need more investigation.

I still think adding a third clause will 'tip the balance' of "hybrid" more towards "strict" than "loose" but I feel that while it's a change, it's a marked improvement in user experience.

We should spend a little more time tuning this balance between "strict" and "loose" to a point we feel is appropriate; I agree it might be a little too far in the direction of "strict" right now.

@missinglink (Member Author) commented Nov 9, 2018

Also worth noting that our min_ngram size is currently set at 2, so if a clause would be generated against the name.default index testing a single-character token, that clause is discarded.

We might be able to get rid of this too, but we'd need to check the queries to ensure we're not doing something like a MUST on that clause, which would return zero results.
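The discard rule can be sketched like this (illustrative only; `ngramClauseFor` is a hypothetical helper):

```javascript
// Sketch: discard ngram clauses for tokens shorter than min_ngram (2),
// since the name.default index stores no single-character grams and a
// mandatory clause against it would match nothing.
const MIN_NGRAM = 2;

function ngramClauseFor(token) {
  if (token.length < MIN_NGRAM) return null; // clause discarded
  return { match: { 'name.default': token } };
}

ngramClauseFor('a');  // → null
ngramClauseFor('av'); // → { match: { 'name.default': 'av' } }
```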

@missinglink (Member Author)

I just wrote some unit tests to cover all the different autocomplete query permutations in #1240

@missinglink missinglink force-pushed the autocomplete_boost_exact_matches branch from 33f082a to c0cd6b8 Compare November 11, 2018 17:00
@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from c0cd6b8 to 9206a23 Compare December 20, 2018 16:52
@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from 9206a23 to a92bd8e Compare January 11, 2019 17:13
@orangejulius orangejulius force-pushed the autocomplete_boost_exact_matches branch from a92bd8e to f2847af Compare January 14, 2019 22:17
@orangejulius (Member)

I took a look at modifying this PR to lower the boost for exact matches a little. It didn't seem to have any effect, but might be worth a little more experimentation.

@missinglink (Member Author) commented Aug 9, 2021

This PR is so old now it's unlikely it can ever be merged, mainly because it uses the addressit parser which has long since been retired.

It's a bit of a pity though; I was investigating a report similar to the issues described here, and I wonder if we can update the code and still get the benefits shown above with the more modern parser.
