improved boosting for exact matches where the last token is in fact complete #1202
base: master
Conversation
I also looked into getting numerically ascending scoring in this case, such as:
... however, this is not currently possible due to how the
This means we can't currently do prefix matching on numbers.
cc/ @fpurcell
(force-pushed from 4140535 to 20764b8)
Well, I can confirm this does exactly what it says on the tin. It boosts exact matches for autocomplete. Quite a bit. This does actually negatively affect the autocomplete acceptance test scores.
Overall though, the code looks great, and I believe the principle is sound. Perhaps we should look at lowering the boost for this query clause, so it's a little less powerful.

Notable new failures
Most of the failures seem to be while autocompleting admin area names.

Notable improvements
/v1/autocomplete?text=100 20th
I'm still interested in merging this; the 'Failures' represent a change, but I'm not convinced it's negative. With our current implementation, it's impossible to find some of those records, for example:
So this means that there are small towns with exact linguistic matches being drowned out by larger cities to the extent that they are no longer discoverable via autocomplete. This change would allow those exact matches to be discovered, while still allowing the user to continue specifying more of the input until they receive the required result. [edit] in fact the query
This also fixes a use case for one of my clients. They only have complete addresses, except that they search for the city.
I'm coming around to this PR. My main objection at this point is that it adds another potential query clause to our already complex autocomplete queries (as I've been digging through them in the last few days, the number of possible query layouts is actually quite staggering!). I still think the boosting is a little too strong, but especially for the single token input case, it's clear we are currently under-scoring exactly matching results. I'm open to ideas to tune it a bit, but if there aren't any that are immediately obvious, we can look to simplify our autocomplete queries separately, and it seems like the good far outweighs the bad already here!
The Brooklyn results concern me as well; it looks like our queries would now be predisposed to match fairly low-importance records that have multiple words and don't even start with the word from the input text. Any thoughts on how we could fix this? That would be a huge improvement.
(force-pushed from 20764b8 to 910a807)
(force-pushed from 910a807 to 8acbbaf)
I wrote up a wiki article explaining the pros/cons of different autocomplete query styles: https://github.com/pelias/api/wiki/Autocomplete-search-logic
@orangejulius I had a look at the autocomplete queries today, following on from writing the wiki article linked above ^

I think it's appropriate to generate 4 different queries for the hybrid approach, depending on the number of tokens provided in the input and their state of completion.

single token inputs
For a single-token input there are two options:
In the first situation (which is fairly rare), we produce one gram per token using the
In the second situation we also produce one gram per token using the
This PR would add a second clause in this situation which additionally queries the
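As a rough, hypothetical illustration of this single-token case, the two clauses could look something like the following Elasticsearch bool query, built here as a plain JavaScript object. The field names (name.default for the ngram index, phrase.default for the phrase index) and the boost are assumptions for illustration, not taken from this thread:

```js
// Hypothetical sketch of the single-token layout described above.
// 'name.default' (ngram-analysed) and 'phrase.default' (phrase-analysed)
// are assumed field names; the boost value is also illustrative.
const singleTokenQuery = {
  bool: {
    should: [
      // existing clause: match the (possibly partial) token against ngrams
      { match: { 'name.default': { query: 'stop' } } },

      // clause this PR would add: if the token happens to be a complete
      // word, an exact match in the phrase index boosts the document
      { match_phrase: { 'phrase.default': { query: 'stop', boost: 1 } } }
    ]
  }
};

console.log(JSON.stringify(singleTokenQuery, null, 2));
```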
multiple token inputs
For a multi-token input there are also two options:
I won't repeat myself, but basically the same thing happens as explained above for the last token. So it generates two clauses for each of these options, and would produce a third clause with this PR (see the sketch below).

thoughts / conclusions
I think the "hybrid" approach is superior to both the "exact" and "loose" methods; it looks impossible to avoid two clauses for a multi-token input using "hybrid".

We might be able to reduce complexity by dropping support for the "rare cases" where all tokens have been classified as complete, or at least we can just use normal search at this point. This would reduce the complexity to two possible query permutations. I'm not 100% sure on the reasoning behind the

I still think adding a third clause will 'tip the balance' of "hybrid" more towards "strict" than "loose", but I feel that while it's a change, it's a marked improvement in user experience. We should spend a little more time tuning this balance between "strict" and "loose" to a point we feel is appropriate; I agree it might be a little too far in the direction of "strict" right now.
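To make the multi-token case concrete, here is a minimal, hypothetical sketch of those three clauses as an Elasticsearch bool query, using 'stop 2' as the input ('stop' complete, '2' possibly incomplete). The field names and the must/should split are assumptions for illustration only, not taken from the actual pelias query code:

```js
// Hypothetical sketch of the multi-token "hybrid" layout including the
// third clause this PR adds. Field names and structure are illustrative.
const multiTokenQuery = {
  bool: {
    must: [
      // clause 1: the possibly-incomplete last token against the ngram index
      { match: { 'name.default': { query: '2' } } }
    ],
    should: [
      // clause 2: the token(s) we know are complete, against the phrase index
      { match_phrase: { 'phrase.default': { query: 'stop' } } },

      // clause 3 (this PR): the whole input as an exact phrase, which only
      // contributes when the last token turns out to be a complete word
      { match_phrase: { 'phrase.default': { query: 'stop 2' } } }
    ]
  }
};
```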
Also worth noting that our
We might be able to get rid of this too, but we'd need to check the queries to ensure we're not doing something like a
I just wrote some unit tests to cover all the different autocomplete query permutations in #1240
(force-pushed from 33f082a to c0cd6b8)
(force-pushed from c0cd6b8 to 9206a23)
(force-pushed from 9206a23 to a92bd8e)
(force-pushed from a92bd8e to f2847af)
I took a look at modifying this PR to lower the boost for exact matches a little. It didn't seem to have any effect, but might be worth a little more experimentation.
This PR is so old now it's unlikely it can ever be merged, mainly because it uses the
It's a bit of a pity though. I was investigating a report similar to the issues described here; I wonder if we can update the code and still get the benefits shown above, but with the more modern parser.
Today I was looking into the scoring of queries for autocomplete transit stops and managed to fix some incorrect scoring in some cases:
before:
after:
So... explaining this one is a bit complex, but I'll give it a shot!
For autocomplete we split the input text into 'tokens', and then we mark those tokens as complete or incomplete. The gist of this is that every token except the last one is 'complete'; for the last one we don't know whether the user has finished typing that word or not.
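A minimal sketch of that completion rule, for illustration only (this is not the actual pelias tokenizer):

```js
// Split the input text into tokens and mark every token except the last
// as 'complete'; the last token may still be in the process of being typed.
function markTokens(text) {
  const tokens = text.trim().split(/\s+/);
  return tokens.map((body, i) => ({
    body,
    complete: i < tokens.length - 1
  }));
}

console.log(markTokens('stop 2'));
// [ { body: 'stop', complete: true }, { body: '2', complete: false } ]
```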
An example query: stop 2
So before this PR we would already boost documents matching stop in the phrase index. This is important because it boosted exact-matching phrases higher and didn't use the ngram index, so it prevented "Clayton Avenue" and "Clay Ave" from scoring the same despite sharing common ngrams.
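To see why the ngram index alone can't separate those two, here is a toy edge-ngram sketch (not the actual pelias analyzer configuration):

```js
// Toy edge-ngram generator: 'clayton' and 'clay' share the grams
// 'c', 'cl', 'cla' and 'clay', so an ngram-only match treats them alike.
function edgeNgrams(token) {
  const grams = [];
  for (let i = 1; i <= token.length; i++) grams.push(token.slice(0, i));
  return grams;
}

console.log(edgeNgrams('clayton')); // [ 'c', 'cl', 'cla', 'clay', 'clayt', 'clayto', 'clayton' ]
console.log(edgeNgrams('clay'));    // [ 'c', 'cl', 'cla', 'clay' ]
```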
This works great when the first token is complete and the last one isn't.
This PR now covers the additional case where the last token is in fact also complete (something we cannot tell from just looking at the text).
So in the example above, we will generate one more subquery which also tries to match stop 2, and in this case it would find an exact match, giving that document an additional boost.
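As a rough illustration of the two phrase-index clauses for the 'stop 2' example (phrase.default is an assumed field name; the actual clause and its boost are generated by the pelias query code):

```js
// Hypothetical sketch: clauses that query the phrase index for 'stop 2'.
const phraseClauses = [
  // existing: only the tokens we know are complete ('stop')
  { match_phrase: { 'phrase.default': { query: 'stop' } } },

  // added by this PR: the full input, in case the last token ('2') is also
  // complete; an exact match here gives the document an additional boost
  { match_phrase: { 'phrase.default': { query: 'stop 2' } } }
];

console.log(JSON.stringify(phraseClauses, null, 2));
```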