Description
This PR improves fuzzy search results for users.
The search algorithm is refactored to be simpler:
- `limit` parameter (thus always keeping the list sorted with the N=`limit` best results)

The score of each user corresponds to the highest Jaro-Winkler similarity between the query and:
This method aims to fit real searches: queries are often related to one of these 5 strings, but we don't know which one. The higher the similarity is between the query and one of them, the more likely the query is related to it. For well-constructed queries, false positives will always come after the good results.
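The scoring idea can be sketched as follows. This is a minimal illustration, not the code from this PR: the `User` class, its `candidates` field, and the function names are hypothetical, the Jaro-Winkler implementation is a plain-Python stand-in for whatever library the PR uses, the comparison is assumed to be case-sensitive, and `heapq.nlargest` is swapped in for the sorted list of `limit` best results. The accent stripping and capwording mentioned in this description are included as preprocessing steps.

```python
import heapq
import string
import unicodedata
from dataclasses import dataclass, field


def unaccent(text: str) -> str:
    """Remove combining accents, e.g. 'Éloïse' becomes 'Eloise'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1] (case-sensitive)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, j = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (up to 4 characters)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1 - sim)


@dataclass
class User:  # hypothetical model, for illustration only
    id: str
    candidates: list[str] = field(default_factory=list)  # strings a query may refer to


def score_user(query: str, user: User) -> float:
    """Highest similarity between the query and any of the user's strings."""
    return max((jaro_winkler(query, unaccent(c)) for c in user.candidates),
               default=0.0)


def search_users(query: str, users: list[User], limit: int = 10,
                 capword: bool = True) -> list[User]:
    """Return the `limit` best-scoring users, best first."""
    q = unaccent(query)
    if capword:
        q = string.capwords(q)  # 'max' becomes 'Max'
    return heapq.nlargest(limit, users, key=lambda u: score_user(q, u))
```

With this sketch, the capworded query `max` scores about 0.91 against `Maxou` and about 0.72 against `Kmax`, so `Maxou` is ranked first. Without capwording the order is reversed, because the shared prefix bonus is lost.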
Before running the Jaro-Winkler algorithm, all strings are "unaccentuated" to make the similarity algorithm insensitive to accents. Moreover, queries from the `/users/search` endpoint are "capworded". We assume that queries from this endpoint are often the beginning of a name / firstname / nickname. This way, queries like `max` will better match `Maxou` than `Kmax`, which may better correspond to the end user's search.

This method has proven to give better results on limited subsets of users (~30) and queries (~20), while keeping one of the highest performance. Other methods tested include:
- `sort_user()` function
- `SequenceMatcher` from the standard `difflib` library
- `partial_ratio()` from the RapidFuzz library
- `token_ratio()` from the RapidFuzz library
- `partial_token_ratio()` from the RapidFuzz library

On the tested data with a `limit` parameter of 10, the new function introduced by this PR takes 50% more time than the old one. This is due to more similarities being computed (with 4 instead of 5, both take approximately the same time).

However, it still seems reasonable for production use, and is better than the other tested algorithms, especially when `list` or the number of users in the database increases.

Checklist