Users fuzzy search enhancements #541

Merged
merged 3 commits into from
Sep 29, 2024

@Petitoto (Member) commented Aug 25, 2024

Description

This PR improves the results of the user fuzzy search.

The search algorithm is refactored into a simpler form:

  • for each user:
    • assign a score based on the query, the user's attributes and a similarity algorithm
    • insert the score and its corresponding user into a sorted list
    • remove the lowest score from the list if its size exceeds the limit parameter (so the list always stays sorted and holds the N=limit best results)
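
The loop above can be sketched with the standard `bisect` module. This is an illustration under assumptions, not the code from this PR: `top_n` and its parameters are hypothetical names, and `score` stands for any function that returns a similarity for one user.

```python
import bisect
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def top_n(items: Iterable[T], score: Callable[[T], float], limit: int) -> List[T]:
    """Return the `limit` best-scoring items, best first (hypothetical helper)."""
    if limit <= 0:
        return []
    # Tuples are kept in ascending order of score. The negated index is unique,
    # so ties are resolved without ever comparing the items themselves, and an
    # earlier item wins over a later one with the same score.
    best: List[Tuple[float, int, T]] = []
    for index, item in enumerate(items):
        bisect.insort(best, (score(item), -index, item))
        if len(best) > limit:
            best.pop(0)  # drop the current lowest score
    return [item for _, _, item in reversed(best)]
```

Because the list never holds more than `limit + 1` entries, each insertion stays cheap even when the number of users is large.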

The score of each user corresponds to the highest Jaro-Winkler similarity between the query and:

  • firstname
  • name
  • firstname + name
  • name + firstname
  • nickname (if it exists)
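
A minimal sketch of this scoring, assuming a textbook Jaro-Winkler (prefix weight 0.1, common prefix capped at 4 characters) written in pure Python so that it runs without dependencies. A library implementation would normally be used instead and may differ in details. The names `jaro`, `jaro_winkler` and `user_score`, and the single space used to join firstname and name, are assumptions made for illustration:

```python
from typing import Optional


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def user_score(query: str, firstname: str, name: str,
               nickname: Optional[str] = None) -> float:
    """Highest similarity between the query and the candidate strings."""
    # Assumption: firstname and name are joined with a single space.
    candidates = [firstname, name, f"{firstname} {name}", f"{name} {firstname}"]
    if nickname:
        candidates.append(nickname)
    return max(jaro_winkler(query, candidate) for candidate in candidates)
```

Taking the maximum means a query only has to resemble one of the strings to rank a user highly, which matches the idea that we do not know in advance which attribute the query targets.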

This method aims to fit real searches: a query usually targets one of these 5 strings, but we don't know which one. The higher the similarity between the query and one of them, the more likely the query targets it. For well-constructed queries, false positives will always come after the good results.

Before running the Jaro-Winkler algorithm, all strings are "unaccentuated" (stripped of their accents) to make the similarity algorithm insensitive to accents. Moreover, queries from the /users/search endpoint are "capworded" (the first letter of each word is capitalized). We assume that queries from this endpoint are often the beginning of a name / firstname / nickname. This way, a query like max matches Maxou better than Kmax, which is more likely what the end user is looking for.
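
Both normalization steps can be sketched with the standard library alone. This assumes that "unaccentuated" means dropping combining marks after Unicode decomposition and that "capworded" behaves like `string.capwords`; the helper names `unaccent` and `normalize_query` are hypothetical and may not match the PR's code.

```python
import string
import unicodedata


def unaccent(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query: str) -> str:
    """Strip accents, then capitalize the first letter of each word."""
    # string.capwords also lowercases the remaining letters and collapses
    # runs of whitespace into single spaces.
    return string.capwords(unaccent(query))
```

With these helpers, `normalize_query("max")` gives `Max` and `unaccent("Hélène")` gives `Helene`. Since Jaro-Winkler rewards a shared prefix and is case-sensitive, capitalizing the query lets it line up with the capitalized start of a stored name.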

This method gave better results than the alternatives on limited subsets of users (~30) and queries (~20), while remaining among the fastest. Other methods tested include:

  • previous sort_user() function
  • SequenceMatcher from the standard difflib library
  • Jaro-Winkler algorithm from RapidFuzz library (however, on longer strings, RapidFuzz claims to be faster than Jellyfish)
  • Indel algorithm from RapidFuzz library
  • Damerau-Levenshtein algorithm from RapidFuzz library
  • partial_ratio() from RapidFuzz library
  • token_ratio() from RapidFuzz library
  • partial_token_ratio() from RapidFuzz library

On the tested data with a limit parameter of 10, the new function introduced by this PR takes 50% more time than the old one.
This is due to more similarities being computed (with 4 similarities instead of 5, both take approximately the same time).
However, it still seems reasonable for production use, and it performs better than the other tested algorithms, especially when the list size or the number of users in the database grows.
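
A rough way to reproduce this kind of comparison is a small timing harness. The sketch below is not the benchmark used for this PR: `time_similarity` and `sequence_ratio` are hypothetical names, and `difflib.SequenceMatcher` stands in as the only candidate because it ships with Python. Any other similarity function taking two strings could be passed in its place.

```python
import timeit
from difflib import SequenceMatcher
from typing import Callable, Sequence


def sequence_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def time_similarity(similarity: Callable[[str, str], float],
                    queries: Sequence[str],
                    names: Sequence[str],
                    repeat: int = 3) -> float:
    """Best wall-clock time (seconds) to compare every query with every name."""
    def run() -> None:
        for query in queries:
            for name in names:
                similarity(query, name)

    # Keep the minimum of several runs to reduce noise from other processes.
    return min(timeit.repeat(run, number=1, repeat=repeat))
```

Timings depend heavily on the machine and on string lengths, so only relative differences measured on the same data set are meaningful.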

Checklist

  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the documentation, if necessary

codecov bot commented Aug 25, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.83%. Comparing base (91302a3) to head (57f861a).
Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| app/utils/tools.py | 93.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #541      +/-   ##
==========================================
+ Coverage   81.77%   81.83%   +0.06%     
==========================================
  Files         125      125              
  Lines        9485     9504      +19     
==========================================
+ Hits         7756     7778      +22     
+ Misses       1729     1726       -3     


@armanddidierjean (Member) left a comment

This is brilliant!

@Rotheem Rotheem merged commit 0a1ac0a into main Sep 29, 2024
9 checks passed
@Rotheem Rotheem deleted the fuzzy-search-improve branch September 29, 2024 14:18

3 participants