Users fuzzy search enhancements (#541)
### Description

This PR improves users fuzzy search results. The search algorithm is refactored in a simpler way:
- for each user:
  - compute a score based on the query, the user attributes and a similarity algorithm
  - insert the score and its corresponding user into a sorted list
  - remove the lowest score from the list if its size exceeds the `limit` parameter (thus always keeping the list sorted with the `N=limit` best results)

The score of each user corresponds to the highest Jaro-Winkler similarity between the query and:
- firstname
- name
- firstname + name
- name + firstname
- nickname (if it exists)

This method aims to fit real searches: queries usually relate to one of these 5 strings, but we don't know which one. The higher the similarity between the query and one of them, the more likely the query refers to it. For well-constructed queries, false positives will always come after the good results.

Before running the Jaro-Winkler algorithm, all strings are stripped of accents ("unaccentuated") to make the similarity algorithm insensitive to accents. Moreover, queries from the `/users/search` endpoint are "capworded": we assume that queries from this endpoint are often the beginning of a name / firstname / nickname. This way, a query like `max` matches `Maxou` better than `Kmax`, which probably corresponds better to the end user's search. (A minimal sketch of this scoring approach is included at the end of this description.)

This method has proven to give better results on limited subsets of users (~30) and queries (~20), while remaining one of the fastest. Other methods tested include:
- the previous `sort_user()` function
- `SequenceMatcher` from the standard `difflib` library
- the Jaro-Winkler algorithm from the RapidFuzz library (although, on longer strings, RapidFuzz claims to be faster than Jellyfish)
- the Indel algorithm from the RapidFuzz library
- the Damerau-Levenshtein algorithm from the RapidFuzz library
- `partial_ratio()` from the RapidFuzz library
- `token_ratio()` from the RapidFuzz library
- `partial_token_ratio()` from the RapidFuzz library

On the tested data with a `limit` parameter of 10, the new function introduced by this PR takes 50% more time than the old one. This is due to more similarities being computed (with 4 candidate strings instead of 5, both take approximately the same time). However, it still seems reasonable for production use, and it performs better than the other tested algorithms, especially as `limit` or the number of users in the database increases.

### Checklist

- [ ] Created tests which fail without the change (if possible)
- [x] All tests passing
- [ ] Extended the documentation, if necessary

---------

Co-authored-by: Petitoto <[email protected]>
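For reference, here is a minimal Python sketch of the scoring approach described above. It is not the exact code of this PR: the `firstname`/`name`/`nickname` attribute names and the use of a min-heap (instead of the sorted list described above, which behaves the same for keeping the top `N=limit` results) are assumptions for illustration, and it assumes Jellyfish's `jaro_winkler_similarity` as the similarity function.

```python
# Minimal sketch of the scoring approach described above (not the exact PR code).
import heapq
import unicodedata
from string import capwords

import jellyfish


def unaccent(text: str) -> str:
    """Strip accents so the similarity algorithm is accent-insensitive."""
    return "".join(
        c for c in unicodedata.normalize("NFKD", text) if not unicodedata.combining(c)
    )


def score_user(query: str, firstname: str, name: str, nickname: str | None) -> float:
    """Highest Jaro-Winkler similarity between the query and the 5 candidate strings."""
    candidates = [firstname, name, f"{firstname} {name}", f"{name} {firstname}"]
    if nickname:
        candidates.append(nickname)
    query = unaccent(query)
    return max(
        jellyfish.jaro_winkler_similarity(query, unaccent(candidate))
        for candidate in candidates
    )


def search_users(query: str, users, limit: int = 10):
    """Return the `limit` best-scoring users for the query."""
    # Queries from /users/search are assumed to start a name/firstname/nickname,
    # hence the "capwording" of the query.
    query = capwords(query)
    best: list[tuple[float, int]] = []  # min-heap of (score, user index)
    for i, user in enumerate(users):
        score = score_user(query, user.firstname, user.name, user.nickname)
        heapq.heappush(best, (score, i))
        if len(best) > limit:
            heapq.heappop(best)  # drop the current lowest score
    # Highest scores first
    return [users[i] for _, i in sorted(best, reverse=True)]
```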