Users fuzzy search enhancements (#541)
### Description

This PR improves the users fuzzy search results.

The search algorithm has been refactored into a simpler form:
- for each user:
    - compute a score based on the query, the user's attributes and a similarity algorithm
    - insert the score and its corresponding user into a sorted list
    - remove the lowest score from the list if its size exceeds the `limit` parameter (thus always keeping the list sorted with the `N=limit` best results)

The score of each user corresponds to the highest Jaro-Winkler similarity
between the query and:
- firstname
- name
- firstname + name
- name + firstname
- nickname (if it exists)

This method aims to fit real searches: queries are often related to one
of these 5 strings, but we don't know which one. The higher the
similarity between the query and one of them, the more likely the query
is related to it. For well-constructed queries, false positives will
always come after the good results.

Before running the Jaro-Winkler algorithm, all strings are "unaccented"
to make the similarity algorithm insensitive to accents. Moreover,
queries from the `/users/search` endpoint are "capworded" (each word is
capitalized). We assume that queries from this endpoint are often the
beginning of a name / firstname / nickname. This way, a query like `max`
will match `Maxou` better than `Kmax`, which likely corresponds better
to the end user's intent.
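Both normalization steps come straight from the standard library; a minimal sketch (the PR applies `unaccent` inside `sort_user` and `string.capwords` in the endpoint):

```python
import string
import unicodedata

def unaccent(s: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining marks
    return unicodedata.normalize("NFKD", s).encode("ASCII", "ignore").decode("utf8")

# Applied to every compared string, so accented and plain spellings score identically
print(unaccent("Héloïse"))            # Heloise
# Applied to /users/search queries only: capitalize each word like a name
print(string.capwords("max dupont"))  # Max Dupont
```

Capitalizing the query matters because Jaro-Winkler gives extra weight to a matching prefix: `"Max"` shares a prefix with `"Maxou"` but not with `"Kmax"`.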

This method has proven to give better results on limited subsets of
users (~30) and queries (~20), while remaining among the fastest methods
tested. Other methods tested include:
- the previous `sort_user()` function
- `SequenceMatcher` from the standard library's difflib module
- Jaro-Winkler algorithm from the RapidFuzz library (although on longer
strings, RapidFuzz claims to be faster than Jellyfish)
- Indel algorithm from RapidFuzz library
- Damerau-Levenshtein algorithm from RapidFuzz library
- `partial_ratio()` from RapidFuzz library
- `token_ratio()` from RapidFuzz library
- `partial_token_ratio()` from RapidFuzz library
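A comparison like the one above can be reproduced with a small timing harness; this is a hedged sketch where the dataset, queries, and the difflib scorer are placeholders (the PR benchmarked the listed libraries on its own data):

```python
from difflib import SequenceMatcher
from timeit import timeit

# Placeholder dataset, roughly the size mentioned above (~30 users, ~20 queries)
users = [("Maxime", "Dupont"), ("Kevin", "Maxwell"), ("Anna", "Kmax")] * 10
queries = ["max", "dupont", "kev", "anna"] * 5

def seq_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def run(scorer) -> None:
    # Score every query against every user's full name
    for q in queries:
        for first, last in users:
            scorer(q, f"{first} {last}")

# Each candidate algorithm would be timed the same way and compared
elapsed = timeit(lambda: run(seq_ratio), number=10)
print(f"SequenceMatcher: {elapsed:.4f}s")
```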

On the tested data with a `limit` parameter of 10, the new function
introduced by this PR takes 50% more time than the old one.
This is due to more similarities being computed (with 4 similarities
instead of 5, both take approximately the same time).
However, it still seems reasonable for production use, and it performs
better than the other tested algorithms, especially as `limit` or the
number of users in the database increases.

### Checklist

- [ ] Created tests which fail without the change (if possible)
- [x] All tests passing
- [ ] Extended the documentation, if necessary

---------

Co-authored-by: Petitoto <[email protected]>
Petitoto and Petitoto authored Sep 29, 2024
1 parent 9cf7dd5 commit 0a1ac0a
Showing 2 changed files with 27 additions and 41 deletions.
**`app/core/users/endpoints_users.py`** — 5 additions, 4 deletions

```diff
@@ -1,5 +1,6 @@
 import logging
 import re
+import string
 import uuid
 from datetime import UTC, datetime, timedelta
@@ -103,9 +104,9 @@ async def search_users(
     user: models_core.CoreUser = Depends(is_user_an_ecl_member),
 ):
     """
-    Search for a user using Jaro_Winkler distance algorithm. The
-    `query` will be compared against users name, firstname and nickname
+    Search for a user using Jaro_Winkler distance algorithm.
+    The `query` will be compared against users name, firstname and nickname.
+    Assume that `query` is the beginning of a name, so we can capitalize words to improve results.
     **The user must be authenticated to use this endpoint**
     """
@@ -116,7 +117,7 @@ async def search_users(
         excluded_groups=excludedGroups,
     )
 
-    return sort_user(query, users)
+    return sort_user(string.capwords(query), users)
```
**`app/utils/tools.py`** — 22 additions, 37 deletions

```diff
@@ -1,7 +1,9 @@
+import bisect
 import logging
 import os
 import re
 import secrets
+import unicodedata
 from collections.abc import Sequence
 from pathlib import Path
 from typing import TYPE_CHECKING, TypeVar
@@ -80,50 +82,33 @@ def sort_user(
     Search for users using Fuzzy String Matching
     `query` will be compared against `users` name, firstname and nickname.
     Accents will be ignored.
     The size of the answer can be limited using `limit` parameter.
-    Use Jellyfish library
+    Use Jaro-Winkler algorithm from Jellyfish library.
     """
 
-    # TODO: we may want to cache this object. Its generation may take some time if there is a big user base
-    names = [f"{user.firstname} {user.name}" for user in users]
-    nicknames = [user.nickname for user in users]
-    scored: list[
-        tuple[CoreUser, float, float, int]
-    ] = [  # (user, name_score, nickname_score, index)
-        (
-            user,
-            jaro_winkler_similarity(query, name),
-            jaro_winkler_similarity(query, nickname) if nickname else 0,
-            index,
-        )
-        for index, (user, name, nickname) in enumerate(
-            zip(users, names, nicknames, strict=True),
-        )
-    ]
-
-    results = []
-    for _ in range(min(limit, len(scored))):
-        maximum_name = max(scored, key=lambda r: r[1])
-        maximum_nickname = max(scored, key=lambda r: r[2])
-        if maximum_name[1] > maximum_nickname[2]:
-            results.append(maximum_name)
-            scored[maximum_name[3]] = (  # We don't want to use this user again
-                maximum_name[0],
-                -1,
-                -1,
-                maximum_name[3],
-            )
-        else:
-            results.append(maximum_nickname)
-            scored[maximum_nickname[3]] = (  # We don't want to use this user again
-                maximum_nickname[0],
-                -1,
-                -1,
-                maximum_nickname[3],
-            )
-
-    return [result[0] for result in results]
+    def unaccent(s: str) -> str:
+        return unicodedata.normalize("NFKD", s).encode("ASCII", "ignore").decode("utf8")
+
+    query = unaccent(query)
+    scored: list[tuple[CoreUser, float]] = []
+    for user in users:
+        firstname = unaccent(user.firstname)
+        name = unaccent(user.name)
+        nickname = unaccent(user.nickname) if user.nickname else None
+        score = max(
+            jaro_winkler_similarity(query, firstname),
+            jaro_winkler_similarity(query, name),
+            jaro_winkler_similarity(query, f"{firstname} {name}"),
+            jaro_winkler_similarity(query, f"{name} {firstname}"),
+            jaro_winkler_similarity(query, nickname) if nickname else 0,
+        )
+        bisect.insort(scored, (user, score), key=(lambda s: s[1]))
+        if len(scored) > limit:
+            scored.pop(0)
+
+    return [user for user, _ in reversed(scored)]
```
