Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mappings: add ngram analyzer #127

Merged
merged 7 commits into from
Aug 22, 2024

Conversation

jrcastro2
Copy link
Contributor

@jrcastro2 jrcastro2 commented Feb 21, 2024

@carlinmack
Copy link
Contributor

carlinmack commented Jul 17, 2024

does this need rebasing? it seems like v2.0.0 was edited by Alex a few days after this PR https://github.com/inveniosoftware/invenio-users-resources/blob/master/invenio_users_resources/records/mappings/os-v1/users/user-v2.0.0.json

* Allows partial matches on search
* closes CERNDocumentServer/cds-rdm#114
@carlinmack carlinmack force-pushed the improve-user-search branch from b6792e7 to 20bbd8e Compare July 17, 2024 08:42
@carlinmack
Copy link
Contributor

carlinmack commented Jul 18, 2024

Cases which are failing:

query matching usernames my hypothesis
preferences.visibility:restricted ['pubres'] res
[email protected] ['pub', 'pubres'] inveniosoftware (from user email) and res
[email protected] ['pubres', 'pub'] pub
email:[email protected] ['pub'] inveniosoftware (from user email)
email:[email protected] ['pub'] inveniosoftware (from user email)
preferences.email_visibility:restricted ['pubres'] res
invalid:test ['pub'] inv

@carlinmack
Copy link
Contributor

carlinmack commented Jul 18, 2024

The tests that were failing were searches which should produce no results. They were failing because the n-gram analyser was taking n-grams of the word (pubres -> pub, ... res) and matching on the query. We have swapped to an edge_ngram tokeniser with min 2 max 20.

This means it will match from 2 characters up to 20 from the beginning of the token. This won't have a huge impact on the number of tokens (compared with n-gram) as it is dependant on the length of the token. If we had 2-5 on edge_ngram then it would not produce tokens for 6+ letter words

We also had to change two of the test cases to be specific matches

image

All the tests now pass as they do not match on the search queries

@carlinmack carlinmack force-pushed the improve-user-search branch 3 times, most recently from 609b9bd to bd0afa2 Compare July 22, 2024 14:16
@carlinmack
Copy link
Contributor

Copy link
Contributor Author

@jrcastro2 jrcastro2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

@carlinmack carlinmack force-pushed the improve-user-search branch from bd0afa2 to a23a3e6 Compare July 23, 2024 15:08
@carlinmack
Copy link
Contributor

New test case fails on current master

image

but passes with PR

image

@ntarocco ntarocco merged commit 5080d8e into inveniosoftware:master Aug 22, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants