Normalize diacritics in player lookup #138
Comments
I am considering normalizing the accented and diacritical characters using a method I found here. This seems to slow down the lookup process considerably, because it has to check and normalize every character in every field of every player record. I am going to think about this and do some more testing.
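For reference, one common way to do this kind of normalization in Python uses only the standard library. This is a minimal sketch, not necessarily the method linked above; the helper name `strip_diacritics` is made up for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed, e.g. 'Acuña' -> 'Acuna'."""
    # NFKD decomposition splits an accented character into its base
    # character followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Ronald Acuña Jr."))
print(strip_diacritics("José Ramírez"))
```

Running this on every field of every record at lookup time is where the cost described above would come from, since the decomposition is repeated for each search.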
If that is not performant enough, how about adding an attribute to the player object like 'normalizedFullName', so the result would be pre-computed? (edit: or does that object come from the underlying API?)
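A rough sketch of the pre-computation idea, assuming player records are plain dicts with a 'fullName' key. The sample records, their ids, and the helper names are hypothetical, and 'normalizedFullName' is only the key proposed in this comment, not something the library is known to provide.

```python
import unicodedata


def strip_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def add_normalized_names(players):
    """Store a lowercased, diacritic-free copy of each name once, up front."""
    for player in players:
        name = player.get("fullName", "")
        player["normalizedFullName"] = strip_diacritics(name).lower()
    return players


# Hypothetical sample records; real records would come from the API.
players = [
    {"id": 1, "fullName": "Ronald Acuña Jr."},
    {"id": 2, "fullName": "Emmanuel Ramirez"},
]
add_normalized_names(players)

# Each lookup is now a plain substring test against the stored value.
matches = [p for p in players if "acuna" in p["normalizedFullName"]]
```

The normalization cost is paid once per record instead of once per search, at the price of carrying an extra field on every record.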
I realized last night that the API call is only requesting specific fields, so I checked if there's another field that includes the ASCII version of the name, and there is one: Full record, for reference:
I forgot to mention this issue in the commit/PR, but this is fixed in v1.7.2.
Great, thanks. 'Acuna' now works but 'Ronald Acuna' still doesn't. Would it make sense to split the input and search for the individual words?
If you agree on splitting the search query, want me to put in a PR? Would you also want a parameter for the function allowing the caller to choose what should qualify as a match? Options could be all words, any words, or full string. The "match all words" case would be just about as performant as the current behavior, since in most cases the first word will not match and we will move on. The "match any words" case would need a number of comparisons that grows linearly with the number of words (again assuming most players are not matches).
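To make the three options concrete, here is a minimal sketch. The function name `matches_query`, the `mode` parameter, and its values are hypothetical, and it compares against a single already-normalized string rather than the library's actual record fields.

```python
def matches_query(query, value, mode="all"):
    """Return True if query matches value under the chosen mode.

    mode="full": the whole query must appear as one substring.
    mode="all":  every word of the query must appear somewhere.
    mode="any":  at least one word of the query must appear.
    """
    value = value.lower()
    if mode == "full":
        return query.lower() in value
    words = query.lower().split()
    if not words:
        return False
    if mode == "all":
        # all() stops at the first word that does not match.
        return all(word in value for word in words)
    if mode == "any":
        # any() only stops early on a match, so a non-matching
        # player costs one check per word.
        return any(word in value for word in words)
    raise ValueError(f"unknown mode: {mode!r}")


name = "Ronald Acuna Jr."
print(matches_query("Acuna Ronald", name, mode="full"))
print(matches_query("Acuna Ronald", name, mode="all"))
print(matches_query("Ronald Smith", name, mode="any"))
```

The short-circuiting in `all()` and `any()` is what gives the cost difference described above: for a non-matching player, "all" usually stops after the first word, while "any" has to try every word.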
statsapi.lookup_player('Acuna') returns an empty list that does not include Ronald Acuña Jr., because 'Acuna' is not a direct text match due to the diacritic on the n. This effectively makes the player lookup feature unusable for search queries coming from user input: it may not be obvious to an end user that they need to use characters like this, and they may not have easy access to these characters on their keyboard. Additionally, some MLB players like Emmanuel Ramirez don't use diacritics in their names, while others like José Ramírez do, and users won't know when they have to use the diacritics to get their expected search results.