EN Tokenizer Error: 'shell' tokenized as 'she', 'll', etc. #847
Comments
Thanks for the report! If you install from master, it should be fixed now. We'll also make a bug fix release soon that will include those changes.
Thanks. Looking at en.tokenizer_exceptions.EXCLUDE_EXC, there is perhaps one other case that should be added: 'id', as in "The id and the ego...". Currently 'id' tokenizes as 'i', 'd', which it shouldn't in this case.
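For anyone reading along, here is a rough sketch of how an exclusion list like EXCLUDE_EXC could interact with generated contraction exceptions. This is a toy model, not spaCy's actual code: `PRONOUNS`, `SUFFIXES`, `build_exceptions` and `tokenize` are hypothetical names, and it assumes the exceptions come from pairing pronouns with apostrophe-less contraction suffixes.

```python
# Toy model only: these names are illustrative, not spaCy's internals.
PRONOUNS = ("i", "she", "he")
SUFFIXES = ("ll", "d")


def build_exceptions(exclude=()):
    """Map apostrophe-less contractions (e.g. "shell" for "she'll") to their parts."""
    exceptions = {p + s: [p, s] for p in PRONOUNS for s in SUFFIXES}
    # Words in the exclusion list are real words far more often than typos,
    # so they are dropped from the exception table.
    for word in exclude:
        exceptions.pop(word, None)
    return exceptions


def tokenize(text, exceptions):
    """Whitespace tokenizer that splits any word found in the exception table."""
    tokens = []
    for word in text.split():
        tokens.extend(exceptions.get(word, [word]))
    return tokens


default = build_exceptions()
patched = build_exceptions(exclude=("shell", "shed", "id"))

print(tokenize("The id and the ego", default))  # ['The', 'i', 'd', 'and', 'the', 'ego']
print(tokenize("The id and the ego", patched))  # ['The', 'id', 'and', 'the', 'ego']
```

The trade-off in this sketch is that once 'id' is excluded, a sloppily typed "id like that" is no longer split into 'i', 'd', so the choice comes down to which reading is more common in the text being processed.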
Thanks, good point! Thinking about it, this is actually a tricky one... In general, we prefer to base the default tokenizer exceptions on what's most common. If we come across … To deal with this problem, we've been thinking about adding a new method to the …
The 1.6.0 tokenizer is incorrectly tokenizing words that have a 'she' prefix.
Examples:
'This sea shell is unique' tokenizes 'shell' as 'she', 'll'
'The shovel is in the shed' tokenizes 'shed' as 'she', 'd'
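A small check along these lines could flag this kind of regression. It is only a sketch: `find_split_words` is a hypothetical helper, and `buggy_tokenize` is a stand-in that mimics the behaviour reported above rather than calling the real 1.6.0 tokenizer. It assumes the tokenizer returns a list of strings that concatenate back to each whitespace-delimited word.

```python
def find_split_words(tokenize, text):
    """Return (word, pieces) for each alphabetic word the tokenizer split apart."""
    tokens = tokenize(text)
    split_words = []
    pos = 0
    for word in text.split():
        pieces = []
        joined = ""
        # Consume tokens until they spell out the current word.
        while joined != word and pos < len(tokens):
            pieces.append(tokens[pos])
            joined += tokens[pos]
            pos += 1
        # Purely alphabetic words are not expected to be split at all.
        if word.isalpha() and len(pieces) > 1:
            split_words.append((word, pieces))
    return split_words


def buggy_tokenize(text):
    """Hypothetical stand-in reproducing the splits described in this issue."""
    bad = {"shell": ["she", "ll"], "shed": ["she", "d"]}
    tokens = []
    for word in text.split():
        tokens.extend(bad.get(word, [word]))
    return tokens


print(find_split_words(buggy_tokenize, "This sea shell is unique"))
print(find_split_words(buggy_tokenize, "The shovel is in the shed"))
```

To run it against a real tokenizer, pass any callable that takes a string and returns a list of token strings in place of `buggy_tokenize`.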