NER doesn't identify lowercase entities #701
Hi, We're working on NER models that are less case sensitive, but in the meantime, there are a few ways to exert rule-based control of the NER, to fix these cases. For single tokens, you could use the tokenizer exceptions as follows:
You can read more about the tokenizer exceptions here: https://spacy.io/docs/usage/customizing-tokenizer

The tokenizer exceptions solution works well for single words, but doesn't help you with something like 'south korea'. For that you could use the rule matcher: https://spacy.io/docs/usage/rule-based-matching . Remember to add an on_match callback to actually assign the entities: the matcher itself just identifies the spans, so you still need to set the attributes.

The problem in general is that both the tagger and entity recogniser make use of several feature functions that are case sensitive. This is good in general, but can be problematic for certain text types.

Here's a suggestion I've been thinking about for a while, but haven't played with yet. It probably takes a little bit of tuning. The relevant feature functions ask about the word's "shape", whether it's upper case, whether it's lower case, and its distributional similarity cluster. We can redefine the values of these features for specific words, and thereby trick the models into making a different decision. To do this, first look up the word in spaCy's vocabulary, to get the relevant Lexeme object:
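To make the "shape" feature concrete, here is a rough, library-independent sketch of how such a feature can be computed. It is an approximation for illustration only, not spaCy's actual implementation; the function name `word_shape` and the run-length cap of 4 are assumptions of this sketch.

```python
def word_shape(text, max_run=4):
    """Map a token to a coarse shape: upper -> X, lower -> x, digit -> d.

    Runs of the same shape character are truncated after `max_run`
    characters, so long words collapse to a short pattern.
    (Illustrative sketch only; not taken from spaCy's source.)
    """
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for token in ("Germany", "germany", "GERMANY"):
    print(token, word_shape(token))
```

The three spellings get three different shapes, which is one reason a model trained mostly on well-edited text treats "germany" and "GERMANY" so differently from "Germany".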
For a more systematic approach, we can find all words that are usually title-cased:
You probably want some margin on the probability, but for now we'll just take everything that's more common in title-case than in lower-case. Now we iterate over these usually titled words, look up the lower-case version, and rewrite the features:
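A library-independent sketch of that loop, using a plain dict in place of spaCy's vocabulary. Everything here is hypothetical: the `vocab` layout, the helper names, and the probability and cluster values are made up for illustration, and a real version would set the corresponding attributes on the Lexeme objects (including the shape, which would be copied the same way).

```python
def make_entry(word, prob, cluster):
    """Build a toy lexical entry; `prob` stands in for a log probability."""
    return {
        "prob": prob,
        "cluster": cluster,
        "is_title": word.istitle(),
        "is_lower": word.islower(),
        "is_upper": word.isupper(),
    }


# Invented numbers: "Germany" is more frequent than "germany",
# while "the" is more frequent than "The".
vocab = {
    "Germany": make_entry("Germany", -11.0, 42),
    "germany": make_entry("germany", -14.5, 7),
    "The": make_entry("The", -6.0, 3),
    "the": make_entry("the", -3.5, 1),
}


def usually_titled(vocab, margin=0.0):
    """Yield title-cased words more probable than their lower-case form."""
    for word, feats in vocab.items():
        lower = word.lower()
        if feats["is_title"] and lower in vocab:
            if feats["prob"] > vocab[lower]["prob"] + margin:
                yield word


def copy_case_features(vocab, margin=0.0):
    """Overwrite the lower-case entry's features with the title-case ones."""
    changed = []
    for word in list(usually_titled(vocab, margin)):
        source, target = vocab[word], vocab[word.lower()]
        for key in ("is_title", "is_lower", "is_upper", "cluster"):
            target[key] = source[key]
        changed.append(word.lower())
    return changed


changed = copy_case_features(vocab)
print(changed)
```

The `margin` parameter is where the "some margin on the probability" idea would go: raising it keeps words that are only slightly more common in title case from being rewritten.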
At first glance, this appears to work:
If you give this a try, please let us all know how you go :) |
Thanks for such a great reply. I got re-tasked to another project temporarily but will be coming back to this soon. I'm also tempted to see how many false positives I'd get if I simply title-cased a query before passing it to the NER. |
@honnibal Regarding the last option, in short, we're comparing the smoothed log probability estimates of each title-cased word and its lower-case version in the vocab. If the probability of the lower-case version is less than the probability of the title-case version, then we assume the word is more likely to be title case. Next, we update the attributes of the lower-case version that are relevant for NER classification to match those of the title-case version, so the NER will think it's an entity. Neat! I'll have to get back to you on how well this works in my domain. Can you tell me more about the smoothed log probability estimate? I see how it is defined as a property in the lexeme code, but I'm interested in knowing how it is calculated. Couldn't find that part. Couple minor code changes:
|
Finding the probability for the lower case compared to title case, and then updating the token attributes to mark it as an entity, makes sense, but fails when generalised over a big set of data like names of persons. Would it be fine to train a model on a training set that includes the same set of lines in lower, upper and title case? |
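For anyone who does have their own annotated data, the case-augmentation idea could be sketched roughly like this. The `(text, [(start, end, label)])` layout and the function name `case_variants` are assumptions made for this example, not a specific spaCy training format.

```python
def case_variants(text, entities):
    """Return the example in its original, lower, upper and title case.

    `entities` is a list of (start, end, label) character offsets.
    A variant is skipped if re-casing changed the string length
    (e.g. 'ß'.upper() == 'SS'), because the offsets would no longer
    line up with the text.
    """
    variants = [(text, entities)]
    seen = {text}
    for recase in (str.lower, str.upper, str.title):
        new_text = recase(text)
        if len(new_text) == len(text) and new_text not in seen:
            seen.add(new_text)
            variants.append((new_text, entities))
    return variants


example = ("I flew to South Korea", [(10, 21, "GPE")])
augmented = case_variants(*example)
for new_text, ents in augmented:
    start, end, label = ents[0]
    print(repr(new_text), new_text[start:end], label)
```

Because changing case normally preserves string length, the same offsets can be reused for every variant; the length check guards the rare characters where that is not true.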
@Spawnakshay If you have the training data yourself, then yes forcing to lower-case makes sense. The complication is that I can't ship you the training data I'm using, because of licensing constraints. @bluefuzz01 : The log probability was estimated from counts over the Reddit comment corpus 2009-2015 (~80b tokens), smoothed using Simple Good-Turing estimation (Gale's publication "Good-Turing estimation without the tears"). The smoothing implementation is in the |
The new version 1.8.0 comes with bug fixes to the NER training procedure and a new We've also updated the docs with more information on training and NER training in particular:
I hope this helps! |
For anyone who checks this part of the issue tracker, one easy way to mitigate this is to run your poorly-formatted text through Truecaser first, then apply the NER. |
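To show the general idea of truecasing before NER, here is a deliberately minimal unigram sketch. It is not the Truecaser tool mentioned above: the function names, the tiny corpus, and the "most frequent casing wins" rule are all assumptions of this example, and a real truecaser would use far more data and context.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the most frequent surface form of each lower-cased word.

    Sentence-initial tokens are skipped, since their capitalisation
    says little about how the word is normally written.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(text, model):
    """Replace each token with its learned casing; unknown tokens are kept."""
    return " ".join(model.get(tok.lower(), tok) for tok in text.split())


# Tiny made-up corpus of well-formatted text.
corpus = [
    "We visited the United States last year",
    "Trade between the United States and South Korea grew",
    "She moved to South Korea in May",
]
model = train_truecaser(corpus)
restored = truecase("i flew from the united states to south korea", model)
print(restored)
```

The restored text would then be passed to the NER in place of the raw input. Words the model has never seen are left untouched, so coverage depends entirely on the corpus used to train it.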
Truecaser seems a good solution for the case issue in NER with spaCy, but the model is big and might not be suitable for real-time applications. @arjunmenon do you have a smaller (but less accurate) model? |
Hey @arezae PS - sorry for getting back late. Couldn't keep track of this issue. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
As the title suggests, entities in lower case are not recognized as entities. I also noticed entities in upper case are not recognized either. It seems to only recognize entities with title/proper case:
EX: United States but not united states or UNITED STATES
Are there any plans to improve detection for these instances? Has anyone attempted this problem yet? If so, what did you do to deal with these cases?
Thanks!