Wrong tokenization of "Shell" as "She, ll" #775

Closed
vikrantsharma7 opened this issue Jan 25, 2017 · 3 comments
Labels
lang / en English language data and models

Comments

@vikrantsharma7

vikrantsharma7 commented Jan 25, 2017

import spacy

nlp = spacy.load("en")
doc = nlp(u"Shell is a good brand.")
pretty_print(doc)  # custom function to print dependency tree

gives the following output:

is (VBZ) ROOT
l ---   ll (MD) nsubj
l ------   She (PRP) nsubj
r ---   brand (NN) attr
l ------   a (DT) det
l ------   good (JJ) amod
r ---   . (.) punct

"Shell" should not be tokenized into ["She", "ll"].
I could not find this mapping in tokenizer/specials.json either.

Any way to fix this?

  • Operating System: Ubuntu 16.04 LTS
  • Python Version Used: 2.7.12
  • spaCy Version Used: 1.6.0
@ines
Member

ines commented Jan 25, 2017

Ah, damn, this should have been added to the excluded tokenizer exceptions in spacy/en/tokenizer_exceptions.py. Will fix this right now and add a regression test.

The specials.json is the old, deprecated data btw. It's still in use in the current models, but this will change with the v2.0 release to keep things more consistent.
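
To illustrate the kind of fix described above, here is a minimal, self-contained sketch of how generated contraction exceptions can accidentally catch an ordinary word such as "Shell", and how an exclusion list prevents that. This is not spaCy's actual source: `PRONOUNS`, `EXCLUDED`, `build_exceptions` and `tokenize` are hypothetical names used only for this example, and the tokenizer is reduced to a whitespace split.

```python
# Simplified sketch, NOT spaCy's real code: every name here is hypothetical.
PRONOUNS = ["I", "he", "she", "we", "they"]

# Surface forms that are also ordinary words and must never be split.
EXCLUDED = {"Ill", "ill", "Hell", "hell", "Shell", "shell", "Well", "well"}


def build_exceptions(excluded=frozenset()):
    """Map apostrophe-less contractions such as "theyll" to their sub-tokens."""
    exceptions = {}
    for pronoun in PRONOUNS:
        # Cover the lowercase and capitalized spelling of each pronoun.
        for form in {pronoun, pronoun.lower(), pronoun.title()}:
            surface = form + "ll"
            if surface not in excluded:
                exceptions[surface] = [form, "ll"]
    return exceptions


def tokenize(text, exceptions):
    """Split on whitespace, then expand any word found in the exceptions table."""
    tokens = []
    for word in text.split():
        tokens.extend(exceptions.get(word, [word]))
    return tokens


buggy = build_exceptions()          # no exclusions: "Shell" gets split
fixed = build_exceptions(EXCLUDED)  # exclusions applied: "Shell" stays whole

print(tokenize("Shell is a good brand", buggy))
print(tokenize("Shell is a good brand", fixed))
```

Without the exclusion list the first call splits "Shell" into "She" and "ll", which matches the behaviour reported in this issue. With the list applied, "Shell" is left intact while genuinely unambiguous forms such as "theyll" are still split. A regression test for this would simply assert that tokenizing "Shell" yields a single token.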

@ines ines added lang / en English language data and models performance labels Jan 25, 2017
@ines ines closed this as completed in 209c37b Jan 25, 2017
ines added a commit that referenced this issue Jan 25, 2017
@vikrantsharma7
Author

Thanks! :)
