Contractions do not have the correct lemma #717

Closed
kootenpv opened this issue Jan 2, 2017 · 8 comments
Labels
bug Bugs and behaviour differing from documentation lang / en English language data and models

Comments

Contributor

kootenpv commented Jan 2, 2017

I made a pip package called contractions to expand contractions, but it is rather slow (even though I tried to optimise for speed). I did that before working with spaCy :)

# seetree is my function that wraps contractions and spacy.nlp
seetree("yall don't want nothin' to do with it")

 Input: You all do not want nothing to do with it

 want	(VERB,	ROOT)
---- You	(PRON,	nsubj)
-------- all	(DET,	appos)
---- do	(VERB,	aux)
---- not	(ADV,	neg)
---- nothing	(NOUN,	dobj)
-------- do	(VERB,	relcl)
------------ to	(PART,	aux)
------------ with	(ADP,	prep)
---------------- it	(PRON,	pobj)

I'm mostly wondering why you handle it like this:

In [268]: list(nlp("You're happy"))[1].lemma_
Out[268]: "'re"

In [269]: list(nlp("You are happy"))[1].lemma_
Out[269]: 'be'

Why not replace 're with are so that the lemma would be correct?
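
To make the suggestion concrete, here is a minimal sketch in plain Python of mapping a clitic to its full form before the lemma lookup. It is not spaCy's implementation: `CLITIC_EXPANSIONS`, `LEMMAS` and `lemma_for` are hypothetical names, and both tables hold toy data invented for illustration.

```python
# Toy tables for illustration only; neither comes from spaCy.
CLITIC_EXPANSIONS = {"'re": "are", "'ve": "have", "n't": "not"}
LEMMAS = {"are": "be", "is": "be", "have": "have", "not": "not"}


def lemma_for(token):
    """Return a lemma for token, expanding known clitics first."""
    lowered = token.lower()
    expanded = CLITIC_EXPANSIONS.get(lowered, lowered)
    # Fall back to the expanded form itself when no lemma is listed.
    return LEMMAS.get(expanded, expanded)


print(lemma_for("'re"))
print(lemma_for("are"))
```

Both calls print `be`. The clitic 's is deliberately left out of the table, since it can stand for is, has, or a possessive and cannot be expanded without context.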

Contributor Author

kootenpv commented Jan 2, 2017

I looked it up in the code, and the expected lemma does seem to be defined there:

https://github.com/explosion/spaCy/blob/master/spacy/en/tokenizer_exceptions.py#L896

In [271]: list(nlp("You're happy"))[1].lemma
Out[271]: 536

In [272]: list(nlp("You are happy"))[1].lemma
Out[272]: 488

Strange, seems like a bug?

Member

honnibal commented Jan 2, 2017

Thanks for the report. The data definitely looks correct, so this seems like a bug.

I'm travelling today so can't easily check, so just to confirm: are you on the most recent version (1.5)?

Contributor Author

kootenpv commented Jan 2, 2017

Yes (same session :-)):

In [327]: spacy.__version__
Out[327]: '1.5.0'

Strangely, it is not a problem with all contractions.

In [328]: list(nlp("You have happiness"))[1].lemma
Out[328]: 484

In [329]: list(nlp("You've happiness"))[1].lemma
Out[329]: 484
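
A small check like the following could be used to compare contracted and expanded forms side by side. This is only a sketch: `PAIRS`, `find_mismatches`, `REPORTED` and `stub_lemma_of` are hypothetical names, and the stub is hard-coded to mimic the outputs reported in this thread. The string "have" for ID 484 is an assumption, since the thread only shows that the two IDs match.

```python
# Pairs of (contracted text, expanded text, index of the token to compare).
PAIRS = [
    ("You're happy", "You are happy", 1),
    ("You've happiness", "You have happiness", 1),
]


def find_mismatches(lemma_of, pairs=PAIRS):
    """Return the pairs whose contracted and expanded lemmas differ."""
    mismatches = []
    for contracted, expanded, index in pairs:
        got = lemma_of(contracted, index)
        want = lemma_of(expanded, index)
        if got != want:
            mismatches.append((contracted, got, want))
    return mismatches


# Stand-in for a real pipeline, mimicking the results reported above.
# The value "have" is assumed; the thread only shows matching IDs (484).
REPORTED = {
    ("You're happy", 1): "'re",
    ("You are happy", 1): "be",
    ("You've happiness", 1): "have",
    ("You have happiness", 1): "have",
}


def stub_lemma_of(text, index):
    return REPORTED[(text, index)]


print(find_mismatches(stub_lemma_of))
```

With a loaded pipeline `nlp`, the stub could be swapped for `lambda text, i: list(nlp(text))[i].lemma_`, which is the same expression used in the sessions above.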

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Jan 2, 2017
Member

honnibal commented Jan 2, 2017

I wonder whether the new exception data is being loaded for English... I think it might be preferring to load the exceptions in the model, using the (deprecated) text file.

If so, at least we know the tokenizer is in sync with the existing trained weights.
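
The hypothesis above can be pictured with a toy loader. Everything here is an assumption made for illustration: `load_exceptions` is a hypothetical function, not spaCy's loading code, and the contents of both tables (including the stale lemma in the model file) are invented.

```python
def load_exceptions(language_data, model_file_data=None):
    """Hypothetical loader: exceptions shipped with the model win if present."""
    if model_file_data is not None:
        return dict(model_file_data)
    return dict(language_data)


# Invented data: the newer language data has the right lemma,
# while the older model file is assumed to carry a stale one.
new_language_data = {"'re": {"LEMMA": "be"}}
old_model_file = {"'re": {"LEMMA": "'re"}}

with_model = load_exceptions(new_language_data, old_model_file)
without_model = load_exceptions(new_language_data)

print(with_model["'re"]["LEMMA"])
print(without_model["'re"]["LEMMA"])
```

Under this assumed precedence, the first lookup yields the stale `'re` and the second yields `be`, which would match the symptom reported in this issue.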

Member

ines commented Jan 2, 2017

seetree("yall don't want nothin' to do with it")

On a slightly unrelated note, I just realised that both yall and nothin' (and similar spellings) aren't yet covered in the tokenizer exceptions. I'll be adding those now, so they'll be available as soon as this issue is fixed.
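
For readers unfamiliar with the idea, a tokenizer exception can be thought of as a mapping from a surface string to the tokens it should be split into, each with its own attributes. The sketch below is plain Python with hypothetical names (`EXCEPTIONS`, `check_exceptions`, `lemmas`); the keys "ORTH" and "LEMMA" are ordinary strings here, and the two entries are illustrative guesses, not the ones that were actually added.

```python
# Hypothetical exception table: surface string -> list of token attributes.
EXCEPTIONS = {
    "yall": [
        {"ORTH": "y", "LEMMA": "you"},
        {"ORTH": "all", "LEMMA": "all"},
    ],
    "nothin'": [{"ORTH": "nothin'", "LEMMA": "nothing"}],
}


def check_exceptions(exceptions):
    """Raise ValueError if a rule's ORTH pieces do not rebuild its key."""
    for surface, tokens in exceptions.items():
        rebuilt = "".join(token["ORTH"] for token in tokens)
        if rebuilt != surface:
            raise ValueError(f"{surface!r} does not match {rebuilt!r}")


def lemmas(text, exceptions=EXCEPTIONS):
    """Split on whitespace and apply exception rules; return the lemmas."""
    result = []
    for word in text.split():
        rule = exceptions.get(word.lower())
        if rule is None:
            result.append(word.lower())
        else:
            result.extend(token["LEMMA"] for token in rule)
    return result


check_exceptions(EXCEPTIONS)
print(lemmas("yall want nothin'"))
```

This prints `['you', 'all', 'want', 'nothing']`. The check that the ORTH pieces concatenate back to the original string keeps the split lossless, so the original text can always be reconstructed from the tokens.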

Contributor Author

kootenpv commented Jan 2, 2017

@ines Feel free to have a look at https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py and see if there is anything else you'd like to add.

Member

ines commented Jan 2, 2017

@kootenpv Ah, this is perfect, thanks! 👍 There are definitely a few that we haven't covered.

@ines ines added the lang / en English language data and models label Jan 8, 2017
@ines ines added this to the Update lemmatizer and morphology milestone Feb 18, 2017
ines added a commit that referenced this issue Mar 13, 2017

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018