
Add noun_chunks to Span #658

Merged (2 commits into explosion:master) on Nov 24, 2016

Conversation

pokey
Contributor

@pokey pokey commented Nov 24, 2016

Add noun_chunks to Span

Description

Support iterating noun_chunks of a Span. Also add a doc attribute to the Doc class for uniformity. I wasn't sure of the best way to remove the code duplication between noun_chunks in Doc and noun_chunks in Span.

Motivation and Context

Useful to be able to find noun_chunks in a span rather than in the whole doc, e.g. all noun_chunks in a sentence.
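
For context, a quick sketch of the kind of usage this enables (illustrative only; the spacy.load('en') call assumes the English model of the time is installed under that name):

    import spacy

    nlp = spacy.load('en')
    doc = nlp(u"Employees are recruiting talented staffers from overseas.")

    for sent in doc.sents:             # each sentence is a Span
        print(list(sent.noun_chunks))  # noun chunks restricted to that sentence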

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality to spaCy)
  • Breaking change (fix or feature causing change to spaCy's existing functionality)
  • Documentation (Addition to documentation of spaCy)

Checklist:

  • My code follows spaCy's code style.
  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Member

@honnibal honnibal left a comment

Great patch! Very elegant implementation.

I think this patch is further evidence that we should add an AbstractBaseClass over Doc and Span. Then we could deduplicate these things and declare the interface better. I've avoided this so far because I usually favour a "you ain't gonna need it" approach to introducing abstractions. I think it's time to make the switch, though.

I think TokenSequence might be a good name for this.
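
For illustration only, one rough shape such a base class might take; the name TokenSequence comes from the comment above, but the method list is a guess rather than anything that exists in spaCy:

    from abc import ABCMeta, abstractmethod, abstractproperty

    class TokenSequence(object):
        # Python 2-style ABC declaration, matching the codebase of the era.
        __metaclass__ = ABCMeta

        @abstractmethod
        def __iter__(self):
            """Iterate over the tokens in the sequence."""

        @abstractmethod
        def __getitem__(self, i):
            """Index or slice into the token sequence."""

        @abstractproperty
        def doc(self):
            """The underlying Doc (a Doc would return itself, for uniformity)."""

        @abstractproperty
        def noun_chunks(self):
            """Yield base noun phrases as Span objects."""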

Update the German noun chunks iterator, so that it also works on Span objects.
@honnibal
Member

honnibal commented Nov 24, 2016

I noticed that the German iterator needed the same trick. It's unfortunately difficult to avoid some of this per-language duplication — I think the cure is likely to be worse than the disease, because there's no way to say for sure which parts of the logic different languages will always share.
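
Roughly, the trick is for each language's chunker to accept either a Doc or a Span and reach the Doc through the shared doc attribute. A simplified sketch of that pattern (not the actual spaCy source, and it leaves out the conjunction handling discussed below):

    from spacy.symbols import NOUN, PROPN, PRON

    def noun_chunks_for(obj):
        # obj may be a Doc or a Span; both expose .doc once the attribute
        # added in this PR is in place.
        doc = obj.doc
        np_deps = [doc.vocab.strings[label] for label in
                   ('nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj', 'attr', 'ROOT')]
        np_label = doc.vocab.strings['NP']
        for word in obj:
            if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label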

@honnibal honnibal merged commit 1f24795 into explosion:master Nov 24, 2016
@honnibal
Member

Btw @pokey — I'd like to add you to the contributors.md list, if that's okay? We usually list contributors as "Full name, username", but an alias instead of the full name would be fine, too. Let me know what you prefer :)

@pokey
Contributor Author

pokey commented Nov 24, 2016

Thanks :-)

Re TokenSequence: agreed. It would be nice to formalize the extent to which these two classes can be treated the same.

Re attribution: Pokey Rule, pokey

@pokey
Contributor Author

pokey commented Nov 24, 2016

Also, why do we use doc.vocab.strings at all? Why not just import the symbols from spacy.symbols? Then we wouldn't need to access the doc object.
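
For reference, the two approaches being contrasted (illustrative fragment; it assumes doc is an existing Doc, and whether the two routes give interchangeable IDs depends on the StringStore internals of the version at hand):

    # Fixed module-level constants; no Doc required.
    from spacy.symbols import NOUN, PROPN, PRON, nsubj, dobj

    # Lookup through the shared StringStore; requires a handle on the Doc
    # (or at least its Vocab).
    np_label = doc.vocab.strings['NP']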

@pokey
Contributor Author

pokey commented Nov 24, 2016

Finally, why don't we allow PROPN and PRON after conjunctions?

@tokestermw
Contributor

Hi, I've tested .noun_chunks on a Span and it doesn't seem to be working (on spaCy 1.8.2).

    doc = nlp(u"Employees are recruiting talented staffers from overseas.")
    list(doc.noun_chunks)  # [Employees, talented staffers]

    span = doc[0:4]
    list(span.noun_chunks)  # ERROR

    for sent in doc.sents:
        list(sent.noun_chunks)  # ERROR

And I get the following error.

/lib/python2.7/site-packages/spacy/tokens/span.pyx in __get__ (spacy/tokens/span.cpp:7229)()
    231             spans = []
    232             for start, end, label in self.doc.noun_chunks_iterator(self):
--> 233                 spans.append(Span(self, start, end, label=label))
    234             for span in spans:
    235                 yield span

TypeError: Argument 'doc' has incorrect type (expected spacy.tokens.doc.Doc, got spacy.tokens.span.Span)

@tokestermw
Contributor

Small hack, but I was able to use noun_chunks_iterator from the doc.

    [sent[left_i:right_i] for left_i, right_i, _ in sent.doc.noun_chunks_iterator(sent)]
    # [Employees, talented staffers]
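
The same workaround can be wrapped in a small helper (hypothetical name; it slices the parent Doc on the assumption that the iterator yields doc-level token offsets, as the English chunker does):

    def span_noun_chunks(span):
        doc = span.doc
        for start, end, _ in doc.noun_chunks_iterator(span):
            yield doc[start:end]

    for sent in doc.sents:
        print(list(span_noun_chunks(sent)))  # [Employees, talented staffers] for the example above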
