Subject Object Extraction within Spacy #523

Closed
Mustyy opened this issue Oct 13, 2016 · 14 comments
Labels
enhancement Feature requests and improvements

Comments

@Mustyy

Mustyy commented Oct 13, 2016

Hi

I'm using the code written by nicschrading for Subject Verb Object extraction.
I'm wondering why the subject doesn't reflect the entities recognized by spaCy.
For example, take the sentence "Bloomberg announced today that Gordian Capital, a Singapore-based institutional fund management platform, will implement the Bloomberg Entity Exchange solution to help its clients pursue new fund opportunities faster."

SVO = "('capital', 'implement', 'solution'), ('clients', 'pursue', 'opportunities')"

Is there a way to make the subject "Gordian Capital" instead of just "capital"?

Thank you

@honnibal
Member

Try

import spacy

nlp = spacy.load('en')
doc = nlp(u'Bloomberg announced today that Gordian Capital, a Singapore-based institutional fund management platform, will implement the Bloomberg Entity Exchange solution to help its clients pursue new fund opportunities faster.')

for ent in list(doc.ents):
    ent.merge(ent.tag_, ent.text, ent.ent_type_)

If the entity recogniser is picking up Gordian Capital as a named entity, then this should retokenize it, so that you get one token. This makes the subsequent logic much easier to write.

An alternative solution is to use word.subtree, or word.left_edge and word.right_edge. This allows you to get the section of the dependency tree, instead of just the subject word.

You can get a feel for this using the displaCy visualizer: https://demos.explosion.ai/displacy/?text=Bloomberg%20announced%20today%20that%20Gordian%20Capital%2C%20a%20Singapore-based%20institutional%20fund%20management%20platform%2C%20will%20implement%20the%20Bloomberg%20Entity%20Exchange%20solution%20to%20help%20its%20clients%20pursue%20new%20fund%20opportunities%20faster.&model=en&cpu=1&cph=1

Toggle the option "collapse phrases" to see how the retokenization works.
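
To make the subtree idea concrete without needing a model, here is a minimal plain-Python sketch. It does not call spaCy: the `words` and `heads` lists are hand-written for illustration (not real parser output), and `subtree_indices` and `phrase` are hypothetical helper names. It only mimics what taking the span from a token's left edge to its right edge would give you.

```python
# Toy dependency parse: heads[i] is the index of token i's head.
# The root points to itself. These values are hand-written assumptions,
# not the output of any parser.
words = ["Gordian", "Capital", "will", "implement", "the", "solution"]
heads = [1, 3, 3, 3, 5, 3]


def subtree_indices(heads, i):
    """Return the sorted indices of token i and all of its descendants."""
    children = {}
    for child, head in enumerate(heads):
        if child != head:  # skip the root's self-loop
            children.setdefault(head, []).append(child)
    found, stack = [], [i]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


def phrase(words, heads, i):
    """Join the words from the leftmost to the rightmost token of i's subtree."""
    idx = subtree_indices(heads, i)
    left_edge, right_edge = idx[0], idx[-1]
    return " ".join(words[left_edge:right_edge + 1])


print(phrase(words, heads, 1))  # Gordian Capital
print(phrase(words, heads, 5))  # the solution
```

On a real parse the subtree of a subject can be much longer than the name itself (it would include an appositive such as "a Singapore-based institutional fund management platform"), so merging entities first may still be the simpler route.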

@Mustyy
Author

Mustyy commented Oct 17, 2016

Where is the script for this code?
Or rather can I insert this code into the svo script written by nicschrading?

Thank you though, the solution looks quite ideal.

@Mustyy Mustyy closed this as completed Oct 20, 2016
@Mustyy Mustyy reopened this Oct 20, 2016
@Mustyy
Author

Mustyy commented Oct 20, 2016

@honnibal
Hey
Thanks so much for the insights

One last thing:
Is there a way to use the indices of entities and tokens to extract better subjects and objects?
For instance, take the sentence "Today Morgan Stanley fires Vice President due to allegations of corruption".
The SVO = Stanley fires Vice

What I would like is for the token to extend further right and further left,
so that we end up with "Morgan Stanley fires Vice President":

Morgan Stanley being the subject
fires being the verb
Vice President or VP being the object

Perhaps something like a while loop: while the subject token list matches one or more entities,
add one more token to the list.

Thoughts?

Much Appreciated
Thank you

@honnibal
Member

honnibal commented Oct 20, 2016

The problem here seems to me to be that Morgan Stanley isn't found as a named entity. How about this:

import spacy


def merge_phrase(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    # Get Span objects
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge(label=label, tag='NNP' if label else span.root.tag_)

nlp = spacy.load('en')
nlp.matcher.add_entity('MorganStanley', on_match=merge_phrase)
nlp.matcher.add_pattern('MorganStanley', [{'orth': 'Morgan'}, {'orth': 'Stanley'}], label='ORG')
nlp.pipeline = [nlp.tagger, nlp.entity, nlp.matcher, nlp.parser]

# Okay, now we've got our pipeline set up...
doc = nlp(u'Morgan Stanley fires Vice President')
for word in doc:
    print(word.text, word.tag_, word.dep_, word.head.text, word.ent_type_)
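
The docstring's warning about changing token indices can be illustrated without spaCy. The sketch below is a toy model on plain lists, not the library's implementation, and it uses a different technique from the callback above: instead of collecting Span objects first, it merges the rightmost span first so that the remaining indices stay valid. The helper names `merge_span` and `merge_all` and the span offsets are assumptions made for this example.

```python
def merge_span(tokens, start, end):
    """Return a new list with tokens[start:end] joined into one token."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


def merge_all(tokens, spans):
    """Merge non-overlapping (start, end) spans, rightmost first,
    so that merging one span never shifts the indices of another."""
    for start, end in sorted(spans, reverse=True):
        tokens = merge_span(tokens, start, end)
    return tokens


tokens = ["Today", "Morgan", "Stanley", "fires", "Vice", "President"]
spans = [(1, 3), (4, 6)]  # hand-written offsets, end-exclusive

# Merging left to right with the original offsets goes wrong: after the
# first merge the list is shorter, so (4, 6) no longer covers "Vice President".
stale = merge_span(merge_span(tokens, 1, 3), 4, 6)
print(stale)
# ['Today', 'Morgan Stanley', 'fires', 'Vice', 'President']

print(merge_all(tokens, spans))
# ['Today', 'Morgan Stanley', 'fires', 'Vice President']
```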

@Mustyy
Author

Mustyy commented Oct 20, 2016

@honnibal
Hi

Hope all is well

Shouldn't Morgan Stanley be defined as an ORG, as in the bank Morgan Stanley?
And can I replace doc = nlp(u'Morgan Stanley fires Vice President') with
doc = nlp(u'Today Morgan Stanley fires Vice President due to allegations of corruption')?

@honnibal
Member

Gah. Short on sleep :p. Edited, thanks.

@Mustyy
Author

Mustyy commented Oct 20, 2016

Oh, I hope you get some rest soon.

Alright, I will replace doc = nlp(u'Morgan Stanley fires Vice President') with
doc = nlp(u'Today Morgan Stanley fires Vice President due to allegations of corruption')

and I will run it now to test this.

@Mustyy
Author

Mustyy commented Oct 20, 2016

@honnibal
Quick note: when I run it I get

Traceback (most recent call last):
  File "spacypipe1.py", line 20, in <module>
    nlp.matcher.add_entity('MorganStanley', on_match=merge_phrase)
AttributeError: 'spacy.matcher.Matcher' object has no attribute 'add_entity'

@honnibal
Member

What version are you running?

@Mustyy
Author

Mustyy commented Oct 20, 2016

@honnibal

I believe it's the latest one, on Python 3.
I will do an upgrade install now.

The big picture is to use these entities as replacements for subjects and objects when we output the SVO.
So the token would refer to the index of the entity to find out what completes "Stanley",
so going left once would result in Subject = "Morgan Stanley".
Does that make sense?

@honnibal
Member

Well, if you just want to go left one, you might want to look at the token.nbor() method and the token.i attribute.
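
As a rough illustration of the index-based idea, here is a plain-Python sketch that expands each SVO token to the entity span containing it. Nothing here calls spaCy: the `tokens` list and the `entities` offsets are hand-written assumptions standing in for token positions and recognised entities, and `expand_to_entity` and `expand_svo` are hypothetical helper names.

```python
# Hand-written stand-ins: token positions play the role of a token index,
# and each entity is (start, end_exclusive, label).
tokens = ["Today", "Morgan", "Stanley", "fires", "Vice", "President"]
entities = [(1, 3, "ORG")]


def expand_to_entity(tokens, entities, i):
    """Return the text of the entity covering position i, or the token itself."""
    for start, end, _label in entities:
        if start <= i < end:
            return " ".join(tokens[start:end])
    return tokens[i]


def expand_svo(tokens, entities, svo):
    """Expand a (subject, verb, object) triple of token positions."""
    return tuple(expand_to_entity(tokens, entities, i) for i in svo)


print(expand_svo(tokens, entities, (2, 3, 4)))
# ('Morgan Stanley', 'fires', 'Vice')
```

The object stays as "Vice" because no entity covers it in this toy input. Anything that is not recognised as an entity would need its own pattern, or the subtree approach mentioned earlier in the thread.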

@Mustyy
Author

Mustyy commented Oct 20, 2016

@honnibal Thank you.
Both options together would be ideal as well,
but I have to get the first part working first; it's still throwing the error.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018