-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Add example file to show answer to Issue #252
- Loading branch information
Showing
1 changed file
with
59 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
"""Issue #252 | ||
Question: | ||
In the documents and tutorials the main thing I haven't found is examples on how to break sentences down into small sub thoughts/chunks. The noun_chunks is handy, but having examples on using the token.head to find small (near-complete) sentence chunks would be neat. | ||
Lets take the example sentence on https://api.spacy.io/displacy/index.html | ||
displaCy uses CSS and JavaScript to show you how computers understand language | ||
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown: | ||
[displaCy] uses CSS and Javascript [to + show] | ||
& | ||
show you how computers understand [language] | ||
I'm assuming that we can use the token.head to build these groups. In one of your examples you had the following function. | ||
def dependency_labels_to_root(token): | ||
'''Walk up the syntactic tree, collecting the arc labels.''' | ||
dep_labels = [] | ||
while token.head is not token: | ||
dep_labels.append(token.dep) | ||
token = token.head | ||
return dep_labels | ||
""" | ||
from __future__ import print_function, unicode_literals | ||
|
||
# Answer: | ||
# The easiest way is to find the head of the subtree you want, and then use the | ||
# `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the | ||
# one that does what you're asking for most directly: | ||
|
||
from spacy.en import English | ||
nlp = English() | ||
|
||
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language') | ||
for word in doc: | ||
if word.dep_ in ('xcomp', 'ccomp'): | ||
print(''.join(w.text_with_ws for w in word.subtree)) | ||
|
||
# It'd probably be better for `word.subtree` to return a `Span` object instead | ||
# of a generator over the tokens. If you want the `Span` you can get it via the | ||
# `.right_edge` and `.left_edge` properties. The `Span` object is nice because | ||
# you can easily get a vector, merge it, etc. | ||
|
||
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language') | ||
for word in doc: | ||
if word.dep_ in ('xcomp', 'ccomp'): | ||
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1] | ||
print(subtree_span.text, '|', subtree_span.root.text) | ||
print(subtree_span.similarity(doc)) | ||
print(subtree_span.similarity(subtree_span.root)) | ||
|
||
|
||
# You might also want to select a head, and then select a start and end position by | ||
# walking along its children. You could then take the `.left_edge` and `.right_edge` | ||
# of those tokens, and use it to calculate a span. | ||
|
||
|
||
|