This page describes how to take output from Pyserini and run some basic NLP on it with spaCy.
First, install spaCy and download its small English model:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```
In this guide, we use the model `en_core_web_sm`, a small English model trained on written web text (blogs, news, comments). There are many other models supporting different languages; download the one that best fits your application.
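Any other model is downloaded the same way; for example (these are two of spaCy's published models, picked here just for illustration):

```bash
python -m spacy download en_core_web_md   # larger English model with word vectors
python -m spacy download de_core_news_sm  # small German model
```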
Use Pyserini's `SimpleSearcher` to fetch a document from the pre-built MS MARCO passage index `msmarco-passage`:
```python
import json
from pyserini.search import SimpleSearcher

# Initialize a searcher from a pre-built index
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

# Fetch raw text of a document given its docid
raw = searcher.doc('1').raw()

# Get actual content from raw
content = json.loads(raw)['contents']
print(content)
```
`content` should be as follows:
```
The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
```
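Fetching by docid is just one entry point; you can also retrieve passages with a query and process each hit. A minimal sketch (the query and `k` here are only illustrative):

```python
# Retrieve the top-ranked passages for a query
hits = searcher.search('manhattan project atomic bomb', k=3)

for hit in hits:
    # Each hit carries a docid and a retrieval score
    passage = json.loads(searcher.doc(hit.docid).raw())['contents']
    print(f'{hit.docid} {hit.score:.4f} {passage[:60]}...')
```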
Load spaCy's pre-trained model into a `Language` object called `nlp`, then call `nlp` on `content` to get a processed `Doc` object:
```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(content)
```
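As an aside, if you need to process many passages at once, `nlp.pipe` batches the work and is more efficient than calling `nlp` in a loop; a sketch assuming the first three docids of the index:

```python
# Hypothetical batch: contents of docids '0' through '2'
passages = [json.loads(searcher.doc(str(i)).raw())['contents'] for i in range(3)]

# nlp.pipe streams processed Doc objects, batching work internally
for d in nlp.pipe(passages):
    print(len(d), 'tokens')
```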
From `Doc`, we can apply spaCy's NLP features to our document. In this guide, we cover tokenization, POS tagging, NER, and sentence segmentation.
Each `Doc` object contains individual `Token` objects, and you can iterate over them:
```python
for token in doc:
    print(token.text)
```
The result should be as follows:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The | Manhattan | Project | and | its | atomic | bomb | helped | bring | an | end | to | World | War | II | . | ... |
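Tokens can also be addressed by position, and slicing a `Doc` yields a `Span`, which is convenient for grabbing phrases:

```python
# The first three tokens form a Span
span = doc[0:3]
print(span.text)  # The Manhattan Project
```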
There are many linguistic annotations contained in `Token`'s attributes, such as:

- TEXT: The original word text.
- LEMMA: The base form of the word.
- POS: The simple UPOS part-of-speech tag.
- DEP: Syntactic dependency, i.e., the relation between tokens.
- SHAPE: The word shape – capitalization, punctuation, digits.
- STOP: Is the token part of a stop list, i.e., the most common words of the language?
These attributes can be accessed easily:

```python
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.shape_, token.is_stop)
```
The output is shown in the following table:
| TEXT | LEMMA | POS | DEP | SHAPE | STOP |
|---|---|---|---|---|---|
| The | the | DET | det | Xxx | True |
| Manhattan | Manhattan | PROPN | compound | Xxxxx | False |
| Project | Project | PROPN | nsubj | Xxxxx | False |
| and | and | CCONJ | cc | xxx | True |
| its | -PRON- | DET | poss | xxx | True |
| atomic | atomic | ADJ | amod | xxxx | False |
| bomb | bomb | NOUN | conj | xxxx | False |
| helped | help | VERB | aux | xxxx | False |
| bring | bring | VERB | ROOT | xxxx | False |
| an | an | DET | det | xx | True |
| end | end | NOUN | dobj | xxx | False |
| to | to | ADP | prep | xx | True |
| World | World | PROPN | compound | Xxxxx | False |
| War | War | PROPN | compound | Xxx | False |
| II | II | PROPN | pobj | XX | False |
| . | . | PUNCT | punct | . | False |
| ... | ... | ... | ... | ... | ... |
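These attributes make it easy to post-process retrieved passages; for instance, a minimal sketch that keeps only the lemmas of content words (dropping stopwords and punctuation):

```python
# Collect lemmas of non-stopword, non-punctuation tokens
content_lemmas = [t.lemma_ for t in doc if not t.is_stop and not t.is_punct]
print(content_lemmas)
```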
spaCy can recognize various types of named entities in a document:
```python
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```
The following table shows recognized entities:
| TEXT | START | END | LABEL | DESCRIPTION |
|---|---|---|---|---|
| The Manhattan Project | 0 | 21 | ORG | Companies, agencies, institutions, etc. |
| World War II | 65 | 77 | EVENT | Named hurricanes, battles, wars, sports events, etc. |
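The labels can be filtered or tallied like any other attribute; as a small sketch, counting how often each entity type occurs in the passage:

```python
from collections import Counter

# Tally entity labels in the document
label_counts = Counter(ent.label_ for ent in doc.ents)
print(label_counts)  # e.g., Counter({'ORG': 1, 'EVENT': 1})
```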
`Doc` also contains segmented sentences as `Span` objects; we can iterate over them:
```python
for sent in doc.sents:
    print(sent.text)
```
Then we have the sentences:

| # | SENTENCE |
|---|---|
| 0 | The Manhattan Project and its atomic bomb helped bring an end to World War II. |
| 1 | Its legacy of peaceful uses of atomic energy continues to have an impact on history and science. |
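Putting the pieces together, here is a short sketch (the query, `k`, and the docids it returns are illustrative) that retrieves passages for a query and segments each one into sentences:

```python
# Retrieve a few passages and print each one sentence by sentence
for hit in searcher.search('manhattan project', k=3):
    passage = json.loads(searcher.doc(hit.docid).raw())['contents']
    for sent in nlp(passage).sents:
        print(hit.docid, '|', sent.text)
```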