Experiment-5.md
'''
Get two texts
Remove stop words
Map the texts to vector space
Compute cosine(vec1, vec2)
Use SciPy
Decide whether or not stemming should be applied
'''
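#Setup note: the NLTK data packages used below may need to be downloaded once
#(resource names can vary slightly between NLTK versions)
import nltk
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('wordnet')                     # dictionary used by the WordNet lemmatizer
nltk.download('omw-1.4')                     # extra WordNet data required by newer NLTK releases
nltk.download('stopwords')                   # English stop word list
nltk.download('averaged_perceptron_tagger')  # tagger used by nltk.pos_tag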
#We shall calculate the similarity between two documents
#These two documents will be produced by normalizing the same sentence in two ways:
#the first will be the result of Porter stemming and the second the result of lemmatization

#The program below uses the Porter stemming algorithm for stemming.
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

document_1 = []
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms. Lilies are pretty."
# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
#Next find the roots of the word
for w in nltk_tokens:
    print("Actual: %s  Stem: %s"  % (w,porter_stemmer.stem(w)))
    document_1.append(porter_stemmer.stem(w))

print('\n')
port = ' '.join(word for word in document_1)
print("After porter stemming we get:",port)
    
#When we execute the above code, it produces the following result
Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room
Actual: .  Stem: .
Actual: Lilies  Stem: lili
Actual: are  Stem: are
Actual: pretty  Stem: pretti
Actual: .  Stem: .


After porter stemming we get: It origin from the idea that there are reader who prefer learn new skill from the comfort of their draw room . lili are pretti .
#Lemmatization is similar to stemming, but it maps each word to its dictionary form (lemma) rather than a truncated stem
document_2 = []
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s"  % (w,wordnet_lemmatizer.lemmatize(w)))
    document_2.append(wordnet_lemmatizer.lemmatize(w))

print('\n')
lem = ' '.join(word for word in document_2)
print("After lemmatization we get:",lem)
        
#it produces the following result:
Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room
Actual: .  Lemma: .
Actual: Lilies  Lemma: Lilies
Actual: are  Lemma: are
Actual: pretty  Lemma: pretty
Actual: .  Lemma: .


After lemmatization we get: It originated from the idea that there are reader who prefer learning new skill from the comfort of their drawing room . Lilies are pretty .
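#A detail worth noting: WordNetLemmatizer treats every token as a noun unless a part-of-speech
#tag is supplied, which is why verb forms such as 'originated' and 'learning' pass through
#unchanged above. A minimal sketch of POS-aware lemmatization:
print(wordnet_lemmatizer.lemmatize('originated'))           # 'originated' (default pos='n')
print(wordnet_lemmatizer.lemmatize('originated', pos='v'))  # 'originate'
print(wordnet_lemmatizer.lemmatize('learning', pos='v'))    # 'learn'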
#Defining and computing cosine similarity
import re, math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # dot product over the shared vocabulary
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # product of the two vector norms
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # bag-of-words vector: token -> count
    words = WORD.findall(text)
    return Counter(words)

vector1 = text_to_vector(port)
vector2 = text_to_vector(lem)

cosine = get_cosine(vector1, vector2)

print('Cosine:', cosine)
Cosine: 0.8333333333333334
#The cosine similarity is only about 0.83, which shows that Porter stemming and lemmatization
#produced noticeably different documents from the same sentence
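#The task brief mentions SciPy; as an optional cross-check (a sketch, assuming SciPy is installed),
#scipy.spatial.distance.cosine returns the cosine *distance* (1 - similarity) between two dense
#vectors, so the Counter vectors are first aligned over a shared vocabulary
from scipy.spatial import distance

vocab = sorted(set(vector1) | set(vector2))
dense1 = [vector1.get(t, 0) for t in vocab]
dense2 = [vector2.get(t, 0) for t in vocab]
print('Cosine via SciPy:', 1 - distance.cosine(dense1, dense2))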

#What if we stem the lemmatized sentence again, this time using the Lancaster stemmer?

document_3 = []
from nltk.stem import LancasterStemmer 
stemmerLan = LancasterStemmer() 
nltk_tokens = nltk.word_tokenize(lem)
for w in nltk_tokens:
    print("Actual: %s  Lancaster: %s"  % (w,stemmerLan.stem(w)))
    document_3.append(stemmerLan.stem(w))
    
print('\n')
lanc = ' '.join(word for word in document_3)
print("After lemmatization and Lancaster stemming we get:",lanc)
Actual: It  Lancaster: it
Actual: originated  Lancaster: origin
Actual: from  Lancaster: from
Actual: the  Lancaster: the
Actual: idea  Lancaster: ide
Actual: that  Lancaster: that
Actual: there  Lancaster: ther
Actual: are  Lancaster: ar
Actual: reader  Lancaster: read
Actual: who  Lancaster: who
Actual: prefer  Lancaster: pref
Actual: learning  Lancaster: learn
Actual: new  Lancaster: new
Actual: skill  Lancaster: skil
Actual: from  Lancaster: from
Actual: the  Lancaster: the
Actual: comfort  Lancaster: comfort
Actual: of  Lancaster: of
Actual: their  Lancaster: their
Actual: drawing  Lancaster: draw
Actual: room  Lancaster: room
Actual: .  Lancaster: .
Actual: Lilies  Lancaster: lily
Actual: are  Lancaster: ar
Actual: pretty  Lancaster: pretty
Actual: .  Lancaster: .


After lemmatization and Lancaster stemming we get: it origin from the ide that ther ar read who pref learn new skil from the comfort of their draw room . lily ar pretty .
#Since applying the Lancaster stemmer to the lemmatized sentence over-stems many words (e.g. 'idea' -> 'ide',
#'there' -> 'ther'), we shall use lemmatization only
#We shall now compare President Obama's inaugural speech in English vs. Globish after lemmatization
#and stop word removal

#Source: http://www.jpn-globish.com/file/obama-speech.pdf

#English
document_1 = """I stand here today humbled by the task before us,
grateful for the trust you have bestowed, mindful of the
sacrifices borne by our ancestors. I thank President
Bush for his service to our nation, as well as the
generosity and cooperation he has shown throughout
this transition.
Forty-four Americans have now taken the presidential
oath. The words have been spoken during rising tides
of prosperity and the still waters of peace. Yet, every
so often the oath is taken amidst gathering clouds and
raging storms. At these moments, America has carried
on not simply because of the skill or vision of those in
high office, but because We the People have remained
faithful to the ideals of our forbearers, and true to our
founding documents.
So it has been. So it must be with this generation of
Americans.
That we are in the midst of crisis is now well
understood. Our nation is at war, against a farreaching
network of violence and hatred. Our
economy is badly weakened, a consequence of greed
and irresponsibility on the part of some, but also our
collective failure to make hard choices and prepare
the nation for a new age.
Homes have been lost; jobs shed; businesses
shuttered. Our health care is too costly; our schools
fail too many; and each day brings further evidence
that the ways we use energy strengthen our
adversaries and threaten our planet.
These are the indicators of crisis, subject to data and
statistics. Less measurable but no less profound is a
sapping of confidence across our land – a nagging
fear that America’s decline is inevitable, and that the
next generation must lower its sights.
Today I say to you that the challenges we face are
real. They are serious and they are many. They will
not be met easily or in a short span of time. But know
this, America – they will be met. On this day, we
gather because we have chosen hope over fear, unity
of purpose over conflict and discord.
On this day, we come to proclaim an end to the petty
grievances and false promises, the recriminations and
worn out dogmas, that for far too long have strangled
our politics. """

#Globish:
# "Jean-Paul Nerrière, the author of Globish presents it as a natural language 
# as opposed to an artificial or constructed language, claiming that it is a 
# codification of a reduced set of English patterns as used by non-native speakers of the language.
# The name Globish is a portmanteau of "global" and "English"." - Wikipedia 
# (https://en.wikipedia.org/wiki/Globish_(Nerri%C3%A8re))


document_2 = """I stand here today full of respect for the work before us.
I want to thank you for the trust you have given, and I
remember the sacrifices made by our ancestors. I thank
President Bush for his service to our nation, as well as
for the spirit of giving and cooperation he has shown
during this change-over.
Forty four Americans have now been sworn in as
president. The words have been spoken during rising
waves of wealth and well-being and the still waters of
peace. Yet, every so often, these words of honor are
spoken surrounded by gathering clouds and wild storms.
At these times, America has carried on not simply
because those in high office were skilled or could see
into the future. But it has been because We the People
have kept believing in the values of our first fathers, and
stayed true to the documents that created our country.
So it has been. So it must be with this modern-day
population of Americans
It is well understood now that we are in the middle of a
crisis. Our nation is at war, against a far-reaching,
organized system of violence and hate. Our economy
has been badly weakened. This is the result of extreme
desire for great wealth by some people, and failure to
act responsibly. But we have all failed to make hard
choices and to get the nation ready for a new age.
Homes have been lost; jobs given up; businesses have
closed. Our health care costs too much; our schools fail
too many; and each day brings further proof that the
ways we use energy make those against us stronger and
threaten our world
These are the ways we can measure a crisis. Another
problem is just as great, but we cannot measure it as
easily. It is the draining of our own belief in America --
a fear that America’s fall is surely coming and that
future Americans must lower their hopes.
Today I say to you that the trials we face are real. They
are serious and they are many. They will not be met
easily or in a short time. But know this, America -- they
will be met. On this day we gather because we have
chosen hope over fear. We have chosen united purpose
over fighting and over noisy argument.
On this day we come to announce an end to narrowminded
arguing, to the lies, and to the accusing and
worn out teachings that for far too long have killed our
politics. """
##Stop word removal
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

import re
# split the speeches into rough segments on punctuation and newlines
document1 = re.split(r'; |, |\*|\n', document_1)
document2 = re.split(r'; |, |\*|\n', document_2)

doc1_tok = [nltk.word_tokenize(w) for w in document1]
flat_list1 = [item for sublist in doc1_tok for item in sublist]

doc2_tok = [nltk.word_tokenize(w) for w in document2]
flat_list2 = [item for sublist in doc2_tok for item in sublist]

filter1 = [w for w in flat_list1 if w not in stop_words]
filter2 = [w for w in flat_list2 if w not in stop_words]
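#Note: stopwords.words('english') is all lower case, so capitalised tokens such as 'I', 'The' and
#'Our' survive the filter above. A variant that lowercases before filtering (not used for the
#results reported below) would remove them as well:
filter1_lower = [w for w in flat_list1 if w.lower() not in stop_words]
filter2_lower = [w for w in flat_list2 if w.lower() not in stop_words]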
##Lemmatization
doc_1 = []
doc_2 = []

filter_1 = ' '.join(word for word in filter1)
filter_2 = ' '.join(word for word in filter2)

#note: stemmerLan (the Lancaster stemmer defined earlier) is applied here rather than the WordNet
#lemmatizer, so the 'after' similarity reported below reflects Lancaster stemming of the filtered tokens
for w in filter1:
    doc_1.append(stemmerLan.stem(w))

lem1 = ' '.join(word for word in doc_1)

for w in filter2:
    doc_2.append(stemmerLan.stem(w))

lem2 = ' '.join(word for word in doc_2)

##Cosine Similarity
vectora = text_to_vector(filter_1)
vectorb = text_to_vector(filter_2)
vector1 = text_to_vector(lem1)
vector2 = text_to_vector(lem2)

cosine1 = get_cosine(vectora, vectorb)
cosine2 = get_cosine(vector1, vector2)

print('Cosine similarity between English and Globish before Lemmatization: ', cosine1)
print('Cosine similarity between English and Globish after Lemmatization:', cosine2)
Cosine similarity between English and Globish before Lemmatization:  0.6689692534580189
Cosine similarity between English and Globish after Lemmatization: 0.6956343155835307
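#For reference, the absolute and relative change quoted in the inference at the end:
print('Absolute increase: {:.4f}'.format(cosine2 - cosine1))        # ~0.0267
print('Relative increase: {:.1%}'.format((cosine2 - cosine1) / cosine1))  # ~4.0%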
#Tagging the words
nltk.pos_tag(filter1)
[('I', 'PRP'),
 ('stand', 'VBP'),
 ('today', 'NN'),
 ('humbled', 'VBD'),
 ('task', 'NN'),
 ('us', 'PRP'),
 (',', ','),
 ('grateful', 'JJ'),
 ('trust', 'NN'),
 ('bestowed', 'VBN'),
 ('mindful', 'JJ'),
 ('sacrifices', 'NNS'),
 ('borne', 'JJ'),
 ('ancestors', 'NNS'),
 ('.', '.'),
 ('I', 'PRP'),
 ('thank', 'VBD'),
 ('President', 'NNP'),
 ('Bush', 'NNP'),
 ('service', 'NN'),
 ('nation', 'NN'),
 ('well', 'RB'),
 ('generosity', 'JJ'),
 ('cooperation', 'NN'),
 ('shown', 'VBN'),
 ('throughout', 'IN'),
 ('transition', 'NN'),
 ('.', '.'),
 ('Forty-four', 'JJ'),
 ('Americans', 'NNPS'),
 ('taken', 'VBN'),
 ('presidential', 'JJ'),
 ('oath', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('words', 'NNS'),
 ('spoken', 'VBN'),
 ('rising', 'VBG'),
 ('tides', 'NNS'),
 ('prosperity', 'NN'),
 ('still', 'RB'),
 ('waters', 'VBZ'),
 ('peace', 'NN'),
 ('.', '.'),
 ('Yet', 'CC'),
 ('every', 'DT'),
 ('often', 'RB'),
 ('oath', 'VBZ'),
 ('taken', 'VBN'),
 ('amidst', 'RP'),
 ('gathering', 'VBG'),
 ('clouds', 'NNS'),
 ('raging', 'VBG'),
 ('storms', 'NNS'),
 ('.', '.'),
 ('At', 'IN'),
 ('moments', 'NNS'),
 ('America', 'NNP'),
 ('carried', 'VBD'),
 ('simply', 'RB'),
 ('skill', 'JJ'),
 ('vision', 'NN'),
 ('high', 'JJ'),
 ('office', 'NN'),
 ('We', 'PRP'),
 ('People', 'VBP'),
 ('remained', 'JJ'),
 ('faithful', 'JJ'),
 ('ideals', 'NNS'),
 ('forbearers', 'NNS'),
 ('true', 'JJ'),
 ('founding', 'JJ'),
 ('documents', 'NNS'),
 ('.', '.'),
 ('So', 'RB'),
 ('.', '.'),
 ('So', 'CC'),
 ('must', 'MD'),
 ('generation', 'NN'),
 ('Americans', 'NNPS'),
 ('.', '.'),
 ('That', 'DT'),
 ('midst', 'NN'),
 ('crisis', 'NN'),
 ('well', 'RB'),
 ('understood', 'RB'),
 ('.', '.'),
 ('Our', 'PRP$'),
 ('nation', 'NN'),
 ('war', 'NN'),
 ('farreaching', 'NN'),
 ('network', 'NN'),
 ('violence', 'NN'),
 ('hatred', 'VBD'),
 ('.', '.'),
 ('Our', 'PRP$'),
 ('economy', 'NN'),
 ('badly', 'RB'),
 ('weakened', 'VBD'),
 ('consequence', 'NN'),
 ('greed', 'NN'),
 ('irresponsibility', 'NN'),
 ('part', 'NN'),
 ('also', 'RB'),
 ('collective', 'JJ'),
 ('failure', 'NN'),
 ('make', 'VBP'),
 ('hard', 'JJ'),
 ('choices', 'NNS'),
 ('prepare', 'VB'),
 ('nation', 'NN'),
 ('new', 'JJ'),
 ('age', 'NN'),
 ('.', '.'),
 ('Homes', 'VBZ'),
 ('lost', 'VBN'),
 ('jobs', 'NNS'),
 ('shed', 'VBN'),
 ('businesses', 'NNS'),
 ('shuttered', 'VBD'),
 ('.', '.'),
 ('Our', 'PRP$'),
 ('health', 'NN'),
 ('care', 'NN'),
 ('costly', 'JJ'),
 ('schools', 'NNS'),
 ('fail', 'VBP'),
 ('many', 'JJ'),
 ('day', 'NN'),
 ('brings', 'VBZ'),
 ('evidence', 'NN'),
 ('ways', 'NNS'),
 ('use', 'VBP'),
 ('energy', 'NN'),
 ('strengthen', 'NN'),
 ('adversaries', 'NNS'),
 ('threaten', 'VBP'),
 ('planet', 'NN'),
 ('.', '.'),
 ('These', 'DT'),
 ('indicators', 'NNS'),
 ('crisis', 'NN'),
 ('subject', 'JJ'),
 ('data', 'NN'),
 ('statistics', 'NNS'),
 ('.', '.'),
 ('Less', 'RBR'),
 ('measurable', 'JJ'),
 ('less', 'RBR'),
 ('profound', 'JJ'),
 ('sapping', 'VBG'),
 ('confidence', 'NN'),
 ('across', 'IN'),
 ('land', 'NN'),
 ('–', 'NN'),
 ('nagging', 'VBG'),
 ('fear', 'NN'),
 ('America', 'NNP'),
 ('’', 'NNP'),
 ('decline', 'NN'),
 ('inevitable', 'JJ'),
 ('next', 'JJ'),
 ('generation', 'NN'),
 ('must', 'MD'),
 ('lower', 'VB'),
 ('sights', 'NNS'),
 ('.', '.'),
 ('Today', 'NN'),
 ('I', 'PRP'),
 ('say', 'VBP'),
 ('challenges', 'NNS'),
 ('face', 'VBP'),
 ('real', 'JJ'),
 ('.', '.'),
 ('They', 'PRP'),
 ('serious', 'JJ'),
 ('many', 'JJ'),
 ('.', '.'),
 ('They', 'PRP'),
 ('met', 'VBD'),
 ('easily', 'RB'),
 ('short', 'JJ'),
 ('span', 'NN'),
 ('time', 'NN'),
 ('.', '.'),
 ('But', 'CC'),
 ('know', 'VBP'),
 ('America', 'NNP'),
 ('–', 'NNP'),
 ('met', 'VBD'),
 ('.', '.'),
 ('On', 'IN'),
 ('day', 'NN'),
 ('gather', 'CC'),
 ('chosen', 'VBN'),
 ('hope', 'VBP'),
 ('fear', 'JJ'),
 ('unity', 'NN'),
 ('purpose', 'VBP'),
 ('conflict', 'NN'),
 ('discord', 'NN'),
 ('.', '.'),
 ('On', 'IN'),
 ('day', 'NN'),
 ('come', 'VB'),
 ('proclaim', 'JJ'),
 ('end', 'NN'),
 ('petty', 'NN'),
 ('grievances', 'NNS'),
 ('false', 'JJ'),
 ('promises', 'NNS'),
 ('recriminations', 'NNS'),
 ('worn', 'VBP'),
 ('dogmas', 'RB'),
 ('far', 'RB'),
 ('long', 'RB'),
 ('strangled', 'JJ'),
 ('politics', 'NNS'),
 ('.', '.')]
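#The tags above are printed but not used further; one possible extension (a sketch, using a
#hypothetical helper penn_to_wordnet) is to map the Penn Treebank tags to WordNet POS classes and
#feed them to the lemmatizer, so that verbs and adjectives are lemmatized as well
from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    #hypothetical helper: map a Penn Treebank tag prefix to a WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tagged1 = nltk.pos_tag(filter1)
pos_lemmas = [wordnet_lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in tagged1]
print(' '.join(pos_lemmas))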
'''
Inference:

Lemmatization has the following advantages over stemming:
1. It preserves the grammatical form of the word: for example, 'originated' lemmatizes to 'originated',
   which is still a past participle, so the syntactic class of the word is preserved.
2. It takes the morphological analysis of the words into consideration. To do so, it needs detailed
   dictionaries which the algorithm can look through to link a surface form back to its lemma. Another
   important difference is that a lemma is the base form of all of a word's inflectional forms, whereas
   a stem need not be. This is why regular dictionaries are lists of lemmas, not stems.

The cosine similarity after lemmatization increased by about 0.03, on a scale whose maximum is 1. Although
this is a small increase in absolute terms, it amounts to roughly a 4% relative increase. This provides
evidence that by preserving the lemma of each word we can better identify similarity between two web pages.
This could be the first stage in developing a plagiarism detection tool for academic purposes, or in helping
search engines eliminate duplicate pages from their results. It could be argued that plagiarism tools would
prefer a stemmer such as Porter, but such stemmers produce spurious forms such as stemming 'Lilies' to 'lili',
which would likely inflate the apparent similarity between two documents. Another extension would be to use
MinHash rather than cosine similarity to estimate how similar two documents are.

'''
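#The inference mentions MinHash as a possible extension; below is a rough, self-contained sketch of
#estimating the Jaccard similarity of the two processed speeches (the helper names minhash_signature
#and estimated_jaccard are illustrative, not from any particular library)
import hashlib

def minhash_signature(tokens, num_hashes=64):
    #for each seed, keep the minimum hash value observed over the document's token set
    token_set = set(tokens)
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in token_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    #the fraction of matching signature positions estimates the Jaccard similarity of the token sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature(lem1.split())
sig2 = minhash_signature(lem2.split())
print('Estimated Jaccard similarity:', estimated_jaccard(sig1, sig2))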