jigyasak05/Text-Classifier

Text-Classifier

Implementation of the Naive Bayes algorithm on the 20_newsgroups dataset to classify text files consisting of emails into 20 different categories. The implementation involves:

- Splitting the data into training and testing sets, and building a dataframe that stores the path of each file and its corresponding class (category).
- Traversing all files in the training data and selecting some of the highest-frequency words, excluding stop words, as features.
- Building a dataframe that stores, for every file in the training data, the frequency of each selected feature word together with the file's class value. A similar dataframe is built for the testing data.
- In the fit function, building a 2-level dictionary called 'counts' whose top-level keys are the classes (categories). Under each class is another dictionary whose keys are the selected features. All files in the dataframe belonging to a particular category are selected, and the summed frequencies of the corresponding features are stored in 'counts' under that category. A total frequency count is also stored for each class.
- For making predictions, each tuple of the testing dataframe (each tuple denotes one testing file) is passed to the predict_single function, which calculates its probability for each class using Naive Bayes and predicts the class with the maximum probability.

Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors: a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes' Theorem states that P(A|B) = P(B|A) * P(A) / P(B).
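The feature selection and the 2-level 'counts' dictionary described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the stop-word list is a tiny sample, and the function names, the parameter `k`, and the `"__total__"` key are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Illustrative stop-word sample; a real run would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def select_features(train_docs, k=2000):
    """Pick the k highest-frequency non-stop-words across all training files."""
    freq = Counter()
    for text in train_docs:
        freq.update(w for w in text.lower().split() if w not in STOP_WORDS)
    return [w for w, _ in freq.most_common(k)]

def fit(train_docs, labels, features):
    """Build counts[category][feature] = summed frequency over that category's files."""
    counts = {}
    for text, category in zip(train_docs, labels):
        cat = counts.setdefault(category, defaultdict(int))
        words = Counter(text.lower().split())
        for f in features:
            cat[f] += words[f]
        # Also keep a total feature count per class ("__total__" is an assumed key name).
        cat["__total__"] += sum(words[f] for f in features)
    return counts
```

In the actual project these frequencies come from the training dataframe; here they are computed directly from raw strings to keep the sketch self-contained.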

Here, P(A) = probability of the class/category,

P(B) = probability of the testing document,

P(B|A) = probability of the document given that it belongs to the particular class. To calculate P(document | category), for every feature (the top-frequency words selected as features) we find the probability of that word occurring in the document, given that the document belongs to the particular category. All the log probabilities are then added.
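In log form, the per-category score described above works out to (using `count(w, d)` for the frequency of feature word `w` in test document `d`):

```latex
\log P(c \mid d) \;\propto\; \log P(c) \;+\; \sum_{w \in \text{features}} \mathrm{count}(w, d)\,\log P(w \mid c)
```

Adding log probabilities rather than multiplying raw probabilities avoids numerical underflow when many features are involved.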

Laplace correction (add-one smoothing) is also applied, and the class with the maximum probability is predicted.
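A sketch of the predict_single step with Laplace correction, assuming the 2-level `counts` dictionary layout described earlier (per-class feature frequencies plus a total count under an assumed `"__total__"` key) and a hypothetical `class_priors` mapping:

```python
import math

def predict_single(doc_counts, counts, features, class_priors):
    """Score each category with log P(c) + sum count(w,d) * log P(w|c); return the best.

    doc_counts: {word: frequency} for one testing file.
    counts: {category: {feature: summed frequency, "__total__": total}} from training.
    """
    best_class, best_score = None, float("-inf")
    for category, cat_counts in counts.items():
        score = math.log(class_priors[category])
        # Laplace correction: add 1 per feature, so the denominator grows by |features|.
        denom = cat_counts["__total__"] + len(features)
        for w in features:
            p_w_given_c = (cat_counts.get(w, 0) + 1) / denom
            score += doc_counts.get(w, 0) * math.log(p_w_given_c)
        if score > best_score:
            best_class, best_score = category, score
    return best_class
```

The add-one correction keeps P(w|c) strictly positive, so a feature word never seen in a category's training files cannot zero out (or, in log space, send to negative infinity) that category's score.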
