Nowadays we usually shop on Amazon based on other users' reviews, so reviews matter a lot. Based on those reviews, I have developed a system that tells whether a particular review is positive or not.
Python 3.x
- Download the reviews for as many categories as you want from here and put all the `json.gz` files in the `data` folder.
- Run the `data_preprocessing.ipynb` file, which takes its input from the `data` folder (containing all the gz files) and outputs the cleaned data, which we call `data.csv`.
- The `model_generation_loading.ipynb` notebook takes `data.csv` as input and produces the machine learning model, which we call `count_tf_gs.joblib`. Later on, this model can be used to predict the sentiment of any new review.
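As a rough illustration of the first step, the sketch below reads every `.json.gz` file in a folder into one DataFrame. It is not the code from the notebook: it assumes each file holds one JSON object per line, the helper names `read_reviews` and `load_folder` and the `category` column are hypothetical, and the two tiny files it writes to a temporary folder are made-up sample data.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_reviews(path):
    """Read one .json.gz file (assumed: one JSON object per line) into a DataFrame."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return pd.DataFrame(rows)


def load_folder(folder):
    """Combine every .json.gz file in `folder`, tagging rows with the file's category."""
    frames = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".json.gz"):
            frame = read_reviews(os.path.join(folder, name))
            # Hypothetical column: the category is taken from the file name.
            frame["category"] = name[: -len(".json.gz")]
            frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Made-up sample files standing in for the downloaded review data.
data_dir = tempfile.mkdtemp()
samples = {
    "Toys": [
        {"overall": 5, "reviewText": "My kids love it"},
        {"overall": 2, "reviewText": "Broke within a week"},
    ],
    "Books": [
        {"overall": 3, "reviewText": "An okay read"},
    ],
}
for category, rows in samples.items():
    file_path = os.path.join(data_dir, category + ".json.gz")
    with gzip.open(file_path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

combined = load_folder(data_dir)
print(combined)
```

In a real run, `data_dir` would simply be the `data` folder, and the combined DataFrame could be written out with `combined.to_csv(...)`.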
- `preprocess.py` converts the gz files to JSON and builds one DataFrame combining all categories, which we call `final_data.csv`.
- pandas_profiling helps us generate a report covering most details of the DataFrame, with visualizations. See more about it here
- Ratings range from 1 to 5. We treat 1-2 as Negative, 3 as Neutral, and 4-5 as Positive.
- We found that the `pos` category has far more data than the others, so we balance the data.
- The `reviewText` field is raw text, so we apply some text pre-processing techniques to clean it.
- Then we combine all the data in one DataFrame and save it as `data.csv`.
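The labelling, balancing, and cleaning steps above could be sketched as follows. This is a minimal sketch, not the notebook's code: the `overall` rating column, the `neg` and `neu` label names, the downsampling strategy, and the specific cleaning rules (lowercasing and dropping non-letters) are all assumptions, and the sample rows are made up.

```python
import re

import pandas as pd


def rating_to_label(rating):
    """Map a 1-5 rating to a label: 1-2 negative, 3 neutral, 4-5 positive."""
    if rating <= 2:
        return "neg"
    if rating == 3:
        return "neu"
    return "pos"


def clean_text(text):
    """One possible cleaning: lowercase, replace non-letters, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def balance(df, label_col="label", seed=0):
    """Downsample every class to the size of the smallest class."""
    n = df[label_col].value_counts().min()
    parts = [group.sample(n=n, random_state=seed) for _, group in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)


# Made-up sample with more positive reviews than negative or neutral ones.
raw = pd.DataFrame({
    "overall": [5, 4, 5, 1, 3, 4, 2, 3],
    "reviewText": [
        "Great!! Works perfectly.",
        "Very good, would buy again",
        "Love it <3",
        "Terrible. Broke in 2 days",
        "It's okay, nothing special",
        "Nice quality for the price",
        "Not worth the money...",
        "Average product",
    ],
})

raw["label"] = raw["overall"].apply(rating_to_label)
raw["reviewText"] = raw["reviewText"].apply(clean_text)
data = balance(raw)
print(data["label"].value_counts())
```

Downsampling is only one way to balance the classes; oversampling the smaller classes or using class weights in the model are alternatives.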
- Take the generated `data.csv` and split it into train and test sets with the sklearn library. Here I have put 25% in the test set.
- In this step, we build a pipeline from `sklearn.pipeline` for all further data processing and prediction.
  - We convert the text data into a `CountVector` (a matrix of token counts). See more about it here.
  - Then we transform the count matrix into a normalized tf-idf representation.
  - Then the ML model LinearSVC (Linear Support Vector Classifier) is applied. As our data distribution is linear, a linear SVM suits us well.
- Save the model with the joblib library, so that in the future we can load the model directly and get our results easily.
- We generate a classification report to understand our model better and plot a pie chart for precision (positive predictive value). We have an accuracy of around 76%, which is reasonably good for this amount of data.
- Load the model for future use and prediction with new reviews.
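The training steps above can be sketched end to end as below. This is an illustrative sketch rather than the notebook's code: the toy reviews and labels are made up, the `neg` and `neu` label names and the step names inside the pipeline are assumptions, all estimators use their default settings, and the model is saved to a temporary folder instead of the project directory.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy data standing in for the cleaned reviews in data.csv.
texts = [
    "great product works perfectly love it",
    "excellent quality very happy with this purchase",
    "amazing value highly recommend",
    "love it best purchase this year",
    "terrible quality broke after one day",
    "awful product waste of money",
    "very bad do not buy",
    "horrible experience stopped working quickly",
    "it is okay nothing special",
    "average product does the job",
    "fine for the price but not great",
    "decent but could be better",
]
labels = ["pos"] * 4 + ["neg"] * 4 + ["neu"] * 4

# Hold out 25% of the data for testing, keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Token counts -> tf-idf weights -> linear SVM, chained in one pipeline.
model = Pipeline([
    ("count", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("svc", LinearSVC()),
])
model.fit(X_train, y_train)

# Per-class precision, recall, and F1 on the held-out reviews.
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)

# Save the fitted pipeline, then load it back and predict on new reviews.
model_path = os.path.join(tempfile.mkdtemp(), "count_tf_gs.joblib")
dump(model, model_path)
loaded = load(model_path)

new_reviews = ["really happy with this item", "stopped working after a week"]
print(loaded.predict(new_reviews))
```

Because the whole pipeline is saved, the loaded model accepts raw review strings directly; the vectorizer and tf-idf steps are applied automatically before the classifier.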
-> Add more categories in order to get more data.
-> Do more text pre-processing on reviewText, such as embeddings, spelling correction, and finding similarities between reviews.
-> Try a deep learning model such as a GRU or LSTM, as they work with sequence data.