---
title: "R2"
subtitle: "Predicting Diabetes in Pima Indian Population"
author: "Meghana Tatineni"
date: "02/19/2019"
output:
  prettydoc::html_pretty:
    theme: tactile
---
### Introduction to Machine Learning in R
```{r message=FALSE, warning=FALSE}
#install.packages(c("dplyr","ggplot2","caret","reshape","kernlab"))
#Data manipulation
library(dplyr)
#Visualization
library(ggplot2)
#Machine learning
library(caret)
#Data reshaping (melt)
library(reshape)
#Kernel-based machine learning (used by caret's svmLinear)
library(kernlab)
```
The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
The dataset is restricted to females at least 21 years old of Pima Indian heritage.
```{r}
#Read in the data (set your working directory to the folder containing diabetes.csv first)
diab <- read.csv("diabetes.csv")
```
# Exploratory Data Analysis
After loading our data, we perform exploratory data analysis to get a sense of what we are working with and how much data cleaning and wrangling we have to do.
Usually about 80% of data science work is preparing the data for analysis and 20% is modeling the data.
Thankfully, this dataset is already fairly clean, so we won't be doing much data prep. That will rarely be the case when working on a real data science project.
```{r explore}
#Look at the data
head(diab)
str(diab)
#Change Outcome to a factor
diab$Outcome <- as.factor(diab$Outcome)
#Count missing (NA) values
sum(is.na(diab))
```
There are no NA values in this dataset, so we must have no missing data. Wrong! Looking at the first couple of rows, there are 0 values for SkinThickness, Insulin, and Glucose, which does not make sense physiologically.
Let's count the 0 values in each column to get an idea of how much of our data is actually missing.
```{r}
#Number of 0 values in each column
colSums(diab == 0)
#Proportion of 0 values (the dataset has 768 rows)
colSums(diab == 0) / nrow(diab)
```
# Cleaning Data
Since over 40% of the Insulin values and about 30% of the SkinThickness values are missing, we will remove these variables from our dataset.
We will also replace the remaining 0 values with NA.
```{r}
#Remove Insulin and SkinThickness, then replace 0 with NA in the remaining predictors
#(mutate(across(...)) is the current dplyr idiom; mutate_each()/funs() are deprecated)
diab <- diab %>%
  select(-Insulin, -SkinThickness) %>%
  mutate(across(-c(Outcome, Pregnancies), ~ replace(.x, .x == 0, NA)))
```
## Imputation with the mean
To deal with the missing values, we will use imputation. Imputation simply replaces missing values with substituted values; here we replace each missing value with the mean of its feature.
Read more here: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
```{r}
#Replace missing values with the column mean
#(columns 1-5 are the numeric predictors that can contain NAs after the previous step)
for(i in 1:5) {
  diab[is.na(diab[,i]), i] <- mean(diab[,i], na.rm = TRUE)
}
```
# Let's do some Visualization!
Now that we have fixed the missing values, let's graph the distributions of all variables for diabetics and non-diabetics.
```{r}
#Distribution for All Variables
library(ggplot2)
library(reshape)
diab_melt<-melt(diab,id.vars = "Outcome")
ggplot(diab_melt, aes(value,fill=factor(Outcome)))+
facet_wrap(~variable, scales="free") +
geom_density()+
scale_fill_manual(values=c("green", "red")) +
labs(title="Distribution of Variables for Diabetics and Non Diabetics")
```
```{r}
#Box Plot
ggplot(diab, aes(x=Outcome, y=DiabetesPedigreeFunction,color=Outcome))+
geom_boxplot()+
theme_bw()+
scale_colour_brewer(palette = "Set2",name = "Diabetes")+
labs(title="Box Plots of Diabetes Pedigree Function")
```
# Let's Start Modeling
Before we start modeling, we need to split our data into a training set and a testing set. The purpose of this is to prevent overfitting.
Here we randomly split our data into a 75% training set and a 25% testing set. We use the training set to train each model and the testing set to estimate its accuracy.
```{r}
#Simple train/test split (set a seed so the random partition is reproducible)
library(caret)
set.seed(123)
train.rows <- createDataPartition(y = diab$Outcome, p = 0.75, list = FALSE)
train <- diab[train.rows, ]
test  <- diab[-train.rows, ]
```
We can finally start modeling our data using statistical methods and machine learning!
We will compare three different classification techniques using supervised machine learning algorithms. Supervised learning means we already know the outcome for our training data; for this data, we already know whether each person is diabetic or not.
We are using three machine learning models:

1. Logistic Regression
2. K-Nearest Neighbors
3. Support Vector Machine
## Logistic Regression
Logistic regression is a simple statistical model that predicts a binary response (e.g. YES/NO). For this data, we are predicting whether a woman is diabetic or not.
Read more here: https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
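With the three predictors used below (BMI, Glucose, and Pregnancies), the model estimates the probability of diabetes through the logistic (sigmoid) function:

$$
P(\text{Outcome} = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{BMI} + \beta_2\,\text{Glucose} + \beta_3\,\text{Pregnancies})}}
$$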
```{r}
#Fit a logistic regression on three predictors
fit_log <- glm(Outcome ~ BMI + Glucose + Pregnancies, data = train, family = "binomial")
summary(fit_log)
#Predicted probabilities on the test set, converted to classes with a 0.5 cutoff
predict_log <- predict(fit_log, test, type = "response")
predict_log1 <- as.factor(ifelse(predict_log < .5, "0", "1"))
confusionMatrix(data = predict_log1, reference = test$Outcome, positive = "1",
                dnn = c("Algorithm predicted values", "Actual Test Values"))
```
## K-Nearest Neighbors
For each test data point, we look at the K nearest training data points, take the most frequently occurring class among them, and assign that class to the test point.
Read more here: https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7
```{r}
knnFit<-train(Outcome~BMI+Glucose+Pregnancies, data=train, method="knn",
preProcess=c("center","scale"))
predict_data<-predict(knnFit, newdata=test)
confusionMatrix(data = predict_data, reference = test$Outcome, positive = "1",
dnn = c("Algorithm predicted values", "Actual Test Values"))
#plotting different number of neighbors and accuracy
plot(knnFit)
```
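To make the mechanic described above concrete, here is a minimal from-scratch sketch (not part of the original analysis) that classifies a single test observation by majority vote among its k nearest training points, using the same three predictors as the caret model. The choice of `k = 5` and the use of the first test row are arbitrary for illustration.

```{r}
#Classify one test row by majority vote among its k nearest (scaled) training rows
knn_one <- function(train, test_row, k = 5, vars = c("BMI", "Glucose", "Pregnancies")) {
  #Scale the predictors using the training data so no single feature dominates the distance
  mu   <- sapply(train[, vars], mean)
  sdev <- sapply(train[, vars], sd)
  train_scaled <- scale(train[, vars], center = mu, scale = sdev)
  test_scaled  <- scale(test_row[, vars, drop = FALSE], center = mu, scale = sdev)
  #Euclidean distance from the test point to every training point
  dists <- sqrt(rowSums(sweep(train_scaled, 2, as.numeric(test_scaled))^2))
  #Majority vote among the k closest training points
  neighbors <- train$Outcome[order(dists)[1:k]]
  names(which.max(table(neighbors)))
}
#Predicted class ("0" or "1") for the first test observation
knn_one(train, test[1, ])
```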
## Support Vector Machine
Support vector machines attempt to pass a linearly separable hyperplane through the dataset to classify the data into two groups.
Read more here: https://towardsdatascience.com/support-vector-machines-a-brief-overview-37e018ae310f
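For reference, a linear SVM classifies a point $x$ with the rule

$$
f(x) = \operatorname{sign}(w^\top x + b),
$$

where $w$ and $b$ are chosen so that the margin $2/\lVert w \rVert$ between the two classes is as wide as possible; the cost parameter `C` tuned in the grid search below controls how heavily misclassified points are penalized.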
```{r}
svmfit<-train(Outcome~BMI+Glucose+Pregnancies, data=train, method="svmLinear",
preProcess=c("center","scale"))
predict_data<-predict(svmfit, newdata=test)
confusionMatrix(data = predict_data, reference = test$Outcome,
dnn = c("Algorithm predicted values", "Actual Test Values"))
#Tuning the cost parameter C with a grid search (resampled on the training set)
grid <- expand.grid(C = c(0.01, 0.1, 0.5, 0.75, 1, 1.5, 1.75, 2, 5))
svm_Linear_Grid <- train(Outcome ~ ., data = train, method = "svmLinear",
                         preProcess = c("center", "scale"),
                         tuneGrid = grid)
svm_Linear_Grid
plot(svm_Linear_Grid)
```
To increase our classification rate, we could tune the parameters of each classification algorithm further and try other classification algorithms such as random forests and neural networks.
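As one hedged sketch of that next step (not part of the original analysis), the same caret workflow can be reused with a random forest. `method = "rf"` requires the randomForest package, and the `mtry` grid and seed below are arbitrary illustrative choices; the chunk is shown but not evaluated here.

```{r eval=FALSE}
#Sketch: random forest on the same three predictors, tuning mtry with 5-fold cross-validation
#install.packages("randomForest")
set.seed(123)
rf_grid <- expand.grid(mtry = c(1, 2, 3))
rf_fit <- train(Outcome ~ BMI + Glucose + Pregnancies, data = train, method = "rf",
                tuneGrid = rf_grid,
                trControl = trainControl(method = "cv", number = 5))
predict_rf <- predict(rf_fit, newdata = test)
confusionMatrix(data = predict_rf, reference = test$Outcome, positive = "1",
                dnn = c("Algorithm predicted values", "Actual Test Values"))
```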