forked from rdpeng/RepData_PeerAssessment1
---
title: "Reproducible Research: Peer Assessment 1"
author: "Andrew E. Davidson"
date: "Sep 15, 2015"
output:
  html_document:
    keep_md: true
---
```{r setup, echo=FALSE}
options(scipen=999)
```
## The data
The variables included in this dataset are:

- steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
- date: The date on which the measurement was taken, in YYYY-MM-DD format
- interval: Identifier for the 5-minute interval in which the measurement was taken

## Loading and preprocessing the data
1. Load the data (i.e. read.csv())
The GitHub repo we forked from contains the data file activity.zip.
The following code automatically unzips the data file into a subdirectory
if needed. In general, to save space, we do not want to store the unzipped version
of the data in git.
```{r readData}
dataDir <- "./data/"
dataFile <- sprintf("%s%s", dataDir, "activity.csv")
if (!file.exists(dataDir)) { dir.create(dataDir) }
if (!file.exists(dataFile)) {
  # unzip only once, and record when the data was extracted
  unzip("activity.zip", exdir=dataDir)
  dateDownLoaded <- date()
  write(dateDownLoaded, file=sprintf("%s%s", dataDir, "dateDownLoaded.txt"))
}
data <- read.csv(dataFile, header=TRUE)
```
2. Process/transform the data (if necessary) into a format suitable for your analysis
By default read.csv reads the date column as strings (as factors in R
versions before 4.0.0). We use as.Date() to convert the column to type Date.
```{r preProcess data}
data[[2]] <- as.Date(data[[2]])
```
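As an alternative sketch, the conversion can happen while reading by passing `colClasses` to read.csv. The CSV text below is made-up toy data standing in for activity.csv, and the `toy*` names are hypothetical.

```r
# Toy CSV text standing in for activity.csv (values are made up).
toyCsv <- "steps,date,interval
NA,2012-10-01,0
5,2012-10-02,5"

# colClasses asks read.csv to parse the second column as Date while reading,
# so no separate as.Date() step is needed afterwards.
toy <- read.csv(text = toyCsv, header = TRUE,
                colClasses = c("integer", "Date", "integer"))
class(toy$date)
```

This only works directly when the dates are in the default YYYY-MM-DD form, which is the format this dataset uses.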
Check whether any of the fields have missing values (for this part of the assignment we can ignore the missing values in the dataset). Also check whether the data is ordered by date.
```{r exploreData}
sum(is.na(data$steps))
sum(is.na(data$date))
sum(is.na(data$interval))
is.unsorted(data$date)
```
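The same checks can be written more compactly. This is a minimal sketch on made-up toy data (the `toyCheck` data frame is hypothetical): `colSums(is.na(...))` counts the missing values in every column at once, and `is.unsorted()` returns FALSE when the dates are already in order.

```r
# Toy data frame with the same three columns as the activity data (values made up).
toyCheck <- data.frame(
  steps = c(NA, 3, NA, 7),
  date = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 0, 5)
)

# One named count of NAs per column.
naPerColumn <- colSums(is.na(toyCheck))
naPerColumn

# TRUE when the rows are already ordered by date.
isSortedByDate <- !is.unsorted(toyCheck$date)
isSortedByDate
```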
Explore the interval data
```{r explore data understanding intervals}
numIntervalsInDay <- (24 * 60) / 5
numUniqueIntervalsInData <- length(unique(data$interval))
rangeOfInterval <- range(data$interval)
```
There are `r numIntervalsInDay` 5-minute intervals in any given day. Our data
contains `r numUniqueIntervalsInData` different intervals. The interval values
range from `r rangeOfInterval[1]` to `r rangeOfInterval[2]`. From visual
inspection, the values appear to be clock times rather than a running count:
for example, 2355 is 24-hour ("military") time for 11:55 pm. We should
therefore be able to treat the interval data like a factor.
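
If the interval values really are clock times, they can be decoded as follows. This is a sketch under that assumption, using a few hand-picked interval values; the variable names are hypothetical.

```r
# A few interval identifiers, read as HHMM clock times (assumption).
toyIntervals <- c(0, 5, 55, 100, 835, 2355)

# Pad to four digits, then split into hours and minutes.
padded <- sprintf("%04d", as.integer(toyIntervals))
clockTime <- paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
clockTime

# Minutes since midnight, which gives an evenly spaced numeric axis.
minutesSinceMidnight <- (toyIntervals %/% 100) * 60 + toyIntervals %% 100
minutesSinceMidnight
```

Note that the raw identifiers jump from 55 to 100 at each hour boundary, so plotting them as plain numbers leaves small gaps on the x-axis. Minutes since midnight avoids that.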
## What is mean total number of steps taken per day?
1. Make a histogram of the total number of steps taken each day
```{r total number of steps taken histogram}
stepsPerDay <- aggregate(steps ~ date, data=data, FUN=sum)
library(ggplot2)
qplot(stepsPerDay$steps, geom="histogram") + ggtitle("Steps taken per day")
```
2. Calculate and report the mean and median total number of steps taken per day
```{r calcuate average number of steps}
averageStepsPerDay <- round(mean(stepsPerDay$steps), digits=2)
medianStepsPerDay <- median(stepsPerDay$steps)
```
The mean number of steps taken each day is `r averageStepsPerDay`.
The median number of steps taken each day is `r medianStepsPerDay`
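
One detail worth keeping in mind: the formula interface of aggregate() drops rows with NA before summing (its default `na.action` is `na.omit`). Days that have no recorded steps at all therefore do not appear in the per-day totals, rather than appearing with a total of 0. A small sketch with made-up values:

```r
# Toy data: the first day has no recorded steps at all (values made up).
toySteps <- data.frame(
  steps = c(NA, NA, 10, 20, 30, NA),
  date = as.Date(c("2012-10-01", "2012-10-01",
                   "2012-10-02", "2012-10-02",
                   "2012-10-03", "2012-10-03"))
)

# Rows with NA steps are removed before FUN is applied.
toyTotals <- aggregate(steps ~ date, data = toySteps, FUN = sum)
toyTotals
```

Only two of the three days survive here, so the mean and median are computed over days with at least one measurement.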
## What is the average daily activity pattern?
1. Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
```{r activity plot}
averageStepsByInterval <- aggregate(steps ~ interval, data=data, FUN=mean)
p <- ggplot(data=averageStepsByInterval, aes(x=interval, y=steps, group=1))
p + geom_line() + ggtitle("average daily activity pattern")
```
2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r find average number of steps}
m <- max(averageStepsByInterval$steps)
r <- averageStepsByInterval[averageStepsByInterval$steps==m,]
intervalWithMostNumSteps <- r$interval
```
The interval with the highest average number of steps is `r intervalWithMostNumSteps`.
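
The same lookup can be sketched with `which.max()`, which returns the position of the first maximum. The values below are made up, and the names are hypothetical.

```r
# Toy per-interval averages (values made up).
toyAverages <- data.frame(interval = c(830, 835, 840),
                          steps = c(10, 42, 17))

# which.max() gives the row index of the largest average.
busiestInterval <- toyAverages$interval[which.max(toyAverages$steps)]
busiestInterval
```

Unlike the `==` comparison, this always yields a single interval, even if two intervals tie for the maximum.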
## Imputing missing values
1. Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
```{r fix missing data}
numMissingSteps <- sum(is.na(data$steps))
numMissingDates <- sum(is.na(data$date))
numMissingInterval <- sum(is.na(data$interval))
#data[is.na(data)] <- 0
```
The number of rows missing step data is `r numMissingSteps`
The number of rows missing date data is `r numMissingDates`
The number of rows missing interval data is `r numMissingInterval`
2. Devise a strategy for filling in all of the missing values in the dataset.
The strategy does not need to be sophisticated. For example, you could use the
mean/median for that day, or the mean for that 5-minute interval, etc.
My strategy is to loop over all rows in the data frame. If the number of steps
is NA, assign the average number of steps for that interval, rounded to the
nearest whole step.
3. Create a new dataset that is equal to the original dataset but with the
missing data filled in.
```{r fixNA}
fixedData <- read.csv(dataFile, header=TRUE)
fixedData[[2]] <- as.Date(fixedData[[2]])
n <- nrow(fixedData)
for (i in seq_len(n)) {
  if (is.na(fixedData[i,1])) {
    interval <- fixedData[i,3]
    x <- averageStepsByInterval[averageStepsByInterval$interval == interval,]$steps
    # we want to round, steps are ints not reals
    r <- round(x)
    fixedData[i,1] <- r
    #cat('interval:', interval, " x:", x, " r:", r, " new:", fixedData[i,1], "\n")
  }
}
sum(is.na(fixedData$steps))
sum(is.na(fixedData$date))
sum(is.na(fixedData$interval))
```
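The same strategy can be written without a loop. This is a minimal sketch on made-up toy data (all `toy*` names are hypothetical): `match()` finds, for every missing row, the position of its interval in the table of interval means.

```r
# Toy data with two intervals and two missing step counts (values made up).
toyRaw <- data.frame(
  steps = c(NA, 4, 2, NA, 6, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean steps per interval, computed from the non-missing rows:
# interval 0 -> mean(2, 6) = 4, interval 5 -> mean(4, 8) = 6.
toyMeans <- aggregate(steps ~ interval, data = toyRaw, FUN = mean)

# Fill every NA with the rounded mean of its interval in one vectorized step.
toyFilled <- toyRaw
isMissing <- is.na(toyFilled$steps)
lookup <- match(toyFilled$interval[isMissing], toyMeans$interval)
toyFilled$steps[isMissing] <- round(toyMeans$steps[lookup])
toyFilled
```

On a data frame of this assignment's size the loop is fast enough, so the vectorized form is mainly a matter of brevity.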
4. Make a histogram of the total number of steps taken each day and Calculate
and report the mean and median total number of steps taken per day. Do these
values differ from the estimates from the first part of the assignment?
What is the impact of imputing missing data on the estimates of the total daily
number of steps?
```{r histogram and summary}
stepsPerDay2 <- aggregate(steps ~ date, data=fixedData, FUN=sum)
library(ggplot2)
qplot(stepsPerDay2$steps, geom="histogram") + ggtitle("Steps per day")
averageStepsPerDay2 <- round(mean(stepsPerDay2$steps), digits=2)
medianStepsPerDay2 <- median(stepsPerDay2$steps)
```
The original mean number of steps taken each day is `r averageStepsPerDay`.
The original median number of steps taken each day is `r medianStepsPerDay`.

After replacing missing step data with the average number of steps for the
interval, the mean number of steps taken each day is `r averageStepsPerDay2`
and the median is `r medianStepsPerDay2`.

So filling in the missing data had only a small overall effect on the mean and median.
## Are there differences in activity patterns between weekdays and weekends?
1. Create a new factor variable in the dataset with two levels – “weekday” and
“weekend” indicating whether a given date is a weekday or weekend day.
```{r create weekday factor}
w <- weekdays(fixedData$date)
weekEnds <- w == "Saturday" | w == "Sunday"
#weekDays <- !weekEnds
dayType <- ifelse(weekEnds, "weekEnd", "weekDay")
fixedData$dayType <- as.factor(dayType)
```
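weekdays() returns day names in the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch, using a few sample dates and hypothetical variable names, relies on the numeric day of the week instead:

```r
# Friday through Monday in the first week of October 2012.
toyDates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the day of the week as a number: 0 is Sunday, 6 is Saturday.
dayNumber <- as.POSIXlt(toyDates)$wday

toyDayType <- factor(ifelse(dayNumber %in% c(0, 6), "weekend", "weekday"),
                     levels = c("weekday", "weekend"))
toyDayType
```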
2. Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
```{r graph weekday and weekend activity}
weekDayData <- fixedData[fixedData$dayType == "weekDay",]
averageStepsByIntervalWeekDay <- aggregate(steps ~ interval, data=weekDayData, FUN=mean)
p1 <- ggplot(data=averageStepsByIntervalWeekDay, aes(x=interval, y=steps, group=1))
p1 <- p1 + geom_line() + ggtitle("weekday")
weekEndData <- fixedData[fixedData$dayType == "weekEnd",]
averageStepsByIntervalWeekEnd <- aggregate(steps ~ interval, data=weekEndData, FUN=mean)
p2 <- ggplot(data=averageStepsByIntervalWeekEnd, aes(x=interval, y=steps, group=1))
p2 <- p2 + geom_line() + ggtitle("weekend")
library(grid)
pushViewport(viewport(layout = grid.layout(2, 1)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
```
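A panel plot can also be sketched with ggplot2 facets instead of grid viewports. The data below is made up; with the real data, the input would presumably come from something like `aggregate(steps ~ interval + dayType, data=fixedData, FUN=mean)`.

```r
library(ggplot2)

# Toy averages for three intervals on each day type (values made up).
toyPanel <- data.frame(
  interval = rep(c(0, 5, 10), times = 2),
  steps = c(1, 4, 2, 3, 6, 5),
  dayType = rep(c("weekday", "weekend"), each = 3)
)

# facet_wrap() draws one panel per day type, stacked in a single column.
panelPlot <- ggplot(toyPanel, aes(x = interval, y = steps)) +
  geom_line() +
  facet_wrap(~ dayType, ncol = 1) +
  ggtitle("Average steps per interval by day type")
print(panelPlot)
```

By default the facets share the same y-axis scale, which makes the weekday and weekend curves directly comparable.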
Yes, the activity levels appear to be different. It looks like the subjects are
more active during the weekend.