forked from rdpeng/RepData_PeerAssessment1
PA1_template.Rmd
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
```{r}
library(ggplot2)
```
## Loading and preprocessing the data
```{r echo=TRUE}
data <- read.csv(unz("activity.zip", "activity.csv"))
str(data)
```
## What is mean total number of steps taken per day?
```{r echo=TRUE}
# Pass the formula unnamed: the name of this argument differs across R versions
dailystepcount <- aggregate(steps ~ date, data = data, FUN = sum, na.rm = TRUE)
```
Mean of Daily step count
```{r echo=TRUE}
mean(dailystepcount$steps)
```
Median of Daily step count
```{r echo=TRUE}
median(dailystepcount$steps)
```
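As a small illustration of how the formula interface of `aggregate()` treats missing values, the sketch below uses a hypothetical toy data frame (`toy`, `toy_daily` and `toy_all` are illustrative names, not part of the activity data). With the default `na.action = na.omit`, rows containing `NA` are dropped before summing, so a day with no recorded steps is absent from the result instead of appearing with a total of 0.

```r
# Hypothetical toy data: day "2012-10-01" has no recorded steps at all
toy <- data.frame(
  date  = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  steps = c(NA, NA, 10, 20)
)

# Formula interface: rows with NA are removed first (na.action = na.omit),
# so only the day with recorded steps survives
toy_daily <- aggregate(steps ~ date, data = toy, FUN = sum)
print(toy_daily)

# For comparison, tapply() keeps the all-NA day and reports a total of 0
toy_all <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
print(toy_all)
```

This is why the mean and median above are computed only over days that have at least one recorded value.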
### Histogram of daily step counts
```{r}
g = ggplot(dailystepcount, aes(x=steps))
g = g + geom_histogram(binwidth=5000, color = "white", fill = "grey")
print(g)
```
## What is the average daily activity pattern?
```{r, echo=TRUE}
time_series <- aggregate(steps ~ interval, data = data, FUN = mean, na.rm = TRUE)
g = ggplot(time_series, aes(x=interval, y = steps)) + geom_line()
g = g + xlab("5-min Interval") + ylab("Average Steps taken") + ggtitle("Average daily steps pattern")
print(g)
```
### Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r echo=TRUE}
time_series[which.max(time_series$steps),]
```
## Imputing missing values
Total number of missing values
```{r, echo=TRUE}
sum(is.na(data$steps))
```
### Impute the missing values
Strategy to impute: replace each missing step count with the overall average for that 5-minute interval. If no average is available for the interval, use 0 as the step count.
```{r, echo=TRUE}
# Filled data set
imputeddata <- data
# Look up each row's interval mean by interval value; the interval labels
# are not row numbers of time_series, so match() is used to find the row
interval_means <- time_series$steps[match(imputeddata$interval, time_series$interval)]
# If there is no mean, there is no data on any day for that interval,
# so take 0 as the step count
interval_means[is.na(interval_means)] <- 0
missing <- is.na(imputeddata$steps)
imputeddata$steps[missing] <- interval_means[missing]
```
### Stats after imputing
```{r echo=TRUE}
dailystepcount_imputed <- aggregate(steps ~ date, data = imputeddata, FUN = sum)
daily_mean_imputed <- mean(dailystepcount_imputed$steps)
daily_median_imputed <- median(dailystepcount_imputed$steps)
```
Mean of Daily step count = `r daily_mean_imputed`

Median of Daily step count = `r daily_median_imputed`
```{r, echo=TRUE}
g = ggplot(dailystepcount_imputed, aes(x=steps))
g = g + geom_histogram(binwidth=5000, color = "white", fill = "grey")
print(g)
```
### Impact of imputing data
1. Counts in some buckets have increased, because days that were previously dropped for lack of data now contribute a total.
2. A day with no recorded steps receives the sum of the interval means as its total, so such days are added near the middle of the histogram, not in the lowest bucket (0-5000).

A prominent increase in the lowest bucket would instead indicate that missing values were being replaced with 0.
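
The imputation strategy stated above (replace each missing value with the overall mean of its interval) can be sketched on a toy data set. Everything here is hypothetical: `toy_imp`, `toy_means` and `toy_filled` are illustrative names, and the three intervals labelled 0, 5 and 10 are made up. The point of the sketch is that an interval label is a value to be matched, not a row position.

```r
# Hypothetical toy data: two days, three intervals labelled 0, 5, 10;
# the first day has no recorded steps
toy_imp <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  steps    = c(NA, NA, NA, 2, 4, 6)
)

# Mean steps per interval; rows with NA are dropped by the formula interface
toy_means <- aggregate(steps ~ interval, data = toy_imp, FUN = mean)

# match() returns the row of toy_means whose interval equals each row's
# interval; using the label itself as a row index would pick the wrong row
idx <- match(toy_imp$interval, toy_means$interval)

toy_filled <- toy_imp
missing <- is.na(toy_filled$steps)
toy_filled$steps[missing] <- toy_means$steps[idx[missing]]
print(toy_filled)
```

In this toy case the fully missing day receives the interval means 2, 4 and 6, so its daily total equals the total of the one observed day.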

## Are there differences in activity patterns between weekdays and weekends?
### Add a column indicating whether each date is a weekday or a weekend
```{r}
imputeddata$date <- as.Date(imputeddata$date)
# Classify each date as "weekend" or "weekday".
# weekdays() returns day names in the current locale; the comparison
# below assumes English day names.
getday <- function(date) {
  ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "weekend", "weekday")
}
imputeddata$day <- getday(imputeddata$date)
head(imputeddata)
```
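Because `weekdays()` returns day names in the current locale, matching on "Saturday" and "Sunday" only works with English day names. A minimal sketch of a locale-independent alternative follows, assuming the platform's date formatting supports the `%u` code (ISO weekday number, Monday is 1). The variables `dates`, `day_number` and `day_type` are hypothetical and not used elsewhere in this report.

```r
# Hypothetical example dates: a Friday, a Saturday and a Sunday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07"))

# "%u" formats the ISO weekday number as text: "1" = Monday ... "7" = Sunday
day_number <- as.integer(format(dates, "%u"))

# Days 6 and 7 are Saturday and Sunday
day_type <- ifelse(day_number >= 6, "weekend", "weekday")
print(day_type)
```

The same expression could be applied to `imputeddata$date` in place of the name-based comparison.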
### Panel plot for weekdays and weekends
```{r}
# Use the mean (not the sum) per interval so the values match the "average" axis label
time_series_withday <- aggregate(steps ~ interval + day, data = imputeddata, FUN = mean, na.rm = TRUE)
g = ggplot(time_series_withday, aes(x=interval, y = steps)) + geom_line()
g = g + xlab("5-min Interval") + ylab("Average number of steps taken") + ggtitle("Difference between weekend and weekday activity")
g = g + facet_grid(day~.)
print(g)
```