Unzip the archive:
unzip("activity.zip")
Load the data:
raw_activity <- read.csv("activity.csv", colClasses = c("integer", "Date", "integer"))
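The `colClasses` argument fixes the column types at read time instead of leaving them to be guessed. A minimal sketch on a throwaway file shows the effect; the three sample rows are made up for illustration, and only the column names follow the code above:

```r
# Write a tiny CSV with the same three columns as activity.csv.
# The values are invented for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"steps","date","interval"',
             'NA,"2012-10-01",0',
             '12,"2012-10-02",5',
             '7,"2012-10-02",10'), tmp)

sample_activity <- read.csv(tmp, colClasses = c("integer", "Date", "integer"))

# Column types are set on read, and NA stays a missing integer.
col_types <- sapply(sample_activity, class)
print(col_types)

unlink(tmp)
```

Parsing `date` as a `Date` up front means later calls such as `weekdays(date)` need no extra conversion.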
Load the ggplot2 and dplyr libraries:
library(ggplot2)
library(dplyr)
Group the input data by date and total the steps for each day (rows with NA values are dropped first):
daily.steps <- raw_activity %>%
na.omit() %>%
group_by(date) %>%
summarize(steps = sum(steps))
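Dropping NA rows before grouping is not the same as summing with `na.rm = TRUE`: a day whose every reading is missing disappears in the first case and appears with a total of 0 in the second. A small sketch on made-up data (the `toy` data frame is hypothetical) makes the difference visible:

```r
library(dplyr)

# Hypothetical toy data: the first day has no readings at all.
toy <- data.frame(
  steps    = c(NA, NA, 10L, 20L),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0L, 5L, 0L, 5L)
)

# Drop incomplete rows first, so the all-NA day vanishes from the result.
dropped <- toy %>%
  na.omit() %>%
  group_by(date) %>%
  summarize(steps = sum(steps))

# Keep every row and ignore NAs inside sum(); the all-NA day then gets
# a total of 0, which pulls the minimum and the mean down.
kept <- toy %>%
  group_by(date) %>%
  summarize(steps = sum(steps, na.rm = TRUE))

print(dropped)
print(kept)
```

Which variant is appropriate depends on whether a day without readings should count as a day with zero steps.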
Histogram of the total number of steps per day:
g1 <- ggplot(daily.steps, aes(x = steps)) +
geom_histogram(bins = 20, fill = "salmon") +
xlab("Steps") +
ylab("Number of days")
print(g1)
summary(daily.steps$steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    6778   10400    9354   12810   21190
Activity, grouped by interval:
avg.intervals <- raw_activity %>%
na.omit() %>%
group_by(interval) %>%
summarize(steps = mean(steps, na.rm = TRUE)) %>%
as.data.frame()
Plot of average daily activity:
g2 <- ggplot(avg.intervals, aes(x = interval, y = steps)) +
geom_area(colour = "darkgray", fill = "lightgray") +
xlab("Interval (HHMM)") +
ylab("Steps")
print(g2)
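On average, the most steps are taken in interval
avg.intervals[which.max(avg.intervals$steps), 1]
## [1] 835
The interval identifier encodes the time of day as HHMM, so this is the interval starting at 08:35, not minute 835 of the day.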
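An identifier such as 835 can be turned into a clock label with integer division and remainder. This is a minimal sketch that assumes the identifier is hours times 100 plus minutes; the helper name `interval_to_clock` is hypothetical:

```r
# Assumption: an identifier such as 835 means 08:35 (hours * 100 + minutes).
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

peak_label <- interval_to_clock(835L)
print(peak_label)
```

Under the same assumption, the identifiers are not evenly spaced (55 is followed by 100), which is worth keeping in mind when they are used directly as an x axis.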
The number of missing values:
sum(is.na(raw_activity))
## [1] 2304
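`sum(is.na(raw_activity))` counts missing cells across all columns at once. To see which columns the missing values sit in, a per-column count can be used instead; a sketch on made-up data (the `toy` data frame is hypothetical):

```r
# Hypothetical toy data with missing values only in the steps column.
toy <- data.frame(
  steps    = c(NA, 3L, NA),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02")),
  interval = c(0L, 5L, 0L)
)

# Total number of missing cells in the whole data frame.
total_missing <- sum(is.na(toy))

# Number of missing cells per column.
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```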
Missing values are imputed with the mean number of steps for the corresponding interval:
daily.steps.new <- raw_activity %>%
left_join(avg.intervals, by = "interval") %>%
mutate(steps = ifelse(is.na(steps.x), steps.y, steps.x)) %>%
select(-steps.x, -steps.y) %>%
group_by(date) %>%
summarize(steps = sum(steps, na.rm = TRUE))
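The same fill-in rule can also be written without a join. The following sketch uses base R's `ave()` on made-up data; the `toy` data frame and the `steps_filled` column are hypothetical names used only for illustration:

```r
# Hypothetical toy data: one missing reading in interval 0.
toy <- data.frame(
  steps    = c(NA, 4, 2, 8),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0L, 5L, 0L, 5L)
)

# Mean of the observed values per interval, repeated for every row.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Use the interval mean only where the original reading is missing.
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

print(toy)
```

Observed readings are left untouched; only the missing cell takes the mean of its interval (here 2).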
Comparison of the histograms with and without imputation:
g3 <- ggplot() +
geom_histogram(data = daily.steps, bins = 20, aes(steps, fill = "With NA's", y = -after_stat(count))) +
geom_histogram(data = daily.steps.new, bins = 20, aes(steps, fill = "With imputed", y = after_stat(count))) +
xlab("Steps") +
ylab("Number of days") +
scale_fill_hue("Type")
print(g3)
In general, the two distributions look similar, apart from a peak at approximately 11,000 steps.
Summary of the imputed dataset:
summary(daily.steps.new$steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      41    9819   10770   10770   12810   21190
After imputation, the median and the mean are both approximately 10770.
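One way to see why imputed days cluster at the mean: a day with no readings at all, once filled with interval means, totals the sum of those means. If every complete day covers the same set of intervals, that sum equals the mean daily total of the complete days. The sketch below checks this on made-up numbers; whether the missing readings in this dataset cover whole days is an assumption, not something shown above:

```r
# Hypothetical complete days: 3 days x 2 intervals, values invented.
complete <- data.frame(
  date     = rep(as.character(as.Date("2012-10-02") + 0:2), each = 2),
  interval = rep(c(0L, 5L), times = 3),
  steps    = c(1, 5, 3, 7, 8, 0)
)

# Mean steps per interval and total steps per day.
interval_means <- tapply(complete$steps, complete$interval, mean)
daily_totals   <- tapply(complete$steps, complete$date, sum)

# Total of a day filled entirely with interval means.
imputed_day_total <- sum(interval_means)

# Mean daily total over the complete days.
mean_daily_total <- mean(daily_totals)

print(c(imputed = imputed_day_total, mean = mean_daily_total))
```

Adding several days that sit exactly on the mean leaves the mean unchanged and pulls the median towards it.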
Grouping by weekday/weekend:
weekly.activity <- raw_activity %>%
transform(weekday = as.factor(ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "Weekend", "Weekday"))) %>%
group_by(weekday, interval) %>%
summarize(steps = mean(steps, na.rm = TRUE))
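`weekdays()` returns day names in the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch uses the numeric day of week from `as.POSIXlt()`; the helper name `day_type` is hypothetical:

```r
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday,
# independent of the locale.
day_type <- function(date) {
  wday <- as.POSIXlt(date)$wday
  factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
         levels = c("Weekday", "Weekend"))
}

# Friday, Saturday, Sunday, Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
labels <- day_type(dates)
print(labels)
```

The result can be assigned to a `weekday` column in the same way as the `ifelse()` expression above.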
Activity profiles:
g4 <- ggplot(weekly.activity, aes(x = interval, y = steps, fill = weekday, colour = weekday)) +
geom_area(alpha = 0.6, lwd = 0.8, position = "identity")
print(g4)
The activity profiles have a similar shape, but overall the level of activity appears higher on weekends.