Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

Unzip the archive:

unzip("activity.zip")

Load the data:

raw_activity <- read.csv("activity.csv", colClasses = c("integer", "Date", "integer"))

Load ggplot2 and dplyr libraries:

library(ggplot2)
library(dplyr)

What is mean total number of steps taken per day?

Group input data by steps (NA values are filtered away):

daily.steps <- raw_activity %>%
    na.omit() %>%
    group_by(date) %>%
    summarize(steps = sum(steps))

A histogram of days per number of steps:

g1 <- ggplot(daily.steps, aes(x = steps)) +
    geom_histogram(bins = 20, fill = "salmon") +
    xlab("Steps") +
    ylab("Number of days")
print(g1)

Summary:

summary(daily.steps$steps)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    6778   10400    9354   12810   21190

What is the average daily activity pattern?

Activity, grouped by interval:

avg.intervals <- raw_activity %>%
    na.omit() %>%
    group_by(interval) %>%
    summarize(steps = mean(steps, na.rm = TRUE)) %>%
    as.data.frame()

Plot of average daily activity:

g2 <- ggplot(avg.intervals, aes(x = interval, y = steps)) +
    geom_area(colour = "darkgray", fill = "lightgray") +
    xlab("Minutes") +
    ylab("Steps")
print(g2)

Maximum steps were taken in

avg.intervals[which.max(avg.intervals$steps), 1]

## [1] 835

minutes (in average).

Imputing missing values

The number of missing values:

sum(is.na(raw_activity))

## [1] 2304

Missing values will be imputed with averaged interval activity value:

daily.steps.new <- raw_activity %>%
    left_join(avg.intervals, by = "interval") %>%
    mutate(steps = ifelse(is.na(steps.x), steps.y, steps.x)) %>%
    select(-steps.x, -steps.y) %>%
    group_by(date) %>%
    summarize(steps = sum(steps, na.rm = TRUE))

Comparrisson of histograms with and without imputting:

g3 <- ggplot() +
    geom_histogram(data = daily.steps, bins = 20, aes(steps, fill = "With NA's", y= -..count..)) +
    geom_histogram(data = daily.steps.new, bins = 20, aes(steps, fill = "With imputed", y= ..count..)) +
    xlab("Steps") +
    ylab("Number of days") +
    scale_fill_hue("Type")
print(g3)

In general both datasets seem similar, apart from a peak at approximately 11000 steps.

Summary of the imputed dataset:

summary(daily.steps.new$steps)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    9819   10770   10770   12810   21190

After imputing median and mean values both became equal to 10770.

Are there differences in activity patterns between weekdays and weekends?

Grouping by Weekday/Weekend

weekly.activity <- raw_activity %>%
    transform(weekday = as.factor(ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "Weekend", "Weekday"))) %>%
    group_by(weekday, interval) %>%
    summarize(steps = mean(steps, na.rm = TRUE))

Activity profiles:

g4 <- ggplot(weekly.activity, aes(x = interval, y = steps, fill = weekday, colour = weekday)) +
    geom_area(alpha=0.6, lwd=0.8)
print(g4)

Activity profiles have similar shape, but overall, during weekends level of activity is significantly higher.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PA1_template.md

PA1_template.md

Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

What is mean total number of steps taken per day?

What is the average daily activity pattern?

Imputing missing values

Are there differences in activity patterns between weekdays and weekends?

Files

PA1_template.md

Latest commit

History

PA1_template.md

File metadata and controls

Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

What is mean total number of steps taken per day?

What is the average daily activity pattern?

Imputing missing values

Are there differences in activity patterns between weekdays and weekends?