Commit

Merge pull request #185 from SISBID/summarize-2024
Small modifications to summarize to include some brief info about factors
avahoffman authored Aug 13, 2024
2 parents 9277434 + 382a11d commit 9ab3e59
Showing 7 changed files with 5,247 additions and 690 deletions.
15 changes: 7 additions & 8 deletions labs/data-summarization-lab-key.Rmd
@@ -21,7 +21,7 @@ library(tidyverse)
circ <- read_csv("https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv")
```

- 1. How many days are in the data set? You can assume each observation/row is a different day (hint: get the number of rows).
+ 1. Each row is a different day. How many days are in the data set?

```{r q1}
nrow(circ)
@@ -51,30 +51,29 @@ circ %>%
count(is.na(daily))
```

- 4. Group the data by day of the week (`day`). Next, find the mean daily ridership (`daily` column) and the sample size. (hint: use `group_by` and `summarize` functions)
+ 4. Group the data by day of the week (`day`). Find the mean daily ridership (`daily` column). (hint: use `group_by` and `summarize` functions)

```{r q4}
circ %>%
group_by(day) %>%
- summarise(mean = mean(daily, na.rm = TRUE),
- n = n())
+ summarize(mean = mean(daily, na.rm = TRUE))
```

- ## **Extra practice:**
+ ## **Practice on your own**

5. What is the median of `orangeBoardings`(use `median()`).

```{r q6}
circ %>%
- summarise(median = median(orangeBoardings, na.rm = TRUE))
+ summarize(median = median(orangeBoardings, na.rm = TRUE))
# OR
circ %>% pull(orangeBoardings) %>% median(na.rm = TRUE)
```

- 6. Take the median of `orangeBoardings`(use `median()`), but this time stratify by day of the week.
+ 6. Take the median of `orangeBoardings`(use `median()`), but this time group by day of the week.

```{r q7}
circ %>%
group_by(day) %>%
- summarise(median = median(orangeBoardings, na.rm = TRUE))
+ summarize(median = median(orangeBoardings, na.rm = TRUE))
```
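
The earlier wording of question 4 also asked for the sample size, which the old answer reported with `n = n()`. The following is a minimal sketch of that pattern on made-up data (the `toy` data frame and the `n_present` column are illustrative and not part of the lab). It shows that `n()` counts every row in a group, while `mean(..., na.rm = TRUE)` averages only the non-missing values:

```r
library(dplyr)

# Made-up rows standing in for the ridership data (not real values).
toy <- data.frame(
  day   = c("Monday", "Monday", "Friday", "Friday", "Friday"),
  daily = c(100, NA, 300, 500, NA),
  stringsAsFactors = FALSE
)

by_day <- toy %>%
  group_by(day) %>%
  summarize(
    mean      = mean(daily, na.rm = TRUE), # NAs are dropped before averaging
    n         = n(),                       # rows per group, NAs included
    n_present = sum(!is.na(daily))         # rows that actually enter the mean
  )

by_day
```

`summarise()` and `summarize()` are two spellings of the same dplyr function, so the rename in this commit does not change any output.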
126 changes: 70 additions & 56 deletions labs/data-summarization-lab-key.html
@@ -357,101 +357,115 @@ <h1 class="title toc-ignore">Data Summarization Lab Key</h1>
<h2>Data used</h2>
<p>Circulator Lanes Dataset: the data is from <a href="https://data.baltimorecity.gov/Transportation/Charm-City-Circulator-Ridership/wwvu-583r" class="uri">https://data.baltimorecity.gov/Transportation/Charm-City-Circulator-Ridership/wwvu-583r</a></p>
<p>Available on: <a href="https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv" class="uri">https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv</a></p>
- <pre class="r"><code>library(tidyverse)
-
- circ &lt;- read_csv(&quot;https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv&quot;)</code></pre>
+ <pre class="r"><code>library(tidyverse)</code></pre>
+ <pre><code>## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+ ## ✔ dplyr 1.1.4 ✔ readr 2.1.5
+ ## ✔ forcats 1.0.0 ✔ stringr 1.5.1
+ ## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
+ ## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+ ## ✔ purrr 1.0.2
+ ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+ ## ✖ dplyr::filter() masks stats::filter()
+ ## ✖ dplyr::lag() masks stats::lag()
+ ## ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</code></pre>
+ <pre class="r"><code>circ &lt;- read_csv(&quot;https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv&quot;)</code></pre>
+ <pre><code>## Rows: 1146 Columns: 15
+ ## ── Column specification ────────────────────────────────────────────────────────
+ ## Delimiter: &quot;,&quot;
+ ## chr (2): day, date
+ ## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
+ ##
+ ## ℹ Use `spec()` to retrieve the full column specification for this data.
+ ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.</code></pre>
<ol style="list-style-type: decimal">
- <li>How many days are in the data set? You can assume each
- observation/row is a different day (hint: get the number of rows).</li>
+ <li>Each row is a different day. How many days are in the data set?</li>
</ol>
<pre class="r"><code>nrow(circ)</code></pre>
- <pre><code>[1] 1146</code></pre>
+ <pre><code>## [1] 1146</code></pre>
<pre class="r"><code>dim(circ)</code></pre>
- <pre><code>[1] 1146 15</code></pre>
+ <pre><code>## [1] 1146 15</code></pre>
<pre class="r"><code>circ %&gt;%
nrow()</code></pre>
- <pre><code>[1] 1146</code></pre>
+ <pre><code>## [1] 1146</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>What is the total (sum) number of boardings on the green bus
(<code>greenBoardings</code> column)?</li>
</ol>
<pre class="r"><code>sum(circ$greenBoardings, na.rm = TRUE)</code></pre>
- <pre><code>[1] 935564</code></pre>
+ <pre><code>## [1] 935564</code></pre>
<pre class="r"><code>circ %&gt;% pull(greenBoardings) %&gt;% sum(na.rm = TRUE)</code></pre>
- <pre><code>[1] 935564</code></pre>
+ <pre><code>## [1] 935564</code></pre>
<pre class="r"><code>count(circ, wt = greenBoardings)</code></pre>
- <pre><code># A tibble: 1 × 1
- n
- &lt;dbl&gt;
- 1 935564</code></pre>
+ <pre><code>## # A tibble: 1 × 1
+ ## n
+ ## &lt;dbl&gt;
+ ## 1 935564</code></pre>
<ol start="3" style="list-style-type: decimal">
<li>How many days are missing daily ridership (<code>daily</code>
column)? Use <code>is.na()</code> and <code>sum()</code>.</li>
</ol>
<pre class="r"><code>daily &lt;- circ %&gt;% pull(daily)
sum(is.na(daily))</code></pre>
- <pre><code>[1] 124</code></pre>
+ <pre><code>## [1] 124</code></pre>
<pre class="r"><code># Can also
circ %&gt;%
count(is.na(daily))</code></pre>
- <pre><code># A tibble: 2 × 2
- `is.na(daily)` n
- &lt;lgl&gt; &lt;int&gt;
- 1 FALSE 1022
- 2 TRUE 124</code></pre>
+ <pre><code>## # A tibble: 2 × 2
+ ## `is.na(daily)` n
+ ## &lt;lgl&gt; &lt;int&gt;
+ ## 1 FALSE 1022
+ ## 2 TRUE 124</code></pre>
<ol start="4" style="list-style-type: decimal">
- <li>Group the data by day of the week (<code>day</code>). Next, find the
- mean daily ridership (<code>daily</code> column) and the sample size.
- (hint: use <code>group_by</code> and <code>summarize</code>
- functions)</li>
+ <li>Group the data by day of the week (<code>day</code>). Find the mean
+ daily ridership (<code>daily</code> column). (hint: use
+ <code>group_by</code> and <code>summarize</code> functions)</li>
</ol>
<pre class="r"><code>circ %&gt;%
group_by(day) %&gt;%
- summarise(mean = mean(daily, na.rm = TRUE),
- n = n())</code></pre>
- <pre><code># A tibble: 7 × 3
- day mean n
- &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
- 1 Friday 8961. 164
- 2 Monday 7340. 164
- 3 Saturday 6743. 163
- 4 Sunday 4531. 163
- 5 Thursday 7639. 164
- 6 Tuesday 7642. 164
- 7 Wednesday 7779. 164</code></pre>
+ summarize(mean = mean(daily, na.rm = TRUE))</code></pre>
+ <pre><code>## # A tibble: 7 × 2
+ ## day mean
+ ## &lt;chr&gt; &lt;dbl&gt;
+ ## 1 Friday 8961.
+ ## 2 Monday 7340.
+ ## 3 Saturday 6743.
+ ## 4 Sunday 4531.
+ ## 5 Thursday 7639.
+ ## 6 Tuesday 7642.
+ ## 7 Wednesday 7779.</code></pre>
</div>
- <div id="extra-practice" class="section level2">
- <h2><strong>Extra practice:</strong></h2>
+ <div id="practice-on-your-own" class="section level2">
+ <h2><strong>Practice on your own</strong></h2>
<ol start="5" style="list-style-type: decimal">
<li>What is the median of <code>orangeBoardings</code>(use
<code>median()</code>).</li>
</ol>
<pre class="r"><code>circ %&gt;%
- summarise(median = median(orangeBoardings, na.rm = TRUE))</code></pre>
- <pre><code># A tibble: 1 × 1
- median
- &lt;dbl&gt;
- 1 3074</code></pre>
+ summarize(median = median(orangeBoardings, na.rm = TRUE))</code></pre>
+ <pre><code>## # A tibble: 1 × 1
+ ## median
+ ## &lt;dbl&gt;
+ ## 1 3074</code></pre>
<pre class="r"><code># OR
circ %&gt;% pull(orangeBoardings) %&gt;% median(na.rm = TRUE)</code></pre>
- <pre><code>[1] 3074</code></pre>
+ <pre><code>## [1] 3074</code></pre>
<ol start="6" style="list-style-type: decimal">
<li>Take the median of <code>orangeBoardings</code>(use
- <code>median()</code>), but this time stratify by day of the week.</li>
+ <code>median()</code>), but this time group by day of the week.</li>
</ol>
<pre class="r"><code>circ %&gt;%
group_by(day) %&gt;%
- summarise(median = median(orangeBoardings, na.rm = TRUE))</code></pre>
- <pre><code># A tibble: 7 × 2
- day median
- &lt;chr&gt; &lt;dbl&gt;
- 1 Friday 4014.
- 2 Monday 3336
- 3 Saturday 2963
- 4 Sunday 1900
- 5 Thursday 3485
- 6 Tuesday 3484
- 7 Wednesday 3576 </code></pre>
+ summarize(median = median(orangeBoardings, na.rm = TRUE))</code></pre>
+ <pre><code>## # A tibble: 7 × 2
+ ## day median
+ ## &lt;chr&gt; &lt;dbl&gt;
+ ## 1 Friday 4014.
+ ## 2 Monday 3336
+ ## 3 Saturday 2963
+ ## 4 Sunday 1900
+ ## 5 Thursday 3485
+ ## 6 Tuesday 3484
+ ## 7 Wednesday 3576</code></pre>
</div>
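
The grouped output lists the days alphabetically (Friday, Monday, Saturday, ...) because `day` is read in as a character column. The commit message mentions factors. One way they could apply here, sketched on made-up data with a hypothetical `weekday_levels` vector (neither is part of the lab), is to convert `day` to a factor so that the groups follow calendar order instead:

```r
library(dplyr)

# Made-up rows standing in for the ridership data (not real values).
toy <- data.frame(
  day = c("Monday", "Friday", "Sunday", "Friday"),
  orangeBoardings = c(10, 30, 5, 50),
  stringsAsFactors = FALSE
)

# Hypothetical level order: a calendar week starting on Monday.
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

by_day <- toy %>%
  mutate(day = factor(day, levels = weekday_levels)) %>%
  group_by(day) %>%
  summarize(median = median(orangeBoardings, na.rm = TRUE))

by_day  # rows follow the factor levels, not the alphabet
```

Levels with no rows (Tuesday, for example) are dropped from the grouped result by default, so only the days present in the data appear.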


8 changes: 4 additions & 4 deletions labs/data-summarization-lab.Rmd
@@ -21,7 +21,7 @@ library(tidyverse)
circ <- read_csv("https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv")
```

- 1. How many days are in the data set? You can assume each observation/row is a different day (hint: get the number of rows).
+ 1. Each row is a different day. How many days are in the data set?

```{r q1}
@@ -39,21 +39,21 @@ circ <- read_csv("https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circul
```

- 4. Group the data by day of the week (`day`). Next, find the mean daily ridership (`daily` column) and the sample size. (hint: use `group_by` and `summarize` functions)
+ 4. Group the data by day of the week (`day`). Find the mean daily ridership (`daily` column). (hint: use `group_by` and `summarize` functions)

```{r q4}
```

- ## **Extra practice:**
+ ## **Practice on your own**

5. What is the median of `orangeBoardings`(use `median()`).

```{r q6}
```

- 6. Take the median of `orangeBoardings`(use `median()`), but this time stratify by day of the week.
+ 6. Take the median of `orangeBoardings`(use `median()`), but this time group by day of the week.

```{r q7}