Cohort 10 chapter 12 (logical vectors): reworked presentation (#119)
* Cohort 10 ch 12 (logicals): add new summary

* Cohort 10 ch 12 (logicals): add examples, update and split into slides

* Cohort 10 ch 12 (logicals): add learning objectives

* Cohort 10 ch 12 (logicals): minor improvement

* Cohort 10 ch 12 (logicals): fix an explanation for cond transf

* Cohort 10 ch 12 (logicals): rearrange type compatibility in cond transf

* Cohort 10 ch 12 (logicals): group inside summarize()

---------

Co-authored-by: Ken Đinh Vũ <[email protected]>
florisvdh and Ken-Vu authored Jan 27, 2024
1 parent 99eb842 commit 7c15fb5
Showing 1 changed file with 188 additions and 63 deletions.
251 changes: 188 additions & 63 deletions 12-logical_vectors.Rmd
@@ -1,125 +1,192 @@
# (PART\*) Transform {-}

```{r, include=F}
library(dplyr)
library(nycflights13)
```

# Logical Vectors

**Learning objectives:**

- What are vectors, and why are we talking about logical vectors?

- How to use logical vectors & Boolean Algebra in
- Logical vectors: understanding what they are and why we use them
- Knowing how to generate logical vectors (variables)
- Knowing how to make use of logical vectors (variables):
- to filter data
- to create new variables
- to create summaries
- Understanding the effect of missing values in these operations

- Comparison

- Identifying missing values
## What and why {-}

- What are the logical summary functions we could leverage in R?
- This is all about conditions on data values: they return `TRUE` or `FALSE`.
This vector of `TRUE`/`FALSE` values is a logical vector.
- These conditions -- hence the resulting logical vectors -- play an important role in filtering, mutating and summarizing dataframe columns.

- Conditional Transformations using `if_else()` and `case_when()` functions
## Definition {-}

## What are Logical Vectors
Logical vector: vector of `TRUE`, `FALSE` and/or `NA` values.
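
A minimal base R illustration of this definition (the name `lgl` is made up for this sketch):

```r
# A logical vector can hold TRUE, FALSE and NA
lgl <- c(TRUE, FALSE, NA)

typeof(lgl)      # "logical"
is.logical(lgl)  # TRUE

# A bare NA is itself stored as a logical value
typeof(NA)       # "logical"
```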

- Atomic vectors in R are comparable to arrays in C: they hold multiple data values of the same type
## Challenging: `NA` values {-}

- The mental model for using Logical Vectors
The trickiest part of **operations** with logical vectors is the effect of missing values.

- Defining constraints on your data to answer Binary Questions
## Operations overview {-}

## How to use logical vectors & Boolean Algebra
- Operations that don't change vector lengths
- comparisons: **generate a logical vector** from a non-logical vector
- boolean algebra: **generate a logical vector** from other logical vectors
- conditional transformations: generate a new vector **from (hierarchical) conditions** (~ logical vectors ~ comparisons)
- **Subsetting** vectors with a logical vector
- **Summarizing** logical vectors

![](images/14_venn_diagrams.png)
## Generating a logical vector (1) {-}

- Combining logical vectors using Boolean algebra
- either by doing (vectorized) **comparisons**

- Venn diagram here shows the logical relations between logical vectors
operators:

### Comparison
- one-to-one: `==`, `!=`, `<`, `<=`, `>`, `>=`
- one-to-many: `%in%`

\
Comparisons are often the way that logical vectors arise, i.e. during exploration, cleaning and analysis
- ... unless the logical vector is already provided, e.g. the observed variable is boolean.
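
A small base R sketch of both kinds of operators; the vectors `delay` and `carrier` are invented for illustration and are not taken from `flights`:

```r
delay   <- c(-5, 0, 12, 45)            # hypothetical delays in minutes
carrier <- c("AA", "UA", "DL", "AA")   # hypothetical carrier codes

# one-to-one: each element is compared with the single value on the right
is_late <- delay > 0
is_late                      # FALSE FALSE TRUE TRUE

# one-to-many: each element is checked against a whole set of values
is_selected <- carrier %in% c("AA", "DL")
is_selected                  # TRUE FALSE TRUE TRUE
```

Both results have the same length as the input, which is what makes them usable inside `mutate()` and `filter()`.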

- numeric comparison creates logical vectors, and you can see that in the following example
## Generating a logical vector (2) {-}

```{r}
library(tidyverse)
library(nycflights13)
flights |>
mutate(daytime = dep_time > 600, .keep = "used")
```

- There is an intermediate execution step that forms the logical vector behind the scenes when you use `filter()`
## Generating a logical vector (3) {-}

```{r}
flights |> filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
flights |>
mutate(daytime = dep_time > 600, .keep = "used") |>
filter(daytime)
```

- To separate this intermediate step into its own variable we use `mutate()`
## Generating a logical vector (4) {-}

```{r}
flights |> mutate( daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < 20, .keep = "used" )
1:12 %in% c(1, 5, 11)
```

- R does not round floating-point values when storing them, but the console prints a rounded representation, so numbers that differ slightly can look identical
## Generating a logical vector (5) {-}

- To compare such numbers within a small tolerance (rather than exactly with `==`), we could use the `dplyr::near()` function
A comparison _**is**_ a logical vector...

```{r}
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
typeof(x) # double
class(flights$dep_time > 600)
length(flights$dep_time > 600)
head(flights$dep_time > 600)
```
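
One caveat for comparisons on doubles: `==` tests exact equality, so tiny floating-point errors give `FALSE`. A sketch using `dplyr::near()`, which compares within a tolerance (the names `exact` and `approx` are chosen here for illustration):

```r
library(dplyr)

x <- c(1 / 49 * 49, sqrt(2)^2)   # mathematically 1 and 2

exact  <- x == c(1, 2)           # exact comparison
approx <- near(x, c(1, 2))       # comparison within a tolerance

exact    # FALSE FALSE
approx   # TRUE TRUE
```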

```{r}
print(x)
print(x,digits = 16)
## Generating a logical vector (6) {-}

... so the logical vector does not have to be stored in order to filter data -- it is created on the fly:

```{r}
flights |>
filter(dep_time > 600)
```

### Identifying Missing Values

- An `NA` in a logical vector gives no indication whatsoever of the underlying value; the author's reading is that a missing value in a logical vector could be either `TRUE` or `FALSE`
- `TRUE | TRUE` and `FALSE | TRUE` are both `TRUE`, so `NA | TRUE` must also be `TRUE`. Similar reasoning applies with `NA & FALSE`
## Generating a logical vector (7) {-}

```{r}
df <- tibble(x = c(TRUE, FALSE, NA))
- either by doing (vectorized) comparisons (see before)
- or by **combining** logical vectors or comparisons = **boolean algebra**

- operators: `&`, `|`, `!`, `xor()`
- `{magrittr}` provides the aliases `and()`, `or()`, `not()` (attach `{magrittr}` itself for these; `{dplyr}` re-exports only the pipe `%>%`)

\
In numerical operations, `FALSE` is 0 and `TRUE` is 1.
Therefore:

- `&` can be mimicked by `pmin()`
- `|` can be mimicked by `pmax()`

df |>
## Generating a logical vector (8) {-}

Boolean operators:

![](images/14_venn_diagrams.png)

## Generating a logical vector (9) {-}

```{r}
flights |>
mutate(
and = x & (NA),
or = x | NA
daytime = dep_time > 600 & dep_time < 2000,
.keep = "used"
)
```
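
The Boolean operators, and the `pmin()`/`pmax()` analogy for `&` and `|`, can be checked on two small vectors. This is a base R sketch; `a` and `b` are made-up inputs that cover all four `TRUE`/`FALSE` combinations:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b       # TRUE FALSE FALSE FALSE
a | b       # TRUE TRUE TRUE FALSE
!a          # FALSE FALSE TRUE TRUE
xor(a, b)   # FALSE TRUE TRUE FALSE

# With FALSE as 0 and TRUE as 1, the element-wise minimum behaves like `&`
# and the element-wise maximum behaves like `|`
and_via_pmin <- as.logical(pmin(as.integer(a), as.integer(b)))
or_via_pmax  <- as.logical(pmax(as.integer(a), as.integer(b)))

identical(and_via_pmin, a & b)   # TRUE
identical(or_via_pmax, a | b)    # TRUE
```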

## Missing values (1) {-}

- For identifying missing values we use `is.na(x)` function which works with any type of vector and returns TRUE for missing values and FALSE for everything else.
In most cases an `NA` value (vector element) is regarded '_missing so we can't (always) know the outcome_':

- hence `NA` in **comparisons** will always return `NA`.
- so `x == NA` will just return `NA` for _all_ elements.
- check for missing values with **`is.na()`**: `TRUE` for missing values and `FALSE` for everything else

## Missing values (2) {-}

```{r}
c(TRUE, NA, FALSE) == NA # NOT useful!!
is.na(c(TRUE, NA, FALSE))
#> [1] FALSE TRUE FALSE
is.na(c(1, NA, 3))
#> [1] FALSE TRUE FALSE
is.na(c("a", NA, "b"))
#> [1] FALSE TRUE FALSE
```

- `%in%` asks, for each value in a set A, whether it is included in a set B of values
## Missing values (3) {-}

In most cases an `NA` value (vector element) is regarded '_missing so we can't (always) know the outcome_':

- `NA` in comparisons will always return `NA` (see before).
- `NA` in **boolean algebra**: sometimes the outcome is known, sometimes not (hence `NA`):
- `TRUE & NA` is `NA` but `FALSE & NA` is `FALSE`
- `TRUE | NA` is `TRUE` but `FALSE | NA` is `NA`

## Missing values (4) {-}

```{r}
1:12 %in% c(1, 5, 11)
c(TRUE, FALSE) & NA
c(TRUE, FALSE) | NA
```

## What are the logical summary functions we could leverage in R?
## Missing values (5) {-}

- Main logical summaries: `any()` and `all()`
But `%in%` works differently:

- Numeric logical summaries: `mean()` and `sum()`, especially when you want to calculate the proportion or count of elements that satisfy a constraint or condition
- `NA %in% NA` returns `TRUE`: here `NA` is just regarded as a special value

## Conditional Transformations using `if_else()` and `case_when()` functions
```{r}
flights |>
filter(dep_time %in% c(NA, 0800))
```
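
The difference between `==` and `%in%` for missing values shows up on a tiny vector. A base R sketch with a made-up `x`:

```r
x <- c(1, NA, 3)

# `==` propagates the missingness: every comparison with NA is NA
eq_na <- x == NA
eq_na      # NA NA NA

# `%in%` treats NA as a value that can be matched, and never returns NA
in_na <- x %in% NA
in_na      # FALSE TRUE FALSE

in_set <- x %in% c(NA, 3)
in_set     # FALSE TRUE TRUE
```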

## Conditional transformations (1) {-}

- We see this type of transformation everywhere when we clean data
Aim: generate a vector where each element is determined by the value of one or multiple conditions (~ comparisons).

- It's inspired by the SQL way of doing conditional transformations
- one condition: use `if_else()`
- `if_else(condition, true, false, missing = NULL)`

- multiple (hierarchical) conditions: use `case_when(..., .default = NULL)`.
- the _first_ condition that is `TRUE` for an element determines the outcome for that element.

- We use [dplyr::if_else()](https://dplyr.tidyverse.org/reference/if_else.html) when we have just one condition and we want to map it to only two outcomes
## Conditional transformations (2) {-}

```{r}
x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve", "???")
```

- We use dplyr's `case_when()` when we want to map multiple conditions to multiple different outcomes.
## Conditional transformations (3) {-}

```{r}
x <- c(-3:3, NA)
@@ -131,23 +198,81 @@ case_when(
)
```
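
To make the "first `TRUE` condition wins" rule concrete, here is a separate, made-up `case_when()` call. It is not the chunk collapsed in the diff above; the labels and thresholds are invented, and `.default` assumes dplyr 1.1.0 or later:

```r
library(dplyr)

x <- c(-3:3, NA)

size <- case_when(
  x > 2  ~ "big",        # checked first, so 3 becomes "big" ...
  x > 0  ~ "positive",   # ... even though 3 > 0 is also TRUE
  x == 0 ~ "zero",
  .default = "other"     # no condition is TRUE: negative values and NA
)

size
# "other" "other" "other" "zero" "positive" "positive" "big" "other"
```

The `NA` element ends up in `.default` because none of its conditions evaluates to `TRUE`.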

- Note that both [`if_else()`](https://dplyr.tidyverse.org/reference/if_else.html) and [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) require **compatible** types in the output
## Conditional transformations (4) {-}

The different outcomes must be **compatible types**!

Some examples:

- numerical and logical
- strings and factors.
- `NA` is compatible with everything.
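
A hedged sketch of what compatibility means in practice, assuming dplyr 1.1.0 or later (earlier versions were stricter about types); the inputs are invented:

```r
library(dplyr)

x <- c(-1, 2, NA)

# NA is compatible with everything: here it is combined with a numeric
with_na <- if_else(x > 0, x, NA)
with_na                      # NA 2 NA

# logical and numeric are compatible: the result is numeric
lgl_num <- if_else(x > 0, TRUE, 0)
is.numeric(lgl_num)          # TRUE

# character and numeric are not compatible: this call throws an error,
# which is caught here so the sketch keeps running
chr_num <- tryCatch(
  if_else(x > 0, "pos", 0),
  error = function(e) "incompatible types"
)
chr_num                      # "incompatible types"
```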


## Subsetting vectors {-}

I.e. keep only a subset of a vector, drop the rest, based on some condition.

- This is base R!
- Put a logical vector in the brackets (obtained by one of the previous techniques; often comparison)

E.g.:

```r
condition <- flights$arr_delay > 0
flights$arr_delay[condition]
flights$dep_time[condition]
# or just:
flights$dep_time[flights$arr_delay > 0]
```
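
One thing to watch for with logical subsetting in base R: an `NA` in the condition yields an `NA` in the result rather than dropping the element. A small sketch with an invented `delay` vector:

```r
delay <- c(10, NA, -5, 3)   # hypothetical delays, one missing

# the NA in `delay > 0` is carried into the subset
kept_with_na <- delay[delay > 0]
kept_with_na                 # 10 NA 3

# which() returns only the positions where the condition is TRUE
kept_which <- delay[which(delay > 0)]
kept_which                   # 10 3

# or exclude missing values explicitly in the condition
kept_explicit <- delay[!is.na(delay) & delay > 0]
kept_explicit                # 10 3
```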


## Summary
## Summarizing logical vectors (1) {-}

- A logical vector is just a vector of Boolean elements (`TRUE`, `FALSE`, possibly `NA`)
- Summarizing the whole vector:
- `any()`, `all()`: return a logical
- `sum()`, `mean()`: return a numeric

- R is designed to evaluate logical vectors based on Boolean algebra rules
- Summarizing a subset:
- apply a summary function to a subsetted vector

- Logical summaries are a powerful tool for summarizing data based on a logical condition
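- If `NA` values are present, the summary result can be `NA` (always for `sum()` and `mean()`; for `any()` and `all()` only when the non-missing values do not already decide the outcome), BUT the `NA` values can also be ignored with: **`na.rm = TRUE`**.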
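
A base R sketch of how each summary function reacts to a missing value; `late` is a made-up logical vector:

```r
late <- c(TRUE, FALSE, NA)

# any()/all() can still give a definite answer when the known values decide it
any(late)                  # TRUE  (one TRUE is enough)
all(late)                  # FALSE (one FALSE is enough)
any(c(FALSE, NA))          # NA    (the missing value could be TRUE)

# sum()/mean() return NA as soon as a value is missing ...
sum(late)                  # NA
mean(late)                 # NA

# ... unless the missing values are dropped first
n_late    <- sum(late, na.rm = TRUE)
prop_late <- mean(late, na.rm = TRUE)
n_late                     # 1
prop_late                  # 0.5
```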

- Conditional transformation is a tool that powers the data transformation step of an analysis and also helps you answer yes-or-no questions about the data
## Summarizing logical vectors (2) {-}

## Learning More
```{r}
flights |>
summarize(
all_delayed = all(dep_delay <= 60, na.rm = TRUE),
any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
.by = c(year, month, day)
)
```


## Summarizing logical vectors (3) {-}

```{r}
flights |>
summarize(
proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
.by = c(year, month, day)
)
```

## Summarizing logical vectors (4) {-}

```{r}
flights |>
summarize(
behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
n = n(),
.by = c(year, month, day)
)
```

- [r4ds.io/join](r4ds.io/join) for more book clubs!
- [R Graph Gallery](https://www.r-graph-gallery.com/ggplot2-package.html)
- The [Graphs section](http://www.cookbook-r.com/Graphs/) of the R Cookbook

## Meeting Videos
