title | author | date | knit | site | url | github-repo | twitter-handle | cover-image | documentclass | bibliography | biblio-style | link-citations | description | favicon | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DataAtlas |
|
2024-06-22 |
bookdown::render_book |
bookdown::bookdown_site |
https\://R-Computing-Lab/data-atlas |
R-Computing-Lab/data-atlas |
smasongarrison |
assets/logo.png |
book |
|
apalike |
true |
Data Atlas |
assets/favicon.ico |
The adage 'A picture is worth a thousand words' is just as applicable to Behavior Genetics as it is to journalism. In this talk, I delve into the rich diverse history of data visualization techniques in the field of behavior genetics, tracing advancements from Wright's [@Wright1920] classic path diagram of guinea pigs to the contemporary usage of Manhattan plots. This survey sets the stage for the development of two resources aimed at bolstering the field's visualization toolkit. The first, an atlas employing 'ggplot2' in R, aims to guide researchers in selecting the most appropriate visualization techniques for their data. The second, a proposed R package, seeks to facilitate the integration of 'OpenMx'-compatible visualizations, streamlining the presentation of complex genetic analyses. Emphasizing practical advice, the talk will cover effective uses of path diagrams, family trees, and interaction visualizations, with the goal of improving clarity and interpretability of genetic data. This initiative aspires to democratize advanced data visualization tools, making them more accessible to the behavior genetics community and, in turn, amplifying the impact and reach of its research.
In the dynamic field of behavior genetics, the ability to visualize complex data clearly and effectively is not just a technical skill but a narrative art. This document, accompanying the talk on the historical and innovative approaches to data visualization in behavior genetics, introduces the fundamentals of employing ggplot2, a powerful and versatile package in R for creating quality graphics.
ggplot2
is part of the tidyverse
, an opinionated collection of R
packages designed for data science that share an underlying design philosophy, grammar, and data structures. At the heart of ggplot2
is the concept of a layered grammar of graphics, which allows users to create graphics in a step-by-step, clear and logical way. This framework not only simplifies plotting complex data but also encourages a deeper understanding of the graphical representation itself—making it an ideal tool for behavior geneticists seeking to unvail the hidden stories within their data.
The Data Atlas presented here is designed to serve as a practical guide through the landscape of data visualization techniques specific to the field of behavior genetics. It offers a curated selection of graphical methods that range from the foundational, such as Wright's classic path diagrams, to the advanced, like the interaction visualizations and Manhattan plots, which have become indispensable in the modern geneticist’s toolkit.
Further, this document and the accompanying talk address the development of an innovative R package tailored to integrate with OpenMx and BGmisic, projects aimed at facilitating advanced statistical modeling of complex genetic data. This package is proposed to enhance the ggplot2
environment, making it more conducive to the specialized needs of behavior genetics research—such as the visualization of path diagrams and family trees that are not only informative but are also aesthetically compelling.
As we proceed, this introduction will delve into the specifics of ggplot2 usage in R, demonstrating how this tool can be adapted to meet the unique challenges of visualizing behavioral and genetic data. By bridging the gap between traditional methods and contemporary demands, the resources developed aim to democratize advanced data visualization tools, thereby expanding the accessibility and impact of behavior genetics research.
To navigate these notes, use the table of contents on the left side of the screen. You can open or close the table of contents using the hamburger icon (horizontal bars) at the top of the document. Additionally, there are other icons at the top of the document for searching within the text, and for adjusting the size, font, or color scheme of the page. The document will be updated unpredictably.
These notes lean heavily on other peoples' materials and ideas. I have done my best to document the origin of the materials and ideas. In particular, I have noted people whose work has been a major contribution as well as those who have additional contributions. You can see specific changes by examining the edit history on the git repo
- Jenny Bryan's (jennybryan.org) STAT 545 and Happy Git with R;
- Mason Garrison's PSY 720 Data Science for Psychologists
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This information is a human-readable summary of (and not a substitute for) the license. Please see https://creativecommons.org/licenses/by-sa/4.0/legalcode for the full legal text.
You are free to:
-
Share---copy and redistribute the material in any medium or format
-
Remix---remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
-
Attribution---You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
-
ShareAlike---If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
-
No additional restrictions---You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
These notes was written in bookdown inside RStudio. The website is hosted with github. The complete source is available from github.
The book style was designed by Desirée De Leon.
This version of the notes was built with:
## Finding R package dependencies ... Done!
## setting value
## version R version 4.4.1 (2024-06-14 ucrt)
## os Windows 11 x64 (build 22635)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.utf8
## ctype English_United States.utf8
## tz America/New_York
## date 2024-06-22
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
Along with these packages:
<div class="datatables html-widget html-fill-item" id="htmlwidget-c3f40e2290c7f0da15ff" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-c3f40e2290c7f0da15ff">{"x":{"filter":"top","vertical":false,"filterHTML":"<tr>\n <td data-type=\"character\" style=\"vertical-align: top;\">\n <div class=\"form-group has-feedback\" style=\"margin-bottom: auto;\">\n <input type=\"search\" placeholder=\"All\" class=\"form-control\" style=\"width: 100%;\"/>\n <span class=\"glyphicon glyphicon-remove-circle form-control-feedback\"><\/span>\n <\/div>\n <\/td>\n <td data-type=\"character\" style=\"vertical-align: top;\">\n <div class=\"form-group has-feedback\" style=\"margin-bottom: auto;\">\n <input type=\"search\" placeholder=\"All\" class=\"form-control\" style=\"width: 100%;\"/>\n <span class=\"glyphicon glyphicon-remove-circle form-control-feedback\"><\/span>\n <\/div>\n <\/td>\n <td data-type=\"character\" style=\"vertical-align: top;\">\n <div class=\"form-group has-feedback\" style=\"margin-bottom: auto;\">\n <input type=\"search\" placeholder=\"All\" class=\"form-control\" style=\"width: 100%;\"/>\n <span class=\"glyphicon glyphicon-remove-circle form-control-feedback\"><\/span>\n <\/div>\n <\/td>\n <td data-type=\"character\" style=\"vertical-align: top;\">\n <div class=\"form-group has-feedback\" style=\"margin-bottom: auto;\">\n <input type=\"search\" placeholder=\"All\" class=\"form-control\" style=\"width: 100%;\" disabled=\"\"/>\n <span class=\"glyphicon glyphicon-remove-circle form-control-feedback\"><\/span>\n <\/div>\n <\/td>\n<\/tr>","data":[["base64enc","bookdown","bslib","cachem","cli","digest","evaluate","fastmap","fontawesome","fs","glue","highr","htmltools","jquerylib","jsonlite","knitr","lifecycle","memoise","mime","R6","rappdirs","rlang","rmarkdown","sass","tinytex","xfun","yaml"],[null,"0.39","0.7.0","1.1.0","3.6.2","0.6.35","0.23","1.2.0",null,"1.6.4","1.7.0",null,"0.5.8.1","0.1.4","1.8.8","1.46","1.0.4","2.0.1","0.12","2.5.1",null,"1.1.3","2.27","0.4.9",null,"0.44","2.3.8"],["2015-07-28","2024-04-15","2024-03-29","2024-05-16","2023-12-11","2024-03-11","2023-11-01","2024-05-15","2023-08-19","2024-04-25","2024-01-09","2024-05-26","2024-04-04","2021-04-26","2023-12-04","2024-04-06","2023-11-07","2021-11-26","2021-09-28","2021-08-19","2021-01-31","2024-01-10","2024-05-17","2024-03-15","2024-05-06","2024-05-15","2023-12-11"],["CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)","CRAN (R 4.4.0)"]],"container":"<table class=\"cell-border stripe\">\n <thead>\n <tr>\n <th>package<\/th>\n <th>loadedversion<\/th>\n <th>date<\/th>\n <th>source<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"pageLength":27,"autoWidth":true,"bInfo":false,"paging":false,"columnDefs":[{"name":"package","targets":0},{"name":"loadedversion","targets":1},{"name":"date","targets":2},{"name":"source","targets":3}],"order":[],"orderClasses":false,"orderCellsTop":true,"lengthMenu":[10,25,27,50,100]},"selection":{"mode":"multiple","selected":null,"target":"row","selectable":null}},"evals":[],"jsHooks":[]}</script>
This chapter contains a brief introduction to the tidyverse
package suite, which is a collection of R packages designed to make data manipulation and visualization easier and more intuitive. The tidyverse
packages share a common philosophy and syntax, making it easy to learn and use them together. The resulting code is self-documenting, and thus easier to maintain, read, and share, making it ideal for data analysis and visualization.
This crash course is extremely abridged. For a more in-depth treatment of the tidyverse
, I recommend checking out my multimedia textbook Data Science for Psychologists and the R for Data Science book by Hadley Wickham and Garrett Grolemund.
The tidyverse
is a collection of R packages that share a common philosophy and syntax. The packages are designed to work together to make data manipulation and visualization easier and more intuitive. The tidyverse
includes some of the most popular and widely used R packages, such as ggplot2
, dplyr
, tidyr
, and readr
. These packages are built on the principles of tidy data, which is a standard way of organizing data that makes it easy to work with.
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Tidy data has the following properties:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Tidy data is easy to manipulate, visualize, and model, and it is the preferred format for data analysis in the tidyverse
. The tidyverse
packages are designed to work with tidy data, and they provide tools and functions that make it easy to transform and visualize data in this format. Untidy data does not have these properties, and it can be more difficult to work with.
There are some key differences between base R and the tidyverse
. I like to think of them as different dialects. (I'll often say I was raised in based R.) Base R is powerful and flexible, but it can be difficult to learn and use, especially for beginners. The tidyverse
, on the other hand, is designed to be easy to learn and use, with a consistent syntax and a focus on data manipulation and visualization. The tidyverse
also includes a number of functions and tools that make common data analysis tasks easier and more intuitive.
Someone who is familiar with base R might notice several key differences when they're introduced to the tidyverse. In my experience the main differences are:
- Syntax and Readability: The tidyverse uses a more consistent and readable syntax that emphasizes clarity and the flow of operations. For example, it leverages the
%>%
pipe operator from the magrittr package extensively, allowing for code to be written in a way that is easy to read from left to right, much like a series of steps.
Pipe operators are used in many programming languages to chain together operations. In R, the pipe operator %>%
is used to pass the output of one function to the next function as the first argument. This allows you to write code in a way that is easy to read and understand, with each step of the data analysis process clearly laid out. To use the pipe operator, you simply place it between two functions, like this:
output <- function1(input) %>% function2()
When these pipes are chained together, they create a clear and readable sequence of operations that can be easily understood by others. This makes it easier to write and maintain code, and it can also help you avoid errors by breaking down complex tasks into smaller, more manageable steps. Here's an example of how the pipe operator can be used to chain together a series of data manipulation operations in the tidyverse:
data %>%
filter(condition) %>%
select(variables) %>%
mutate(new_variable = expression) %>%
group_by(grouping_variable) %>%
summarize(summary_statistic)
Much of the rest of the differences are which functions are used for common tasks. For example:
-
Data Manipulation: While base R uses functions like
subset
,merge
, andapply
, the tidyverse offersdplyr
which provides functions likefilter
,select
,mutate
,summarize
, andjoin
. These functions are often seen as more intuitive and powerful for data manipulation tasks. -
Data Import and Tidying: Base R provides functions like
read.table
andread.csv
for data import, and various base functions for data cleaning and reshaping (e.g.,reshape
,aggregate
). The tidyverse, on the other hand, hasreadr
for reading data andtidyr
for tidying data. The tidying functions, such as gather and spread (now evolved intopivot_longer
andpivot_wider
), focus on the concept of "tidy data" where each variable forms a column, each observation forms a row, and each type of observational unit forms a table. -
Handling Factor Variables: In base R, factors are often automatically created when importing data which can lead to unexpected behaviors if not properly managed. The tidyverse's
forcats
package provides tools that make factor management more explicit and less error-prone. -
Functional Programming: Base R uses apply functions like
lapply
andsapply
for operations on lists or vectors. The tidyverse introducespurrr
which provides a much needed enhancment for R functional programming with a consistent and coherent set of tools for working with functions and vectors.
Other differences are more subtle but can have a big impact on the user experience, such as how graphics can be handled and how data types are stored:
-
Visualization: Base R uses traditional plotting functions such as plot and hist. The tidyverse offers ggplot2, based on the grammar of graphics, which allows for building plots layer by layer and integrates well with the rest of the tidyverse data manipulation packages.
-
Data Types and Storage: The tidyverse introduces tibbles via the tibble package, which are modern reimaginings of data frames. Tibbles are data frames but slightly tweaked to work better in the tidyverse; for example, they never change a variable's type when adding rows, don't use row names, and provide a better print method that shows only the first 10 rows and all columns that fit on the screen.
Overall, the tidyverse provides a coherent system of packages that facilitate data manipulation, analysis, and visualization in a way that is generally considered to be more user-friendly, especially for those new to R or programming in general. However, this can also lead to a bit of a learning curve for those deeply entrenched in base R.
In this chapter, we explore the power of ggplot2
within the tidyverse package suite for creating compelling visual representations of twin studies in behavior genetics. ggplot2
's layer-oriented approach to building data visualizations allows researchers to intuitively map out the complexities inherent in twin data, providing insights that are crucial for both hypothesis testing and exploratory data analysis.
ggplot2
is a powerful data visualization package in R that is part of the tidyverse suite of packages. It is based on the grammar of graphics, a coherent system for describing and building visualizations. The grammar of graphics is based on the idea that a plot can be decomposed into a set of independent components, such as data, aesthetics, and geoms (geometric objects). By combining these components, ggplot2 allows for the creation of complex and informative visualizations that can be easily customized and extended.
The structure of the code for plots can be summarized as follows:
ggplot(data = [[dataset]],
mapping = aes(x = [[x-variable]],
y = [[y-variable]])) +
geom_xxx() +
other options
Each component of the plot is added in layers. The ggplot()
function initializes the plot, aes()
specifies the aesthetic mappings (how variables are mapped to visual properties), and geom_xxx()
adds a geometric object (points, lines, bars, etc.) to the plot. Additional layers can be added to further customize the plot, such as labels, titles, and themes. We'll dive into these soon enough, but first, let's walk through a simple example to illustrate the basic structure of a ggplot2 plot.
To make this more concrete, let's consider an example using twin data on height from the OpenMX
package, which is in the twinData
data. These 3,808 pairs of twins are from the Australian National Health and Medical Research Council Twin Registry. The dataset contains information on the height, weight, and body mass index (BMI) of twins, along with their zygosity and other demographic information.
library(tidyverse) # Load the tidyverse packages
library(OpenMx) # Load the OpenMx package
library(BGmisc) # Load the BGmisc package
library(conflicted) # to handle conflicts
conflicted::conflicts_prefer(OpenMx::vech,dplyr::filter) # Resolve conflicts
data(twinData)
glimpse(twinData)
#> Rows: 3,808
#> Columns: 16
#> $ fam <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
#> $ age <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
#> $ zyg <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ part <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
#> $ wt1 <int> 58, 54, 55, 66, 50, 60, 65, 40, 60, 76, 48, 70, 51, 53, 58, 4…
#> $ wt2 <int> 57, 53, 50, 76, 48, 60, 65, 39, 57, 64, 51, 67, 51, 55, 57, 5…
#> $ ht1 <dbl> 1.70, 1.63, 1.65, 1.57, 1.61, 1.60, 1.75, 1.56, 1.76, 1.70, 1…
#> $ ht2 <dbl> 1.70, 1.63, 1.68, 1.65, 1.63, 1.57, 1.77, 1.53, 1.77, 1.73, 1…
#> $ htwt1 <dbl> 20.1, 20.3, 20.2, 26.8, 19.3, 23.4, 21.2, 16.4, 19.4, 26.3, 2…
#> $ htwt2 <dbl> 19.7, 19.9, 17.7, 27.9, 18.1, 24.3, 20.7, 16.7, 18.2, 21.4, 2…
#> $ bmi1 <dbl> 21.0, 21.1, 21.0, 23.0, 20.7, 22.1, 21.4, 19.6, 20.7, 22.9, 2…
#> $ bmi2 <dbl> 20.9, 21.0, 20.1, 23.3, 20.3, 22.3, 21.2, 19.7, 20.3, 21.4, 2…
#> $ cohort <chr> "younger", "younger", "younger", "younger", "younger", "young…
#> $ zygosity <fct> MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, M…
#> $ age1 <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
#> $ age2 <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
I'm going to tidy up the data a bit by renaming the columns and removing some unnecessary columns. I'll also filter out any rows with missing values in the height columns. I've also added a column to represent the sex of the twins and their zygosity.
twinData_cleaned <- twinData %>%
rename( # Rename columns
family = fam,
weight_t1 = wt1,
weight_t2 = wt2,
height_t1 = ht1,
height_t2 = ht2,
bmi_t1 = bmi1,
bmi_t2 = bmi2,
age_t1 = age1,
age_t2 = age2,
heightweight_t1 = htwt1,
heightweight_t2 = htwt2) %>%
select(-c(# Remove unnecessary columns
part,
zyg,
age_t1,
age_t2)) %>%
filter(# Remove rows with missing values
!is.na(height_t1),
!is.na(height_t2)) %>%
mutate( # Create new columns
sex = case_when(zygosity %in% c("MZFF","DZFF") ~ "F",
zygosity %in% c("MZMM","DZMM") ~ "M",
TRUE ~ "OS"),
zyg = case_when(zygosity %in% c("MZFF","MZMM") ~ "MZ",
zygosity %in% c("DZFF","DZMM","DZOS") ~ "DZ",
TRUE ~ NA_character_ ))
glimpse(twinData_cleaned)
#> Rows: 3,667
#> Columns: 14
#> $ family <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ age <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22…
#> $ weight_t1 <int> 58, 54, 55, 66, 50, 60, 65, 40, 60, 76, 48, 70, 51, 53…
#> $ weight_t2 <int> 57, 53, 50, 76, 48, 60, 65, 39, 57, 64, 51, 67, 51, 55…
#> $ height_t1 <dbl> 1.70, 1.63, 1.65, 1.57, 1.61, 1.60, 1.75, 1.56, 1.76, …
#> $ height_t2 <dbl> 1.70, 1.63, 1.68, 1.65, 1.63, 1.57, 1.77, 1.53, 1.77, …
#> $ heightweight_t1 <dbl> 20.1, 20.3, 20.2, 26.8, 19.3, 23.4, 21.2, 16.4, 19.4, …
#> $ heightweight_t2 <dbl> 19.7, 19.9, 17.7, 27.9, 18.1, 24.3, 20.7, 16.7, 18.2, …
#> $ bmi_t1 <dbl> 21.0, 21.1, 21.0, 23.0, 20.7, 22.1, 21.4, 19.6, 20.7, …
#> $ bmi_t2 <dbl> 20.9, 21.0, 20.1, 23.3, 20.3, 22.3, 21.2, 19.7, 20.3, …
#> $ cohort <chr> "younger", "younger", "younger", "younger", "younger",…
#> $ zygosity <fct> MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, …
#> $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
#> $ zyg <chr> "MZ", "MZ", "MZ", "MZ", "MZ", "MZ", "MZ", "MZ", "MZ", …
The code structure for our plot would look like this:
ggplot(data = twinData_cleaned, mapping = aes(x = height_t1,
y = height_t2)) +
geom_point() +
labs(title = "Height Comparison between Twins")
It would produce a scatter plot comparing the height of twins with the height of twin 1 on the x-axis and the height of twin 2 on the y-axis.
The code above demonstrates the basic structure of a ggplot2 plot. To break down the key components of the plot you can press the left and right arrows on your keyboard to navigate through the slides below. Or if you prefer, you can click on the slide to advance to the next one. Below the slides you'll find the same content in a more traditional format.
#> Error in df_url$link: $ operator is invalid for atomic vectors
Start with the twinData_cleaned
data frame
ggplot(data = twinData_cleaned) #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis
ggplot(data = twinData_cleaned,
mapping = aes(x = height_t1)) #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis.
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2)) #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2)) +
geom_point() #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point.
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) + #<<
geom_point()
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point. Title the plot "Height Comparison between Twins"
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) +
geom_point() +
labs(title = "Height Comparison between Twins") #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point. Title the plot "Height Comparison between Twins", add the subtitle "by zygosity"
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) +
geom_point() +
labs(title = "Height Comparison between Twins",
subtitle = "by zygosity") #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point. Title the plot "Height Comparison between Twins", add the subtitle "by zygosity", label the x and y axes as "Height of Twin 1 (m)" and "Height of Twin 2 (m)", respectively
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) +
geom_point() +
labs(title = "Height Comparison between Twins",
subtitle = "by zygosity",
x = "Height of Twin 1 (m)", y = "Height of Twin 2 (m)") #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point. Title the plot "Height Comparison between Twins", add the subtitle "by zygosity", label the x and y axes as "Height of Twin 1 (m)" and "Height of Twin 2 (m)", respectively , label the legend "Zygosity"
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) +
geom_point() +
labs(title = "Height Comparison between Twins",
subtitle = "by zygosity",
x = "Height of Twin 1 (m)", y = "Height of Twin 2 (m)",
color = "Zygosity") #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point. Title the plot "Height Comparison between Twins", add the subtitle "by zygosity", label the x and y axes as "Height of Twin 1 (m)" and "Height of Twin 2 (m)", respectively , label the legend "Zygosity", and add a caption for the data source.
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) +
geom_point() +
labs(title = "Height Comparison between Twins",
subtitle = "by zygosity",
x = "Height of Twin 1 (m)", y = "Height of Twin 2 (m)",
color = "Zygosity",
caption = "Source: Australian National Health and Medical Research Council Twin Registry / OpenMx package") #<<
Start with the twinData_cleaned
data frame, map twin 1's height to the x-axis, and and map twin 2's height to the y-axis. Represent each observation with a point, and map zygosity to the color of each point. Title the plot "Height Comparison between Twins", add the subtitle "by zygosity", label the x and y axes as "Height of Twin 1 (m)" and "Height of Twin 2 (m)", respectively , label the legend "Zygosity", and add a caption for the data source. Finally, use a discrete color scale that is designed to be perceived by viewers with common forms of color blindness.
ggplot(data = twinData_cleaned,
mapping = aes( x = height_t1,
y = height_t2,
color = zyg)) +
geom_point() +
labs(title = "Height Comparison between Twins",
subtitle = "by zygosity",
x = "Height of Twin 1 (m)", y = "Height of Twin 2 (m)",
color = "Zygosity",
caption = "Source: Australian National Health and Medical Research Council Twin Registry / OpenMx package") +
scale_color_viridis_d() #<<
Wide form data is a common way to store data, especially when it comes from family data. surveys or experiments where each family has multiple members (twins, siblings etc) In this form, each row represents a family, and each column represents a variable or measurement for a specific family member. This format is easy to understand and work with, but it can be challenging for certain types of analyses or visualizations. In this section, demonstrate straightforward ways to import, summarize, and visualize wide form data using the twinData
dataset from the OpenMx
package.
We're going to use the twinData
dataset from the OpenMx
package. This dataset contains 3,808 pairs of Australian twins on the 12 variables.
library(tidyverse)
library(NlsyLinks)
library(discord)
library(BGmisc)
library(OpenMx)
library(conflicted) # to handle conflicts
conflicted::conflicts_prefer(OpenMx::vech,dplyr::filter) # Resolve conflicts
data(twinData)
df_wide <- twinData
Let's take a look at the structure of the dataset using the class
and glimpse
functions.
class(df_wide)
## [1] "data.frame"
glimpse(df_wide)
## Rows: 3,808
## Columns: 16
## $ fam <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ age <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
## $ zyg <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ part <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ wt1 <int> 58, 54, 55, 66, 50, 60, 65, 40, 60, 76, 48, 70, 51, 53, 58, 4…
## $ wt2 <int> 57, 53, 50, 76, 48, 60, 65, 39, 57, 64, 51, 67, 51, 55, 57, 5…
## $ ht1 <dbl> 1.7000, 1.6299, 1.6499, 1.5698, 1.6099, 1.5999, 1.7500, 1.559…
## $ ht2 <dbl> 1.7000, 1.6299, 1.6799, 1.6499, 1.6299, 1.5698, 1.7698, 1.529…
## $ htwt1 <dbl> 20.0692, 20.3244, 20.2020, 26.7759, 19.2894, 23.4375, 21.2245…
## $ htwt2 <dbl> 19.7232, 19.9481, 17.7154, 27.9155, 18.0662, 24.3418, 20.7476…
## $ bmi1 <dbl> 20.9943, 21.0828, 21.0405, 23.0125, 20.7169, 22.0804, 21.3861…
## $ bmi2 <dbl> 20.8726, 20.9519, 20.1210, 23.3043, 20.2583, 22.3454, 21.2270…
## $ cohort <chr> "younger", "younger", "younger", "younger", "younger", "young…
## $ zygosity <fct> MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, M…
## $ age1 <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
## $ age2 <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
As we can see, the dataset contains 3808 observations of 16 variables. Most of these variables include two entries, one for twin 1 and one for twin 2. The zygosity
variable indicates the zygosity of the twins, while other variables represent different measurements or characteristics of the twins, and denote which twin they refer to by the suffixes 1
and 2
.
We can also see that the dataset contains a mix of numeric and character variables. To improve the readability of the dataset, we are going to split the information from the zygosity variable into two separate variables: sex
and zyg
. This will make it easier to analyze and visualize the data later on.
df_wide <- df_wide %>% mutate(sex =
case_when(zygosity %in% c("MZFF","DZFF") ~ "F",
zygosity %in% c("MZMM","DZMM") ~ "M",
TRUE ~ "OS"),
zyg = case_when(zygosity %in% c("MZFF","MZMM") ~ "MZ",
zygosity %in% c("DZFF","DZMM","DZOS") ~ "DZ",
TRUE ~ NA_character_ ))
The provided R code uses the tidyverse
package's dplyr
library to manipulate a data frame named df_wide
. The mutate
function is used to create new variables or modify existing ones within the data frame. (In this case it is creating sex
and over writing zyg
. Here's an annotation of what each part of the code is doing:
- Data Frame Assignment:
df_wide <- df_wide %>%
- This line indicates that we are taking the existing data frame
df_wide
and using the%>%
operator (pipe) to pass it through additional functions. The result will be stored back into thedf_wide
.
- Mutate Function:
mutate(sex = ..., zyg = ...)
- The
mutate
function is used to add new columns to the data frame or change existing ones. In this case, one new columnsex
is being added and the existing columnzyg
is being modified.
- Creating
sex
Column:
sex = case_when(...)
creates a new column named sex based on conditions applied to the zygosity column. The case_when function is similar to a series of if-else statements. For each row, it checks the conditions in order and assigns a value to sex based on the first matching condition.
- Conditions for
sex
Column:
-
zygosity %in% c("MZFF","DZFF") ~ "F"
condition checks if thezygosity
value is either "MZFF" or "DZFF". Iftrue
, "F" (Female) is assigned to the sex column. -
zygosity %in% c("MZMM","DZMM") ~ "M"
checks if the zygosity value is either "MZMM" or "DZMM". If true, "M" (Male) is assigned to the sex column. -
TRUE ~ "OS"
is a catch-all condition that assigns "OS" (Other or Unknown sex) if none of the above conditions are met.
- Creating
zyg
Column:
- Similar to the sex column,
zyg = case_when(...)
overwrites an old columnzyg
that categorizes zygosity into broader categories.
- Conditions for
zyg
Column:
-
zygosity %in% c("MZFF","MZMM") ~ "MZ"
checks if zygosity is either "MZFF" or "MZMM" and assigns "MZ" (Monozygotic) indicating identical twins. -
zygosity %in% c("DZFF","DZMM","DZOS") ~ "DZ"
checks if zygosity falls into any of "DZFF", "DZMM", or "DZOS", assigning "DZ" (Dizygotic) indicating fraternal twins. -
TRUE ~ NA_character_
assigns a missing value (NA
) if none of the above conditions are met, possibly used to handle cases where zygosity data is not clearly defined or is missing.
This code effectively categorizes individuals in the dataset based on zygosity into more workable groups for sex and twin status, which can (and will) be useful for further analysis and visualization.
Let's calculate summary statistics for numeric variables across the full sample. This will provide a quick overview of central tendencies and variability in the dataset. When working with wide form data, it is often helpful to start with summarizing by the data structure you already have. In this case, we will calculate summary statistics by specific twin (twin 1 or twin 2).
Calculate summary statistics for numeric variables across the full sample. This helps provide a quick overview of central tendencies and variability.
# Calculate summary statistics for numeric variables across the full sample
summary_stats <- df_wide %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
min = ~min(., na.rm = TRUE),
max = ~max(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE)
), .names = "{col}_{fn}")) %>%
pivot_longer(
cols = everything(),
names_to = c("variable", "statistic"),
names_sep = "_"
) %>%
pivot_wider(
names_from = statistic,
values_from = value
)
summary_stats
## # A tibble: 13 × 7
## variable mean sd median min max IQR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fam 1904. 1099. 1904. 1 3808 1904.
## 2 age 34.5 14.2 30 17 88 19
## 3 part 1.93 0.265 2 0 2 0
## 4 wt1 62.2 10.9 60 34 111 15
## 5 wt2 65.6 12.2 65 35 127 16
## 6 ht1 1.66 0.0896 1.65 1.34 1.98 0.130
## 7 ht2 1.69 0.0990 1.7 1.40 1.99 0.140
## 8 htwt1 22.5 3.16 21.9 13.3 41.9 3.84
## 9 htwt2 22.8 3.19 22.3 14.2 46.2 3.78
## 10 bmi1 21.7 0.940 21.6 18.1 26.1 1.21
## 11 bmi2 21.8 0.939 21.7 18.6 26.8 1.18
## 12 age1 34.5 14.2 30 17 88 19
## 13 age2 34.5 14.2 30 17 88 19
As you can see in the table above, the summary statistics provide a quick overview of the central tendencies and variability in the dataset for numeric variables. The table includes the mean, standard deviation, median, minimum, maximum, and interquartile range (IQR) for each numeric variable. But they do so by the twin number, which is useful for our purposes, but... isn't the only descriptive information we'll need. We will need to calculate these statistics for the full sample, not by twin number. But we'll still going to start with the twin number, and then we'll calculate the summary statistics for the full sample.
Create frequency tables for categorical variables like zygosity and sex, providing a clear picture of the distribution of these categories within the dataset. These numbers are easier to calculation when data are wide like this because we're actually interested in the dyad, not the individual.
# Counting 'zygosity' and calculating percentages
zygosity_summary <- df_wide %>%
count(zyg, name = "count") %>%
mutate(percentage = count / sum(count) * 100) %>%
rename(category = zyg) %>% # Renaming the column for clarity
mutate(variable = "zygosity") # Adding a descriptor column for the variable
# Counting 'sex' and calculating percentages
sex_summary <- df_wide %>%
count(sex, name = "count") %>%
mutate(percentage = count / sum(count) * 100) %>%
rename(category = sex) %>% # Renaming the column for clarity
mutate(variable = "sex") # Adding a descriptor column for the variable
# Combining both summaries into a single dataframe
combined_summary <- bind_rows(zygosity_summary, sex_summary) %>%
select(variable, category, everything()) # Reordering columns for clarity
combined_summary
## variable category count percentage
## 1 zygosity DZ 2009 52.75735
## 2 zygosity MZ 1799 47.24265
## 3 sex F 1983 52.07458
## 4 sex M 919 24.13340
## 5 sex OS 906 23.79202
As you can see in the table above, the frequency tables provide a clear picture of the distribution of categories within the dataset for the zygosity and sex. Interestingly, there are many more same sex female twins, which make up 1983/3808 (52%) of the dataset.
What if you want to examine these wideform by cohort? You can use the group_by
function to group the data by cohort and then calculate the summary statistics for each cohort.
library(tidyverse)
# Grouping by 'cohort' and calculating summary statistics for each group across values that are numeric
df_summary <- df_wide %>%
group_by(cohort) %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE)
), .names = "{.col}_{.fn}")) %>%
# Ensure only statistic columns are being pivoted
pivot_longer(
cols = -cohort, # Exclude 'cohort' from pivoting
names_to = c("variable", "statistic"),
names_sep = "_"
)
# lock in the variable order
variable_order <- unique(df_summary$variable)
df_summary <- df_summary %>%
pivot_wider(
names_from = statistic,
values_from = value,
values_fn = list(value = mean) # Aggregate if necessary
)
df_summary %>% mutate(variable = factor(variable, levels = variable_order)) %>%
arrange(variable, cohort)
## # A tibble: 26 × 6
## cohort variable mean sd median IQR
## <chr> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 older fam 2859 548. 2859 949
## 2 younger fam 955 551. 955 954
## 3 older age 45.4 12.2 42 19
## 4 younger age 23.5 3.66 23 7
## 5 older part 1.95 0.224 2 0
## 6 younger part 1.92 0.299 2 0
## 7 older wt1 63.8 11.2 62 15
## 8 younger wt1 60.5 10.3 59 14
## 9 older wt2 67.0 12.7 66 19
## 10 younger wt2 64.1 11.6 64 17
## # ℹ 16 more rows
As you can see from the table, the summary statistics are calculated for each cohort across the numeric variables. This provides a quick overview of the central tendencies and variability in the dataset for each cohort. Now, these data are still in wide form, but we can easily convert them to long form if needed.
What about descriptive statistics by zygosity and sex?
library(tidyverse)
# Grouping by 'zyg,sex' and calculating summary statistics for each group
df_summary <- df_wide %>%
group_by(zyg,sex) %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE)
), .names = "{.col}_{.fn}")) %>%
# Ensure only statistic columns are being pivoted
pivot_longer(
cols = -c(zyg,sex), # Exclude 'zygosity' from pivoting
names_to = c("variable", "statistic"),
names_sep = "_"
)
## `summarise()` has grouped output by 'zyg'. You can override using the `.groups`
## argument.
# lock in the variable order
variable_order <- unique(df_summary$variable)
df_summary <- df_summary %>%
pivot_wider(
names_from = statistic,
values_from = value,
values_fn = list(value = mean) # Aggregate if necessary
)
df_summary %>% mutate(variable = factor(variable, levels = variable_order)) %>%
arrange(variable,zyg,sex)
## # A tibble: 65 × 7
## # Groups: zyg [2]
## zyg sex variable mean sd median IQR
## <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 DZ F fam 2109. 1029. 2891 2048
## 2 DZ M fam 2144. 1009. 1370. 2042.
## 3 DZ OS fam 2509. 980. 1852. 1956.
## 4 MZ F fam 1338. 992. 1956. 1956.
## 5 MZ M fam 1750. 1010. 2583 2013
## 6 DZ F age 35.4 14.3 32 20
## 7 DZ M age 32.3 13.9 28 16
## 8 DZ OS age 32.9 13.9 28 17
## 9 MZ F age 35.7 14.3 32 21
## 10 MZ M age 34.4 14.1 31 18
## # ℹ 55 more rows
As you can see from the table, the summary statistics are calculated for each zygosity and sex across the numeric variables.
Histograms are a great way to visualize the distribution of a single variable. Here, we will create histograms for the weight of twin 1 and twin 2.
ggplot(df_wide, aes(x = wt1)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(x = "Weight of Twin 1", y = "Frequency", title = "Distribution of Weight for Twin 1") +
theme_minimal()
## Warning: Removed 97 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(df_wide, aes(x = wt2)) +
geom_histogram(bins = 30, fill = "red", color = "black") +
labs(x = "Weight of Twin 2", y = "Frequency", title = "Distribution of Weight for Twin 2") +
theme_minimal() + facet_wrap(~zyg)
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_bin()`).
Density Plot of Weight for Twin 1
ggplot(df_wide, aes(x = wt1, fill = zyg)) +
geom_density(alpha = 0.5) +
labs(x = "Weight of Twin 1", y = "Density", title = "Density Plot of Weight for Twin 1 by Zygosity") +
scale_fill_viridis_d(option = "viridis", begin = 0.1, end = 0.85) +
theme_minimal()
## Warning: Removed 97 rows containing non-finite outside the scale range
## (`stat_density()`).
Density Plot of Weight for Twin 2
ggplot(df_wide, aes(x = wt2, fill = zyg)) +
geom_density(alpha = 0.5) +
labs(x = "Weight of Twin 2", y = "Density", title = "Density Plot of Weight for Twin 2 by Zygosity") +
scale_fill_viridis_d(option = "viridis", begin = 0.1, end = 0.85) +
theme_minimal() + facet_wrap(~sex)
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_density()`).
Box Plot of Weight for Twin 1
ggplot(df_wide, aes(x = zyg, y = wt1, fill = zyg)) +
geom_boxplot() +
labs(x = "Zygosity", y = "Weight of Twin 1", title = "Box Plot of Weight for Twin 1 by Zygosity") +
scale_fill_viridis_d(option = "viridis", begin = 0.1, end = 0.85) +
theme_minimal()
## Warning: Removed 97 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Box Plot of Weight for Twin 2
ggplot(df_wide, aes(x = zyg, y = wt2, fill = zyg)) +
geom_boxplot() +
labs(x = "Zygosity", y = "Weight of Twin 2", title = "Box Plot of Weight for Twin 2 by Zygosity") +
scale_fill_viridis_d(option = "viridis", begin = 0.1, end = 0.85) +
theme_minimal()
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Paired Boxplot of Weights by Zygosity
ggplot(df_wide, aes(x = zyg, y = wt1, fill = zyg)) +
geom_boxplot() +
geom_boxplot(aes(y = wt2), color = "red", fill = NA) +
labs(x = "Zygosity", y = "Weight", title = "Paired Boxplot of Weights by Zygosity") +
theme_minimal()
## Warning: Removed 97 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Basic Scatter Plot of weight of Twin 1 vs. weight of Twin 2
p <- ggplot(df_wide, aes(x=wt1, y=wt2, color=zyg)) +
geom_point(alpha=.5) +
labs(x = "Weight of Twin 1",
y = "Weight of Twin 2",
title = "Scatterplot of weight by Zygosity") +
scale_color_viridis_d(option = "virdis",
begin = 0.1,end=.85) +
theme_minimal()
p
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
Adding a regression line to the scatter plot.
p + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 173 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
library(ggplot2)
library(ggExtra)
# Create marginal density plots for x and y axes
p_x <- ggplot(df_wide, aes(x = wt1, fill = zyg)) +
geom_density(alpha = 0.5) +
theme_minimal() +
scale_fill_viridis_d(option = "viridis", begin = 0.1, end = 0.85) +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
p_x
## Warning: Removed 97 rows containing non-finite outside the scale range
## (`stat_density()`).
p_y <- ggplot(df_wide, aes(x = wt2, fill = zyg)) +
geom_density(alpha = 0.5) +
scale_fill_viridis_d(option = "viridis", begin = 0.1, end = 0.85) +
coord_flip() +
theme_minimal() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p_y
## Warning: Removed 86 rows containing non-finite outside the scale range
## (`stat_density()`).
# with marginal histogram
p1 <- ggMarginal(p, type="histogram")
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
p1
# marginal density
p2 <- ggMarginal(p, type="density")
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
p2
# marginal boxplot
p3 <- ggMarginal(p, type="boxplot")
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning in viridisLite::viridis(n, alpha, begin, end, direction, option):
## Option 'virdis' does not exist. Defaulting to 'viridis'.
## Warning: Removed 173 rows containing missing values or values outside the scale range
## (`geom_point()`).
p3
library(ggcorrplot)
corr <- round(cor(df_wide %>% select(where(is.numeric))), 2)
ggcorrplot(corr, type = "lower", lab = TRUE, lab_size = 3, method = "circle",
colors = c("tomato2", "white", "springgreen3"), title = "Correlation Matrix of Twin Data", ggtheme = theme_bw)
In contrast to wide form data, long form data, also known as "tidy data," structures datasets where each row represents a single observation, and each column represents a variable. This format is highly beneficial for statistical modeling and data analysis because it simplifies the application of various data manipulation and analysis functions. As before, we'll use the twinData
dataset from the OpenMx
package, but we'll convert it to long form to illustrate handling and analyzing data in this format.
library(tidyverse)
library(NlsyLinks)
library(discord)
library(BGmisc)
library(OpenMx)
library(conflicted) # to handle conflicts
conflicted::conflicts_prefer(OpenMx::vech,dplyr::filter) # Resolve conflicts
data(twinData)
df_long <- twinData %>% select(-age)
# Convert wide data to long form
df_long <- df_long %>%
pivot_longer(
cols = matches('1$|2$'), # Select columns ending in '1' or '2'
cols_vary = "slowest", # Specify that the columns are in the same order for each twin
names_to = c(".value", "twin"), # Split the column names into variable and twin number
names_pattern = "(.*)(1|2)" # Capture the variable and twin number
)
Let's take a look at the structure of the dataset now that it's in long form.
class(df_long)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(df_long)
## Rows: 7,616
## Columns: 11
## $ fam <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ zyg <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ part <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ cohort <chr> "younger", "younger", "younger", "younger", "younger", "young…
## $ zygosity <fct> MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, M…
## $ twin <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ wt <int> 58, 54, 55, 66, 50, 60, 65, 40, 60, 76, 48, 70, 51, 53, 58, 4…
## $ ht <dbl> 1.7000, 1.6299, 1.6499, 1.5698, 1.6099, 1.5999, 1.7500, 1.559…
## $ htwt <dbl> 20.0692, 20.3244, 20.2020, 26.7759, 19.2894, 23.4375, 21.2245…
## $ bmi <dbl> 20.9943, 21.0828, 21.0405, 23.0125, 20.7169, 22.0804, 21.3861…
## $ age <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
The dataset now contains a much larger number of observations, as each twin's data is represented as a separate row. The zygosity variable indicates the zygosity of the twins, while other variables are split into variable and value pairs, reflecting measurements or characteristics of the twins.
To facilitate analyses that depend on sex or zygosity type, we'll add these as new columns derived from zygosity.
# Add 'sex' and 'zyg' columns based on 'zygosity'
df_long <- df_long %>%
mutate(sex = case_when(
zygosity %in% c("MZFF", "DZFF") ~ "F",
zygosity %in% c("MZMM", "DZMM") ~ "M",
TRUE ~ "OS"
),
zyg = case_when(
zygosity %in% c("MZFF", "MZMM") ~ "MZ",
zygosity %in% c("DZFF", "DZMM", "DZOS") ~ "DZ",
TRUE ~ NA_character_
))
Unfortunately, the data do not contain the information for the gender for each twin, so we will just have to settle for noting that the data is missing.
Once again, let's calculate summary statistics for numeric variables across the full sample. This will provide a quick overview of central tendencies and variability in the dataset. When working with long form data, it is often helpful to start with summarizing by the data structure you already have. In this case, we will calculate summary statistics by specific measurement across all twins.
Data Structure and Summary Statistics Examine the transformed data and calculate summary statistics similar to those performed in the wide form.
summary_stats_long <- df_long %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
min = ~min(., na.rm = TRUE),
max = ~max(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE)
), .names = "{col}_{fn}")) %>%
pivot_longer(
cols = everything(),
names_to = c("variable", "statistic"),
names_sep = "_"
) %>%
pivot_wider(
names_from = statistic,
values_from = value
)
summary_stats_long %>%
print(n = 15) # to see more rows
## # A tibble: 7 × 7
## variable mean sd median min max IQR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fam 1904. 1099. 1904. 1 3808 1904.
## 2 part 1.93 0.265 2 0 2 0
## 3 wt 63.9 11.7 62 34 127 16
## 4 ht 1.68 0.0958 1.68 1.34 1.99 0.150
## 5 htwt 22.6 3.18 22.2 13.3 46.2 3.83
## 6 bmi 21.8 0.941 21.7 18.1 26.8 1.20
## 7 age 34.5 14.2 30 17 88 19
Generate frequency tables for sex and zyg, paralleling the wide form analysis.
Paralleling the wide form analysis, let us create frequency tables for categorical variables like zygosity and sex. These tables should provide a clear picture of the distribution of these categories within the dataset. In some ways, these calculations are simpler in long form data because each row is already an individual observation.
# Counting 'zygosity' and calculating percentages
zygosity_summary_long <- df_long %>%
count(zyg, name = "count") %>%
mutate(percentage = count / sum(count) * 100) %>%
rename(category = zyg) %>% # Renaming the column for clarity
mutate(variable = "zygosity") # Adding a descriptor column for the variable
# Counting 'sex' and calculating percentages
sex_summary_long <- df_long %>%
count(sex, name = "count") %>%
mutate(percentage = count / sum(count) * 100) %>%
rename(category = sex) %>% # Renaming the column for clarity
mutate(variable = "sex") # Adding a descriptor column for the variable
# Combining both summaries into a single dataframe
combined_summary_long <- bind_rows(zygosity_summary_long, sex_summary_long) %>%
select(variable, category, everything()) # Reordering columns for clarity
combined_summary_long
## # A tibble: 5 × 4
## variable category count percentage
## <chr> <chr> <int> <dbl>
## 1 zygosity DZ 4018 52.8
## 2 zygosity MZ 3598 47.2
## 3 sex F 3966 52.1
## 4 sex M 1838 24.1
## 5 sex OS 1812 23.8
As you can see, the long form data structure allows for a straightforward calculation of frequency tables for categorical variables. The resulting tables provide a clear picture of the distribution, and it does not differ from the wide form analysis, as long as one remembers to that we're now looking at individual twins rather than pairs.
Throughout this document, we have embarked on an exploration of the evolving landscape of data visualization in the field of behavior genetics. By revisiting historical methodologies and embracing modern techniques, we have uncovered a rich tapestry of visual tools that enhance our understanding and communication of genetic data. The atlas of ggplot2
visualizations presented here not only serves as a practical guide but also as an inspiration for researchers to think creatively about how data can be visualized more effectively.