This repo contains materials related to teaching "R Programming for Behavioral Scientists" - taught by Prof Curley - UT Austin, Psychology, Fall 2018.
The materials are designed to be used during class with instruction by Prof Curley, but can also be read on their own.
To access materials for class, you can fork the repo, clone/download the repo using the large green button, or you can download individual folders using the DownGit tool here: https://minhaskamal.github.io/DownGit/
Aim: The first week will introduce the R programming language. Students will become familiar with basic programming concepts such as data structures, functions, and objects. Students will learn basic R syntax and become familiar with commonly used base R functions.
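As a rough illustration of the concepts named above (data structures, functions, and objects), here is a minimal base R sketch. The values and the names `reaction_times`, `participant`, `scores`, and `sem` are hypothetical and are not taken from the class materials.

```r
# Atomic vector: every element has the same type
reaction_times <- c(512, 430, 605, 478)   # hypothetical values in ms

# List: elements may differ in type
participant <- list(id = "P01", age = 21, rts = reaction_times)

# Data frame: a table with one row per observation
scores <- data.frame(
  id    = c("P01", "P02", "P03"),
  group = factor(c("control", "treatment", "control")),
  score = c(12, 18, 15)
)

# A user-defined function (hypothetical helper): standard error of the mean
sem <- function(x) {
  sd(x) / sqrt(length(x))
}

mean_rt <- mean(reaction_times)
sem_rt  <- sem(reaction_times)

str(scores)          # inspect the structure of an object
class(participant)   # "list"
```

Everything assigned with `<-` is an object; `str()` and `class()` are two commonly used base R functions for inspecting one.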
During the first class, students will also take a background skills and knowledge questionnaire, as well as a values questionnaire in which they will describe how much they value particular concepts such as reproducibility and data visualization.
Values Questionnaire: https://tinyurl.com/y8dzthga
Skills Questionnaire: https://tinyurl.com/y98ctpbh
Reading: Please read Chapters 3-6 of R. Cotton, Learning R: A Step-by-Step Function Guide to Data Analysis, O’Reilly, 2013.
Aim: Students will learn how to import data into and export data from RStudio. Students will learn how to summarize data and how to reorganize and work flexibly with raw data. Students will learn how to write code for visualizing and analyzing data that can be shared with other collaborators or users. Students will learn how to use R Markdown and R Notebooks to produce standalone PDF or web-hosted data analyses and visualizations, and how to use these methods to demonstrate their workflow to other users.
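The import, summarize, and reorganize steps can be sketched roughly as follows. This sketch uses base R functions only, not the tidyverse functions covered in the readings, and the data frame `raw` and the temporary file path are made up for illustration.

```r
# Hypothetical raw data: one row per participant per condition
raw <- data.frame(
  id        = rep(c("P01", "P02", "P03"), each = 2),
  condition = rep(c("A", "B"), times = 3),
  rt        = c(510, 480, 620, 590, 450, 470)
)

# Export to CSV and read it back in (a temporary file stands in for a real path)
csv_path <- tempfile(fileext = ".csv")
write.csv(raw, csv_path, row.names = FALSE)
imported <- read.csv(csv_path, stringsAsFactors = FALSE)

# Summarize: mean reaction time per condition
condition_means <- aggregate(rt ~ condition, data = imported, FUN = mean)

# Reorganize: long format (one row per observation) to wide (one row per id)
wide <- reshape(imported, idvar = "id", timevar = "condition", direction = "wide")

condition_means
wide
```

The wide table has one row per participant with columns `rt.A` and `rt.B`, which is the kind of reshaping that is often needed before a paired analysis.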
In class: The instructor will lead a discussion about worked examples and troubleshoot questions. After class, students will complete a data carpentry exercise to be submitted in week 3.
Reading: Chapters 5, 11 & 12 of G. Grolemund & H. Wickham, R for Data Science, O’Reilly, 2016. You can access this book online for free here: http://r4ds.had.co.nz/
Chapters 27, 29 & 30 of G. Grolemund & H. Wickham, R for Data Science, O’Reilly, 2016 (for R Markdown).
Please familiarize yourself with the material. Don't worry if you get lost with the code - we will work through it together in class. I would like you to get a sense of the purpose of being able to transform data in different ways using code.
Aim: Students will learn how to visualize and explore data using the ggplot2 package. Students will become familiar with how to choose the most appropriate data visualization for different data types by adjusting aesthetics and chart types. Students will learn Leland Wilkinson’s grammar of graphics approach to data visualization.
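A small sketch of the grammar of graphics idea, assuming the ggplot2 package is installed: a plot is built from data, aesthetic mappings, and geometric layers. The built-in `mtcars` dataset is used here purely for illustration; it is not one of the class datasets.

```r
library(ggplot2)

# mtcars ships with R; treat the number of cylinders as a categorical variable
cars <- transform(mtcars, cyl = factor(cyl))

# Data + aesthetic mappings + a geometric layer
p_scatter <- ggplot(cars, aes(x = wt, y = mpg, colour = cyl)) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_minimal()

# Same data, different chart type: a categorical x-axis suits a boxplot
p_box <- ggplot(cars, aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon")

print(p_scatter)
print(p_box)
```

Changing only the aesthetics (`aes()`) or only the geom is usually enough to move between chart types, which is the main point of the layered approach.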
Before Class II (week 4): Students will complete a makeover challenge assignment and submit it prior to the class. In this challenge, students will be given a poor visualization of data and asked either to improve it or to come up with different ways in which the data could be visually represented.
The following reading materials are for reference. Please have a look through to familiarize yourself for the class:
Overview of "grammar of graphics" principles:
Chapter 3 of G. Grolemund & H. Wickham, R for Data Science, O’Reilly, 2016. http://r4ds.had.co.nz/
Overview of good data visualization principles with reference to ggplot2:
Data Visualization for Social Science: A practical introduction with R and ggplot2 - Kieran Healy - http://socviz.co/
A free online book in the same vein:
Fundamentals of Data Visualization - Claus Wilke - https://serialmentor.com/dataviz/index.html
A very helpful blogpost containing many tips to making ggplot2 charts look beautiful:
Zev Ross blog post - http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/
The ggplot2 help pages are also a great guide.
The most up-to-date version is here: http://ggplot2.tidyverse.org/index.html
Makeover challenge for Week 5: The goal of this exercise is to 'improve' a data visualization that is loose in the wild. In particular, I want you to think about data visualization principles that we have discussed in class. What are the key messages to illustrate? What is the most effective type of plot for these data? How many/much color, annotation, text, lines or shapes should I include in the visualization? Alternatively, you may decide that the original visualization did not highlight important findings and you think there is a more interesting story in the data to show.
Which dataset/visualization should I choose to makeover?
- The best resource for this makeover challenge is the community website "#MakeoverMonday - A weekly social data project". If you go here: http://www.makeovermonday.co.uk/data/ you will find scores of raw datasets and links to articles that have included visualizations based on the dataset. Please choose one that interests you and start creating new visualizations! It may be that just one visualization is sufficient, or you may wish to make several based on the dataset of your choice. (Note: many of these challenges were taken up by Tableau users, who typically make more complex 'dashboard' or 'infographic' style visualizations. Don't be discouraged by this. We are interested largely in creating simple, elegant charts similar to those we'd use for research output.)
- Alternatively, you may wish to find your own data. You could retrieve this from anywhere, e.g. newspapers, fivethirtyeight's data - https://github.com/fivethirtyeight/data , or published research in open-access journal articles.
If you need any advice on what to choose - please ask Dr Curley. In week 4, students will present their visualization and design choices in class. Each person will give a quick 5-10 minute presentation on their choices. To help you - here is a link to one makeover that I completed myself:
https://raw.githubusercontent.com/jalapic/learnR/master/exercises/curley_makeover.R
Aim: To familiarize students with how to conduct basic statistical tests in R. We shall discuss (depending upon class need) testing normality and variances of distributions, paired and unpaired t-tests, one-way Analysis of Variance, two-way Analysis of Variance, Pearson and Spearman correlations, linear regression, logistic regression and generalized linear models.
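For orientation, most of the tests listed above have a direct base R counterpart. The sketch below runs each one on the built-in `mtcars` and `sleep` datasets; these datasets and the particular variable choices are assumptions made for illustration, not the examples used in class.

```r
# mtcars ships with R; am = transmission (0 = automatic, 1 = manual)
cars <- transform(mtcars, am = factor(am), cyl = factor(cyl))

normality  <- shapiro.test(cars$mpg)             # Shapiro-Wilk test of normality
variances  <- var.test(mpg ~ am, data = cars)    # F test comparing two variances
t_unpaired <- t.test(mpg ~ am, data = cars)      # unpaired (Welch) t-test

# Paired t-test: the sleep data record the same 10 subjects under two drugs
t_paired <- with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))

anova_one <- aov(mpg ~ cyl, data = cars)         # one-way ANOVA
anova_two <- aov(mpg ~ cyl * am, data = cars)    # two-way ANOVA with interaction

cor_pearson  <- cor.test(cars$mpg, cars$wt)      # Pearson correlation
cor_spearman <- cor.test(cars$mpg, cars$wt, method = "spearman", exact = FALSE)

fit_lm  <- lm(mpg ~ wt, data = cars)                        # linear regression
fit_glm <- glm(am ~ wt, data = cars, family = binomial)     # logistic regression (a GLM)

summary(anova_one)
summary(fit_lm)
```

Each call returns an object; `summary()` or simply printing the object shows the test statistic, degrees of freedom, and p-value.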
Reading: A useful reference resource for statistics with R is: Field, A., Miles, J., & Field, Z., 2012, Discovering Statistics Using R, Sage Publications. It’s a very big book, so I don’t expect you to read it for class, but if you need additional information, I recommend this book.
The goals of these two weeks are to:
- increase your knowledge of working with large, messy datasets in R
- improve data carpentry skills
- develop the ability to discern key information and narratives from raw data
- generate publication quality data visualizations
- produce reproducible data analysis
- work with others in a group to share skills and tasks collaboratively
Before the first class: Have a look at the datasets below and let me know which you would prefer to work on - this will help with assigning groups.
In the first class, groups of 2-3 will be assigned. The first step will be to explore the different datasets and become familiar with data issues.
Secondly, begin to explore and generate potential research questions. This step usually involves producing exploratory visualizations.
You may also wish to bring in related data from other datasets to help with your analysis.
Throughout this class, you can ask me questions if you get stuck. This will provide an opportunity for me to identify which tasks/skills we need to discuss in more detail.
The second class will provide an opportunity to finalize visualizations and code. These should be reproducible such that anybody can follow your analysis. Please post the finished work as a Gist. You are encouraged to work on the project during the week too!
Possible Datasets for Group Projects
- American National Election Studies datasets: http://www.electionstudies.org/studypages/download/datacenter_all_NoData.php (also see here for an example: https://raw.githubusercontent.com/jalapic/learnR/master/datasets/makeover/anes.csv )
- Federal Election Commission data: https://www.fec.gov https://www.fec.gov/data/ http://classic.fec.gov/finance/disclosure/ftpdet.shtml http://classic.fec.gov/portal/download.shtml (see here for an example: https://raw.githubusercontent.com/jalapic/learnR/master/datasets/makeover/CandidateSummaryAction.csv )
- Kaggle has many large datasets - https://www.kaggle.com/datasets - we will view these
- NEW: try Google's dataset search engine: https://toolbox.google.com/datasetsearch
- One of your own choice (e.g. large research project from your lab).
Aim: An important aspect of learning to code is learning how to independently acquire new skills and to troubleshoot problems. Students can independently learn skills from blog posts, video tutorials, Q&A sites such as StackOverflow, package vignettes etc. Students will be set the goal of learning an R package previously unfamiliar to them and presenting some analyses and visualizations based on this package to the class. Students will also develop skills in orally communicating their newly acquired skills to other students.
Before class: Students will select the R package either from a list provided by the instructor, or with the instructor’s approval. Students are expected to complete their analysis and visualizations and upload them to GitHub prior to class. In class, each student will present their analysis and visualization to the class, describing what they learnt, how they found the information to learn the package, what problems they encountered and how they overcame them.
Aim: Students will become familiar with the critical concept of open and transparent data analysis. They will understand why their code needs to be reproducible and that this is for the benefit of themselves, their co-workers or collaborators, reviewers and audience.
Before each class: Students will also choose a recently published academic paper or piece of research from the internet whose analysis they will attempt to reproduce. Students can choose from a list produced by the instructor or a piece of research of their own choice with the instructor’s approval. Before the second class, students will attempt to reproduce their chosen paper’s analyses and visualizations.
In class: The instructor will lead a discussion about the students' current views about reproducibility in data science and academia. Students will present whether or not it was possible to reproduce the published findings in academic research papers, and the implications of this exercise.
Reading: Munafo MR et al., 2017, A manifesto for reproducible science, Nature Human Behaviour 1: 0021. doi:10.1038/s41562-016-0021
Markowetz F, 2015, Five selfish reasons to work reproducibly, Genome Biology 16:274. doi:10.1186/s13059-015-0850-7
In this class we shall discuss the merits and benefits of version control. We will discuss how using Git and GitHub can facilitate open science research, and how we can manage our projects more effectively.
We shall follow the tutorial that I wrote in this paper:
Vuorre M & Curley JP, 2018, Curating Research Assets: A Tutorial on the Git Version Control System, Advances in Methods and Practices in Psychological Science.
It can be accessed in preprint form here - https://mvuorre.github.io/pdf/2017-Vuorre-Curley.pdf
Aim: Students will continue to develop best standard practices for data analysis and reproducible research. Students will learn how to develop simple R packages and how to submit their final package to GitHub and/or CRAN.
In the first class (Week 12), students will be assigned to pairs and will develop their idea for a package. Students will discuss their package development with other students and troubleshoot with the instructor. Before the second class (Week 13), students will submit their final package to GitHub and/or CRAN. During the second class, students will orally present their final packages to other students.
Hilary Parker blog - https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
Karl Broman blog - http://kbroman.org/pkg_primer/
H. Wickham, R Packages, O’Reilly, 2015.
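To give a sense of scale, the core of a simple package is often just a few documented functions saved in the package's `R/` directory. The sketch below shows one such function with roxygen2-style comments; the function name `standard_error` is a hypothetical example and not part of any assigned package.

```r
#' Standard error of the mean
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be removed before computing?
#' @return A single numeric value.
#' @export
standard_error <- function(x, na.rm = FALSE) {
  stopifnot(is.numeric(x))          # fail early on non-numeric input
  if (na.rm) x <- x[!is.na(x)]      # optionally drop missing values
  sd(x) / sqrt(length(x))
}

standard_error(c(2, 4, 4, 6))
standard_error(c(1, 2, NA, 3), na.rm = TRUE)
```

The `#'` comments are plain comments when the file is run as a script; they only become help pages when a documentation tool such as roxygen2 processes the package.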
Aim: Students will produce a large final data analysis project based upon their own research or on publicly available datasets. Students are expected to produce a fully reproducible data project from raw data to final analyses and visualizations. Students will learn to communicate both orally and in writing the results of their data analyses to a target audience.
Before the class, students will develop their idea for their final data project and generate preliminary ideas about potentially interesting research questions. Students will work on their projects and submit the final project to the instructor, as well as publishing it online at RPubs.