This repository was archived by the owner on Mar 20, 2018. It is now read-only.
forked from swcarpentry/r-novice-gapminder
-
Notifications
You must be signed in to change notification settings - Fork 19
/
Copy path02-project-intro.Rmd
214 lines (174 loc) · 8.8 KB
/
02-project-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
---
title: "Project Management With RStudio"
teaching: 20
exercises: 10
questions:
- "How can I manage my projects in R?"
objectives:
- Create self-contained projects in RStudio
keypoints:
- "Use RStudio to create and manage projects with consistent layout."
- "Treat raw data as read-only."
- "Treat generated output as disposable."
- "Separate function definition and application."
source: Rmd
---
```{r, include=FALSE}
source("../bin/chunk-options.R")
knitr_fig_path("02-")
```
## Introduction
The scientific process is naturally incremental, and many projects
start life as random notes, some code, then a manuscript, and
eventually everything is a bit mixed together.
<blockquote class="twitter-tweet"><p>Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.</p>— Vince Buffalo (@vsbuffalo) <a href="https://twitter.com/vsbuffalo/status/323638476153167872">April 15, 2013</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
Most people tend to organize their projects like this:

There are many reasons why we should *ALWAYS* avoid this:
1. It is really hard to tell which version of your data is
the original and which is the modified;
2. It gets really messy because it mixes files with various
extensions together;
3. It probably takes you a lot of time to actually find
things, and relate the correct figures to the exact code
that has been used to generate it;
A good project layout will ultimately make your life easier:
* It will help ensure the integrity of your data;
* It makes it simpler to share your code with someone else
(a lab-mate, collaborator, or supervisor);
* It allows you to easily upload your code with your manuscript submission;
* It makes it easier to pick the project back up after a break.
## A possible solution
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management
functionality. We'll be using this today to create a self-contained, reproducible
project.
> ## Challenge: Creating a self-contained project
>
> We're going to create a new project in RStudio:
>
> 1. Click the "File" menu button, then "New Project".
> 2. Click "New Directory".
> 3. Click "Empty Project".
> 4. Type in the name of the directory to store your project, e.g. "my_project".
> 5. If available, select the checkbox for "Create a git repository."
> 6. Click the "Create Project" button.
{: .challenge}
Now when we start R in this project directory, or open this project with RStudio,
all of our work on this project will be entirely self-contained in this directory.
## Best practices for project organization
Although there is no "best" way to lay out a project, there are some general
principles to adhere to that will make project management easier:
### Treat data as read only
This is probably the most important goal of setting up a project. Data is
typically time consuming and/or expensive to collect. Working with them
interactively (e.g., in Excel) where they can be modified means you are never
sure of where the data came from, or how it has been modified since collection.
It is therefore a good idea to treat your data as "read-only".
### Data Cleaning
In many cases your data will be "dirty": it will need significant preprocessing
to get into a format R (or any other programming language) will find useful. This
task is sometimes called "data munging". I find it useful to store these scripts
in a separate folder, and create a second "read-only" data folder to hold the
"cleaned" data sets.
### Treat generated output as disposable
Anything generated by your scripts should be treated as disposable: it should
all be able to be regenerated from your scripts.
There are lots of different ways to manage this output. I find it useful to
have an output folder with different sub-directories for each separate
analysis. This makes it easier later, as many of my analyses are exploratory
and don't end up being used in the final project, and some of the analyses
get shared between projects.
> ## Tip: Good Enough Practices for Scientific Computing
>
> [Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization:
>
> 1. Put each project in its own directory, which is named after the project.
> 2. Put text documents associated with the project in the `doc` directory.
> 3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory.
> 4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory.
> 5. Name all files to reflect their content or function.
>
{: .callout}
> ## Tip: ProjectTemplate - a possible solution
>
> One way to automate the management of projects is to install the third-party package, `ProjectTemplate`.
> This package will set up an ideal directory structure for project management.
> This is very useful as it enables you to have your analysis pipeline/workflow organised and structured.
> Together with the default RStudio project functionality and Git you will be able to keep track of your
> work as well as be able to share your work with collaborators.
>
> 1. Install `ProjectTemplate`.
> 2. Load the library
> 3. Initialise the project:
>
> ```{r, eval=FALSE}
> install.packages("ProjectTemplate")
> library("ProjectTemplate")
> create.project("../my_project", merge.strategy = "allow.non.conflict")
> ```
>
> For more information on ProjectTemplate and its functionality visit the
> home page [ProjectTemplate](http://projecttemplate.net/index.html)
{: .callout}
### Separate function definition and application
One of the more effective ways to work with R is to start by writing the code you want to run directly in an .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console.
When your project is in its early stages, the initial .R script file usually contains many lines
of directly executed code. As it matures, reusable chunks get pulled into their
own functions. It's a good idea to separate these functions into two separate folders; one
to store useful functions that you'll reuse across analyses and projects, and
one to store the analysis scripts.
> ## Tip: avoiding duplication
>
> You may find yourself using data or analysis scripts across several projects.
> Typically you want to avoid duplication to save space and avoid having to
> make updates to code in multiple places.
>
> In this case I find it useful to make "symbolic links", which are essentially
> shortcuts to files somewhere else on a filesystem. On Linux and OS X you can
> use the `ln -s` command, and on Windows you can either create a shortcut or
> use the `mklink` command from the windows terminal.
{: .callout}
### Save the data in the data directory
Now we have a good directory structure we will now place/save the data file in the `data/` directory.
> ## Challenge 1
> Download the gapminder data from [here](https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv).
>
> 1. Download the file (CTRL + S, right mouse click -> "Save as", or File -> "Save page as")
> 2. Make sure it's saved under the name `gapminder-FiveYearData.csv`
> 3. Save the file in the `data/` folder within your project.
>
> We will load and inspect these data later.
{: .challenge}
> ## Challenge 2
> It is useful to get some general idea about the dataset, directly from the
> command line, before loading it into R. Understanding the dataset better
> will come in handy when making decisions on how to load it in R. Use the command-line
> shell to answer the following questions:
> 1. What is the size of the file?
> 2. How many rows of data does it contain?
> 3. What kinds of values are stored in this file?
>
> > ## Solution to Challenge 2
> >
> > By running these commands in the shell:
> > ```{r ch2a-sol, engine='sh'}
> > ls -lh data/gapminder-FiveYearData.csv
> > ```
> > The file size is 80K.
> > ```{r ch2b-sol, engine='sh'}
> > wc -l data/gapminder-FiveYearData.csv
> > ```
> > There are 1705 lines. The data looks like:
> > ```{r ch2c-sol, engine='sh'}
> > head data/gapminder-FiveYearData.csv
> > ```
> {: .solution}
{: .challenge}
> ## Tip: command line in R Studio
>
> You can quickly open up a shell in RStudio using the **Tools -> Shell...** menu item.
{: .callout}
### Version Control
It is important to use version control with projects. Go [here](http://swcarpentry.github.io/git-novice/14-supplemental-rstudio/) for a good lesson which describes using Git with R Studio.