small edits after first review of Arran's edits
nsbatra committed Oct 14, 2024
1 parent ddcd5e9 commit 3eb5e8e
Showing 6 changed files with 128 additions and 77 deletions.
8 changes: 4 additions & 4 deletions new_pages/basics.qmd
@@ -471,7 +471,7 @@ On installation, R contains **"base"** packages and functions that perform commo

*Functions* are contained within **packages** which can be downloaded ("installed") to your computer from the internet. Once a package is downloaded, it is stored in your "library". You can then access the functions it contains during your current R session by "loading" the package.

*Think of R as your personal library*: When you download a package, your library gains a new book of functions, but each time you want to use a function in that book, you must borrow,"load", that book from your library.
*Think of R as your personal library*: When you download a package, your library gains a new book of functions, but each time you want to use a function in that book, you must borrow, "load", that book from your library.

In summary: to use the functions available in an R package, 2 steps must be implemented:

@@ -528,7 +528,7 @@ library(rio)
library(here)
```

To check whether a package in installed or loaded, you can view the Packages pane in RStudio. If the package is installed, it is shown there with version number. If its box is checked, it is loaded for the current session.
To check whether a package is installed or loaded, you can view the Packages pane in RStudio. If the package is installed, it is shown there with version number. If its box is checked, it is loaded for the current session.
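The same check can also be made in code. The following is a minimal sketch using only **base** R; the helper name `package_status()` is hypothetical (it is not part of any package), and **stats** is used only because it ships with R and is attached by default.

```r
# Report whether a package is installed in your library and whether it is
# loaded (attached) in the current session.
# `package_status()` is a hypothetical helper name used for illustration.
package_status <- function(pkg) {
  c(
    installed = pkg %in% rownames(installed.packages()),
    loaded    = paste0("package:", pkg) %in% search()
  )
}

package_status("stats")   # a base package, attached at start-up
package_status("rio")     # result depends on your own library and session
```

A package can be installed without being loaded, which corresponds to an unchecked box in the Packages pane.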

**Install from GitHub**

@@ -1015,7 +1015,7 @@ class(linelist$gender) # class should be character
Sometimes, a column will be converted to a different class automatically by R. Watch out for this! For example, if you have a vector or column of numbers, but a character value is inserted... the entire column will change to class character.

```{r}
num_vector <- c(1, 2 , 3, 4, 5) # define vector as all numbers
num_vector <- c(1, 2, 3, 4, 5) # define vector as all numbers
class(num_vector) # vector is numeric class
num_vector[3] <- "three" # convert the third element to a character
class(num_vector) # vector is now character class
@@ -1233,7 +1233,7 @@ You can think of it as saying "and then". Many functions can be linked together
Pipe operators were first introduced through the [magrittr package](https://magrittr.tidyverse.org/), which is part of tidyverse, and are written as `%>%`. R 4.1.0 introduced a **base** R pipe, written as `|>`. The two pipes behave the same way in most cases and can be used somewhat interchangeably. However, there are a few key differences.

* The `%>%` pipe allows you to pass multiple arguments.
* The `%>%` pipe lets you drop parenthesis when calling a function with no other arguments (i.e. drop vs drop()).
* The `%>%` pipe lets you drop parentheses when calling a function with no other arguments (i.e. `drop` vs `drop()`).
* The `%>%` pipe allows you to start a pipe with `.` to create a function in your linking of code.

For these reasons, we recommend the **magrittr** pipe, `%>%`, over the **base** R pipe, `|>`.
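To make these differences concrete, here is a small sketch. It assumes the **magrittr** package is installed and that you are running R 4.1.0 or later (needed for `|>`); the object names are made up for illustration.

```r
library(magrittr)   # provides %>%; the base pipe |> needs R 4.1.0 or later

x <- c(4, 9, 16)

# Same result with either pipe: take square roots, then sum them
total_magrittr <- x %>% sqrt() %>% sum()
total_base     <- x |> sqrt() |> sum()

# %>% lets you drop the parentheses; |> requires them
total_no_parens <- x %>% sqrt %>% sum

# Starting a pipe with `.` creates a reusable function
sum_of_roots <- . %>% sqrt() %>% sum()
sum_of_roots(x)
```

All four approaches return the same value here; they differ only in what syntax each pipe accepts.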
146 changes: 90 additions & 56 deletions new_pages/importing.qmd
@@ -5,22 +5,21 @@
knitr::include_graphics(here::here("images", "Import_Export_1500x500.png"))
```



In this page we describe ways to locate, import, and export files:

* Use of the **rio** package to flexibly `import()` and `export()` many types of files.
* Use of the **here** package to locate files relative to an R project root. To prevent complications from file paths that are specific to one computer.
* Use of the **rio** package to flexibly `import()` and `export()` many types of files.
* Use of R projects to store your data files, and locate them easily from any computer by using *relative file paths*
* Use of the **here** package to create the file paths
* Specific import scenarios, such as:
* Specific Excel sheets.
* Messy headers and skipping rows.
* From Google sheets.
* From data posted to websites.
* With APIs.
* Importing the *most recent* file.
* Manual data entry.
* R-specific file types such as RDS and RData.
* Exporting/saving files and plots.
* Specific Excel sheets
* Messy headers and skipping rows
* From Google sheets
* From data posted to websites
* With APIs
* Importing the *most recent* file
* Manual data entry
* R-specific file types such as RDS and RData
* Exporting/saving files and plots


```{r, include=FALSE}
@@ -30,10 +29,12 @@ pacman::p_load(
tidyverse) # data management, summary, and visualization
```



<!-- ======================================================= -->
## Overview

When you import a "dataset" into R, you are generally creating a new *data frame* object in your R environment and defining it as an imported file (e.g. Excel, CSV, TSV, RDS) that is located in your folder directories at a certain file path/address.
When you import a "dataset" into R, you are generally creating a new *data frame* object in your R environment and defining it as an imported file (e.g. Excel, CSV, TSV, RDS) which is located in your folder directories at a certain file path/address.

You can import/export many types of files, including those created by other statistical programs (SAS, STATA, SPSS). You can also connect to relational databases.

@@ -43,13 +44,22 @@ R even has its own data formats:
* An RData file (.Rdata) can be used to store multiple objects, or even a complete R workspace. Read more in [this section](#import_rdata).
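As a rough illustration of the difference between the two formats, the sketch below uses only **base** R functions (`saveRDS()`, `readRDS()`, `save()`, `load()`). The objects `cases` and `sites` and the temporary file paths are invented for the example.

```r
# Temporary files, so nothing is written into your project folders
rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

cases <- data.frame(case_id = c("A1", "A2"), age = c(34, 27))
sites <- c("Hospital A", "Hospital B")

# .rds: one object per file; you choose the object name when reading it back
saveRDS(cases, rds_file)
cases_copy <- readRDS(rds_file)

# .RData: several objects per file; load() restores them under their original names
save(cases, sites, file = rdata_file)
rm(cases, sites)
load(rdata_file)
```

After `load()`, both `cases` and `sites` are back in the environment without being assigned explicitly.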




<!-- ======================================================= -->
## The **rio** package {}

The R package we recommend is: **rio**. The name "rio" is an abbreviation of "R I/O" (input/output).

Its functions `import()` and `export()` can handle many different file types (e.g. .xlsx, .csv, .rds, .tsv). When you provide a file path to either of these functions (including the file extension like ".csv"), **rio** will read the extension and use the correct tool to import or export the file.

If used within an R project, an importing command can be as simple as:

```{r, eval=FALSE, warning=F, message=F}
linelist <- import("linelist_raw.xlsx")
```


The alternative to using **rio** is to use functions from many other packages, each of which is specific to a type of file. For example, `read.csv()` (**base** R), `read.xlsx()` (**openxlsx** package), and `write_csv()` (**readr** package). These alternatives can be difficult to remember, whereas using `import()` and `export()` from **rio** is easy.

**rio**'s functions `import()` and `export()` use the appropriate package and function for a given file, based on its file extension. See the end of this page for a complete table of which packages/functions **rio** uses in the background. It can also be used to import STATA, SAS, and SPSS files, among dozens of other file types.
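A short sketch of this extension-based behaviour is below. It assumes **rio** is installed; the data frame and the temporary file paths are invented for the example.

```r
library(rio)

# Temporary file paths, so the example does not touch your project folders
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

demo_data <- data.frame(case_id = c("A1", "A2", "A3"), age = c(34, 27, 51))

# The same two functions handle both formats; only the extension changes
export(demo_data, csv_file)
export(demo_data, rds_file)

from_csv <- import(csv_file)
from_rds <- import(rds_file)
```

Nothing in the commands names the file type explicitly; **rio** decides how to read and write from the `.csv` and `.rds` extensions.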
@@ -60,82 +70,101 @@ Import/export of shapefiles requires other packages, as detailed in the page on



## The **here** package {#here}
<!-- ======================================================= -->
## File paths

When importing or exporting data, you must provide a file path. You can do this in one of three ways:

1) Provide the "full" / "absolute" file path (*not recommended*)
2) Provide a "relative" file path (*recommended*)
3) Manual file selection

The package **here** and its function `here()` make it easy to tell R where to find and to save your files - in essence, it builds file paths.

Used in conjunction with an R project, **here** allows you to describe the location of files in your R project in relation to the R project's *root directory* (the top-level folder). This is useful when the R project may be shared or accessed by multiple people/computers. It prevents complications due to the unique file paths on different computers (e.g. `"C:/Users/Laura/Documents..."`) by "starting" the file path in a place common to all users (the R project root).

This is how `here()` works within an R project:
### "Absolute" file paths {.unnumbered}

* When the **here** package is first loaded within the R project, it places a small file called ".here" in the root folder of your R project as a "benchmark" or "anchor".
* In your scripts, to reference a file in the R project's sub-folders, you use the function `here()` to build the file path *in relation to that anchor*.
* To build the file path, write the names of folders beyond the root, within quotes, separated by commas, finally ending with the file name and file extension as shown below.
* `here()` file paths can be used for both importing and exporting.
Absolute or "full" file paths can be provided to functions like `import()` but they are "fragile" as they are unique to the user's specific computer and therefore *not recommended*.

For example, below, the function `import()` is being provided a file path constructed with `here()`.
Below is an example of a command using an absolute file path. On Laura's computer there is a folder called "analysis", a sub-folder "data", and within that a sub-folder "linelists", in which there is the .xlsx file of interest.

```{r, eval=F}
linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))
linelist <- import("C:/Users/Laura/Documents/analysis/data/linelists/linelist_raw.xlsx")
```

The command `here("data", "linelists", "ebola_linelist.xlsx")` is actually providing the full file path that is *unique to the user's computer*:
A few things to note about absolute file paths:

```
"C:/Users/Laura/Documents/my_R_project/data/linelists/ebola_linelist.xlsx"
```
* **Avoid using absolute file paths** as they will not work if the script is run on a different computer.
* Provide the file path within quotation marks. It is a "string" (character) value.
* Use *forward* slashes (`/`), as in the example above (note: this is *NOT* the default for Windows file paths).
* File paths that begin with double slashes (e.g. "//...") will likely **not be recognized by R** and will produce an error. Consider moving your work to a "named" or "lettered" drive that begins with a letter (e.g. "J:" or "C:"). See the page on [Directory interactions](directories.qmd) for more details on this issue.
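The slash direction can also be fixed in code rather than by hand. This is a minimal **base** R sketch; the path is the illustrative one used above and will not exist on your computer. Note that a backslash must itself be written as `\\` inside an R string, because a single `\` is the escape character.

```r
# A Windows-style path. Each backslash is doubled so that R reads it as a
# literal backslash rather than the start of an escape sequence.
win_path <- "C:\\Users\\Laura\\Documents\\analysis\\data\\linelists\\linelist_raw.xlsx"

# Replace every backslash with a forward slash
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path

# Check whether the file exists before trying to import it
file.exists(r_path)
```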

The beauty is that the R command using `here()` can be successfully run on any computer accessing the R project.
One scenario where absolute file paths may be appropriate is when you want to import a file from a shared drive that has the same full file path for all users.

<span style="color: darkgreen;">**_TIP:_** To quickly convert all `\` to `/`, highlight the code of interest, use Ctrl+f (in Windows), check the option box for "In selection", and then use the replace functionality to convert them.</span>

<span style="color: darkgreen;">**_TIP:_** If you are unsure where the “.here” root is set to, run the function `here()` with empty parentheses.</span>

Read more about the **here** package [at this link](https://here.r-lib.org/).

### R Projects and "relative" file paths {.unnumbered}

In R, "relative" file paths consist of the file path *relative to* the root of an **R project**. This allows for simpler commands which can be run from different computers (e.g. if the R project is on a shared drive or is sent by email).

<!-- ======================================================= -->
## File paths
Let us assume that our work is in an R project that contains a sub-folder "data" and within that a sub-folder "linelists", in which there is the .xlsx file of interest.

When importing or exporting data, you must provide a file path. You can do this one of three ways:
The "absolute" file path could be:

1) *Recommended:* provide a "relative" file path with the **here** package.
2) Provide the "full" / "absolute" file path.
3) Manual file selection.
```
"C:/Users/Laura/Documents/my_R_project/data/linelists/linelist_raw.xlsx"
```

By working within an R project, we can import the data by simply writing this command:

```{r, eval=FALSE, warning=F, message=F}
linelist <- import("data/linelists/linelist_raw.xlsx")
```

### "Relative" file paths {.unnumbered}
Because we are using an R project, R knows to begin its search for the file in the project's folder. Then the command tells it to look in the "data" folder, and then the "linelists" folder, and to find the dataset.
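If a relative path does not seem to work, it can help to see where R is actually looking. The sketch below uses only **base** R and assumes the working directory is the project folder, which is the case when the R project is opened in RStudio.

```r
# Build the relative path piece by piece; file.path() joins with "/"
relative_path <- file.path("data", "linelists", "linelist_raw.xlsx")
relative_path

# The folder R starts from (the project folder when working in an R project)
getwd()

# The full location R will search when given the relative path
file.path(getwd(), relative_path)

# TRUE only if the file really exists at that location
file.exists(relative_path)
```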

In R, "relative" file paths consist of the file path *relative to* the root of an R project. They allow for more simple file paths that can work on different computers (e.g. if the R project is on a shared drive or is sent by email). As described [above](#here), relative file paths are facilitated by use of the **here** package.

An example of a relative file path constructed with `here()` is below. We assume the work is in an R project that contains a sub-folder "data" and within that a subfolder "linelists", in which there is the .xlsx file of interest.

```{r, eval=F}
linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))
```
### The **here** package {#here}

The importing command can be improved by building the relative file path via the package **here** and its function `here()`.

The file path to the .xlsx file can be created with the below command. Note how each folder *after the R project*, and the file name itself, is listed within quotation marks and separated by commas.

### "Absolute" file paths {.unnumbered}
```{r, eval=F, message = F, warning = F}
here("data", "linelists", "linelist_raw.xlsx")
```

Absolute or "full" file paths can be provided to functions like `import()` but they are "fragile" as they are unique to the user's specific computer and therefore *not recommended*.
Because this command is run within an R Project, it will **return** a full, absolute file path that is *adapted to the user's computer*, such as below.

```
"C:/Users/Laura/Documents/my_R_project/data/linelists/linelist_raw.xlsx"
```

Below is an example of an absolute file path, where in Laura's computer there is a folder "analysis", a sub-folder "data" and within that a sub-folder "linelists", in which there is the .xlsx file of interest.
The final step is to *nest the `here()` command within the `import()` function*, like this:

```{r, eval=F}
linelist <- import("C:/Users/Laura/Documents/analysis/data/linelists/ebola_linelist.xlsx")
linelist <- import(here("data", "linelists", "linelist_raw.xlsx"))
```

A few things to note about absolute file paths:
There are several benefits to using `here()` within `import()`:

1) `here()` becomes very important when creating *automated reports* with R Markdown or Quarto
2) `here()` frees you from worrying about the slash direction (see example above)



<span style="color: darkgreen;">**_TIP:_** If you are unsure where the "here" root is set to, run the function `here()` with empty parentheses and then begin building the command.</span>

Read more about the **here** package [at this link](https://here.r-lib.org/).
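Putting these pieces together, a cautious version of the import could look like the sketch below. It assumes **here** and **rio** are installed and that the code is run inside an R project; the folder and file names are the illustrative ones from this page, so the file may not exist on your computer.

```r
library(here)

# Show the folder that here treats as the project root
here()

# Build the full path from the root, one folder or file name per argument
file_path <- here("data", "linelists", "linelist_raw.xlsx")

# Import only if the file is really there; otherwise report where R looked
if (file.exists(file_path)) {
  linelist <- rio::import(file_path)
} else {
  message("No file found at: ", file_path)
}
```

Checking `file.exists()` first is optional, but the message makes it easier to spot a wrong root or a misspelled folder name.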






* **Avoid using absolute file paths** as they will not work if the script is run on a different computer.
* Use *forward* slashes (`/`), as in the example above (note: this is *NOT* the default for Windows file paths).
* File paths that begin with double slashes (e.g. "//...") will likely **not be recognized by R** and will produce an error. Consider moving your work to a "named" or "lettered" drive that begins with a letter (e.g. "J:" or "C:"). See the page on [Directory interactions](directories.qmd) for more details on this issue.

One scenario where absolute file paths may be appropriate is when you want to import a file from a shared drive that has the same full file path for all users.

<span style="color: darkgreen;">**_TIP:_** To quickly convert all `\` to `/`, highlight the code of interest, use Ctrl+f (in Windows), check the option box for "In selection", and then use the replace functionality to convert them.</span>



@@ -159,6 +188,11 @@ my_data <- import(file.choose())








## Import data

Using `import()` to import a dataset is simple: provide the path to the file (including the file name and file extension) in quotes. If using `here()` to build the file path, follow the instructions above. Below are a few examples:
@@ -195,7 +229,7 @@ By default, if you provide an Excel workbook (.xlsx) to `import()`, the workbook
my_data <- import("my_excel_file.xlsx", which = "Sheetname")
```

If using the `here()` method to provide a relative pathway to `import()`, you can still indicate a specific sheet by adding the `which = ` argument after the closing parentheses of the `here()` function.
If using the `here()` method to create the file path, you can still indicate a specific sheet by adding the `which = ` argument after the closing parenthesis of the `here()` function.

```{r import_sheet_here, eval=F}
# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
@@ -249,7 +283,7 @@ Unfortunately `skip = ` only accepts one integer value, *not* a range (e.g. "2:1

Sometimes, your data may have a *second* row, for example if it is a "data dictionary" row as shown below. This situation can be problematic because it can result in all columns being imported as class "character".
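The effect can be reproduced with a tiny in-memory CSV. This sketch uses **base** R `read.csv()` rather than `import()` so that it runs without any files; the column names and values are invented for illustration.

```r
# A small CSV whose second row is a "data dictionary" row
csv_text <- "case_id,age\nidentifier,years\nA1,34\nA2,27"

# Imported as-is, the dictionary row forces every column to character
raw <- read.csv(text = csv_text)
sapply(raw, class)

# Read only the true column names from the first row
col_names <- names(read.csv(text = csv_text, nrows = 1))

# Import again, skipping both the header and the dictionary row,
# and re-apply the saved column names
clean <- read.csv(text = csv_text, skip = 2, header = FALSE,
                  col.names = col_names)
sapply(clean, class)
```

In `clean`, the `age` column is numeric again because the text value "years" is no longer part of it.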

```{r, echo=F, waring = F, message = F}
```{r, echo=F, warning = F, message = F}
# HIDDEN FROM READER
####################
# Create second header row of "data dictionary" and insert into row 2. Save as new dataframe.
@@ -314,7 +348,7 @@ linelist_raw <- import("linelist_raw.xlsx",
**For CSV files:** (`col.names = `)

```{r, eval=F}
# import first time; sotre column names
# import first time; save column names
linelist_raw_names <- import("linelist_raw.csv") %>%
names() # save true column names