Merge pull request #16 from gbdias/master
edit slides
gbdias authored Oct 28, 2024
2 parents caf62b3 + 01774ac commit f651566
Showing 3 changed files with 27 additions and 40 deletions.
34 changes: 10 additions & 24 deletions lab_dataframes.Rmd
@@ -132,7 +132,7 @@ X[] <- 0
as.vector(X)
```

7. In the earlier exercises, you created a vector with the names of the type Geno\_a\_1, Geno\_a\_2, Geno\_a\_3, Geno\_b\_1, Geno\_b\_2…, Geno\_s\_3 using vectors. In today's lecture, a function named `outer()` that generates matrices was mentioned. Try to generate the same vector as yesterday using this function instead. The `outer()` function is very powerful, but can be hard to wrap your head around, so try to follow the logic, perhaps by creating a simple example to start with.
7. In the earlier exercises, you created a vector with the names of the type Geno\_a\_1, Geno\_a\_2, Geno\_a\_3, Geno\_b\_1, Geno\_b\_2…, Geno\_s\_3 using vectors. In a previous lecture, a function named `outer()` that generates matrices was mentioned. Try to generate the same vector as before, but this time using `outer()`. This function is very powerful, but can be hard to wrap your head around, so try to follow the logic, perhaps by creating a simple example to start with.

```{r}
letnum <- outer(paste("Geno",letters[1:19], sep = "_"), 1:3, paste, sep = "_")
@@ -180,7 +180,7 @@ E.mm

# Dataframes

Even though vectors are at the very base of R usage, data frames are central to R as the most common way to import data into R (`read.table()`) creates a dataframe. Even though a dataframe can itself contain another dataframe, by far the most common dataframes consist of a set of equally long vectors. As dataframes can contain several different data types, the command `str()` is very useful to run on dataframes.
Even though vectors are at the very base of R usage, data frames are central to R as the most common way to import data into R (`read.table()`) creates a data frame. A data frame consists of a set of equally long vectors. As data frames can contain several different data types, the command `str()` is very useful to run on data frames.

```{r}
vector1 <- 1:10
@@ -194,7 +194,7 @@ In the above example, we can see that the dataframe **dfr** contains 10 observat

## Exercise

1. Figure out what is going on with the second column in the **dfr** dataframe described above and modify the creation of the dataframe so that the second column is stored as a character vector rather than a factor. Hint: Check the help for `data.frame` to find an argument that turns off the factor conversion.
1. Figure out what is going on with the second column in the **dfr** data frame described above and modify the creation of the data frame so that the second column is stored as a character vector rather than a factor. Hint: Check the help for `data.frame` to find an argument that turns off the factor conversion.

```{r,accordion=TRUE}
dfr <- data.frame(vector1, vector2, vector3, stringsAsFactors = FALSE)
@@ -215,27 +215,27 @@ dfr[dfr$vector3>0,2]
dfr$vector2[dfr$vector3>0]
```

4. Create a new vector combining the all columns of **dfr** separated by an underscore.
4. Create a new vector combining all columns of **dfr** and separate them by an underscore.

```{r,accordion=TRUE}
paste(dfr$vector1, dfr$vector2, dfr$vector3, sep = "_")
```
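
When a data frame has many columns, naming each one inside `paste()` becomes tedious. A minimal sketch of one alternative, assuming a small stand-in data frame (`toy` and its columns are hypothetical, not the **dfr** from the lab):

```{r}
# Hypothetical stand-in for dfr; names and values are illustrative only
toy <- data.frame(id = 1:3,
                  grp = c("a", "b", "c"),
                  val = c(0.5, -1.2, 2.0),
                  stringsAsFactors = FALSE)

# A data frame is a list of columns, so do.call() can pass every column
# to paste() as a separate argument, with sep added as a named argument
combined <- do.call(paste, c(toy, sep = "_"))
combined
```

This should give the same result as listing the columns by hand, and it keeps working if columns are added later.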

5. There is a dataframe of car information that comes with the base installation of R. Have a look at this data by typing `mtcars`. How many rows and columns does it have?
5. There is a data frame of car information that comes with the base installation of R. Have a look at this data by typing `mtcars`. How many rows and columns does it have?

```{r,accordion=TRUE}
dim(mtcars)
ncol(mtcars)
nrow(mtcars)
```

6. Re-arrange the row names of this dataframe and save as a vector.
6. Re-arrange (shuffle) the row names of this data frame and save as a vector.

```{r,accordion=TRUE}
car.names <- sample(row.names(mtcars))
```
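
`sample()` returns a different order on every call. If the shuffled order needs to be reproducible, one option is to fix the random seed first. A small sketch, where the seed value 42 is arbitrary and `shuffled` is just an illustrative name:

```{r}
set.seed(42)                          # arbitrary seed, chosen only for illustration
shuffled <- sample(row.names(mtcars))

# The shuffle is a permutation: same names, possibly in a different order
length(shuffled) == nrow(mtcars)
setequal(shuffled, row.names(mtcars))
```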

7. Create a dataframe containing the vector from the previous question and two vectors with random numbers named random1 and random2.
7. Create a data frame containing the vector from the previous question and two vectors with random numbers named random1 and random2.

```{r,accordion=TRUE}
random1 <- rnorm(length(car.names))
@@ -244,7 +244,7 @@ mtcars2 <- data.frame(car.names, random1, random2)
mtcars2
```

8. Now you have two dataframes that both contain information on a set of cars. A collaborator asks you to create a new dataframe with all this information combined. Create a merged dataframe ensuring that rows match correctly.
8. Now you have two data frames that both contain information on a set of cars. A collaborator asks you to create a new data frame with all this information combined. Create a merged data frame ensuring that rows match correctly.

```{r,accordion=TRUE}
mt.merged <- merge(mtcars, mtcars2, by.x = "row.names", by.y = "car.names")
@@ -332,7 +332,7 @@ list.2 <- list(vec1 = c("hi", "ho", "merry", "christmas"),
list.2
```

2. Here is a dataframe.
2. Here is a data frame.

```{r}
dfr <- data.frame(letters, LETTERS, letters == LETTERS)
@@ -369,18 +369,4 @@ lapply(list.a, FUN = "length")
```{r,accordion=TRUE}
lapply(X = list.a, FUN = "summary")
sapply(X = list.a, FUN = "summary")
```

# Extras

1. Design an S3 class that should hold information on human proteins. The data needed for each protein is:

- The gene that encodes it
- The molecular weight of the protein
- The length of the protein sequence
- Information on who discovered it and when
- Protein assay data

Create this hypothetical S3 object in R.

2. Among the test data sets that are part of base R, there is one called **iris**. It contains measurements on a set of plants. You can access the data by typing `iris` in R. Explore this data set and calculate some useful summary statistics, like SD, mean and median for the parts of the data where this makes sense. Calculate the same statistics for any grouping that you can find in the data.
```
25 changes: 13 additions & 12 deletions lab_loadingdata.Rmd
@@ -26,7 +26,7 @@ output:

# Introduction

Up until now we have mostly created the objects we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also show to save data from R. After this exercise you will know how to:
Up until now we have mostly created the objects we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also how to save data from R. After this exercise you will know how to:

- Read data from txt files and save the information as a vector, data frame or a list.
- Identify missing data and correctly encode this at import
@@ -66,17 +66,17 @@ shelley.vec[381]
2. Go back and fix the way you read in the text to make sure that you get a vector with all words in the chapter as individual entries. Also filter out any non-letter characters, and then identify the longest word.

```{r,accordion=TRUE}
shelley.vec2 <- scan(file="https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/book_chapter.txt", what='character', sep=' ', quote=NULL)
shelley.vec2 <- scan(file="https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/book_chapter.txt", what=character(), sep=' ', quote=NULL)
shelley.filt2 <- gsub(pattern='[^[:alnum:] ]', replacement="", x=shelley.vec2)
which(nchar(shelley.filt2) == max(nchar(shelley.filt2)))
shelley.filt2[301]
longest <- which(nchar(shelley.filt2) == max(nchar(shelley.filt2)))
shelley.filt2[longest]
```
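
Indexing with the result of `which()` rather than a hard-coded position also handles the case where several words share the maximum length. A minimal sketch on a made-up vector (`words` is hypothetical and not taken from the chapter):

```{r}
# Toy vector standing in for the filtered words
words <- c("the", "monster", "fled", "creator", "ice")

word.len <- nchar(words)
longest <- which(word.len == max(word.len))  # indices of every longest word
words[longest]

# which.max() would return only the first of the tied positions
which.max(word.len)
```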

# `read.table()`

This is by far the most common way to get data into R. As the function creates a data frame at import, it will only work for data sets that fit those criteria, meaning that the data needs to have a set of columns of equal length that are separated with a common string, e.g. tab, comma, semicolon etc.

In this code block with first import the data from [normalized.txt](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt) and accept the defaults for all other arguments in the function. With these settings R splits the fields on whitespace (tabs included) and, provided the header row has one field fewer than the data rows, uses the first row of the data as column names (header) and the first column as row names.
In this code block we first import the data from [normalized.txt](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt) and accept the defaults for all other arguments in the function. With these settings R splits the fields on whitespace (tabs included) and, provided the header row has one field fewer than the data rows, uses the first row of the data as column names (header) and the first column as row names.

```{r,accordion=TRUE}
expr.At <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt")
@@ -85,7 +85,7 @@ head(expr.At)

One does not, however, need to have all data as files on the local disk; one can also read data from online resources. The following command will read in a file from a web server.

```{r,accordion=TRUE, error=T}
```{r,accordion=TRUE}
url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone <- read.table(url, header=FALSE , sep=',')
head(abalone)
@@ -94,28 +94,28 @@ head(abalone)
1. Read this [example data](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data) to R using the `read.table()` function. This file consists of gene expression values. Once you have the object in R, validate that it looks okay and export it using the `write.table()` function.

```{r,accordion=TRUE}
ed <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data", sep=":")
ed <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data", sep=":", header = T)
head(ed)
str(ed)
```
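
The effect of `header` can be checked without downloading anything by passing a small inline sample through the `text` argument of `read.table()`. The sample below is made up for illustration and only mimics a colon-separated layout:

```{r}
# Small colon-separated sample, invented for this sketch
txt <- "gene:sampleA:sampleB
g1:1.5:2.0
g2:NA:3.1"

no.header <- read.table(text = txt, sep = ":")
with.header <- read.table(text = txt, sep = ":", header = TRUE)

str(no.header)    # header row is read as data, so the columns are not numeric
str(with.header)  # numeric columns are detected once the header is set aside
```

Running `str()` on the imported object is a quick way to spot a header that was read as data.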

Encode all NA values as "missing", at export.

```{r,eval=FALSE,accordion=TRUE}
write.table(x=ed, na="missing", file="example_mis.data")
write.table(x=ed, na="missing", file="example_write.txt")
```

2. Read in the file you just created and double-check that you have the same data as earlier.

```{r,eval=FALSE,accordion=TRUE}
df.test <- read.table("example_mis.data", na.strings="missing")
df.test <- read.table("example_write.txt", na.strings="missing")
```
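
The export and re-import steps can be tried end to end on a small data frame. This is a sketch under stated assumptions: `ed.toy` is invented data and the file is written to a temporary path rather than to the file name used in the exercise.

```{r}
# Made-up data frame with one missing value
ed.toy <- data.frame(gene = c("g1", "g2", "g3"),
                     expr = c(1.5, NA, 3.1),
                     stringsAsFactors = FALSE)

out.file <- tempfile(fileext = ".txt")   # temporary path, used only for this sketch
write.table(x = ed.toy, na = "missing", file = out.file)
back <- read.table(out.file, na.strings = "missing", stringsAsFactors = FALSE)

is.na(back$expr)                     # the "missing" string is read back as NA
all.equal(ed.toy$expr, back$expr)    # the numeric column survives the round trip
```

Without `na.strings = "missing"`, the word "missing" would be treated as text and the whole column would no longer be numeric.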

3. Analysing genome annotation in R using read.table

For this exercise we will load a GTF file into R and calculate some basic summary statistics from the file. In the first part we will use basic manipulations of data frames to extract the information. In the second part you get to try out a library designed to work with annotation data, which stores the information in a more complex format that allows for easy manipulation and calculation of summaries from genome annotation files.

For those not familiar with the gtf format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.
For those not familiar with the GTF format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.

A valid GTF file should contain the following tab-delimited fields (taken from the Ensembl home page).

@@ -136,7 +136,7 @@ A valid GTF file should contain the following tab delimited fields (taken from t

The last column can contain a large number of attributes that are semicolon-separated.

Since these files are large for many organisms, in this exercise we will use the Drosophila melanogaster genome annotation from Ensembl release 86, available at `ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster`, which is small enough for analysis even on a laptop.
Since these files are large for many organisms, in this exercise we will use the *Drosophila melanogaster* genome annotation from Ensembl release 86, available at `ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster`, which is small enough for analysis even on a laptop.

Download the file named **Drosophila_melanogaster.BDGP6.86.gtf.gz** to your computer. Unzip this file and keep track of where you store the file.

@@ -166,13 +166,14 @@ str(d.gtf)
1. How many chromosome names can be found in the annotation file?

```{r,accordion=TRUE}
levels(d.gtf$Chromosome)
length(levels(as.factor(d.gtf$Chromosome)))
```
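
Since R 4.0.0, `read.table()` no longer converts strings to factors by default, so calling `levels()` directly on a character column returns `NULL`. A short sketch on a made-up vector (`chrom` is hypothetical, not read from the GTF file) shows the difference and an equivalent count with `unique()`:

```{r}
# Toy chromosome column; the names are illustrative only
chrom <- c("2L", "2R", "2L", "X", "X", "3L")

levels(chrom)                      # NULL: a character vector has no levels
length(levels(as.factor(chrom)))   # distinct names after conversion to factor
length(unique(chrom))              # same count without the conversion
```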

2. How many **exons** are there in total and per chromosome? (hint: first extract lines that have `feature == 'exon'`)

```{r,accordion=TRUE}
d.gtf.exons <- d.gtf[(d.gtf$Feature == 'exon'),]
nrow(d.gtf.exons)
aggregate(d.gtf.exons$Feature, by=list(d.gtf.exons$Chromosome), summary)
```
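
For simple per-group counts, `table()` is a possible alternative to `aggregate()` and tends to give more compact output. A minimal sketch on an invented annotation table (`gtf.toy` and its values are assumptions, with column names chosen to mirror the exercise):

```{r}
# Toy annotation table; values are invented for illustration
gtf.toy <- data.frame(Chromosome = c("2L", "2L", "X", "X", "X", "3R"),
                      Feature = c("exon", "gene", "exon", "exon", "gene", "exon"),
                      stringsAsFactors = FALSE)

exons.toy <- gtf.toy[gtf.toy$Feature == "exon", ]
nrow(exons.toy)                            # total number of exons
per.chrom <- table(exons.toy$Chromosome)   # exons per chromosome
per.chrom
```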

8 changes: 4 additions & 4 deletions slide_r_elements_3.Rmd
@@ -337,7 +337,7 @@ name: data_frames_accessing

# Data frames &mdash; accessing values

- We can always use the `[]` notation to access values inside data frames.
- We can always use the `[row,column]` notation to access values inside data frames.

```{r data.frame.access, echo=T}
df[1,] # get the first row
@@ -516,12 +516,12 @@ name: lists_nested
We can use lists to store hierarchies of data:

```{r lists_nested, echo=T}
ikea_lund <- list(park = 125)
ikea_lund <- list(parking = 125)
ikea_sweden <- list(ikea_lund = ikea_lund,
ikea_uppsala = ikea_uppsala)
# use names to navigate inside the hierarchy
ikea_sweden$ikea_lund$park
ikea_sweden$ikea_uppsala$park
ikea_sweden$ikea_lund$parking
ikea_sweden$ikea_uppsala$parking
```
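
A self-contained variant of the nested list idea is sketched below. The store names and parking values are hypothetical, since the excerpt above does not show how `ikea_uppsala` is defined:

```{r}
# Hypothetical stores; values are made up for this sketch
store_a <- list(parking = 125)
store_b <- list(parking = 80)
stores <- list(store_a = store_a, store_b = store_b)

stores$store_a$parking                  # navigate by name with $
stores[["store_b"]][["parking"]]        # same idea with [[ ]], useful when names are held in variables
sapply(stores, function(s) s$parking)   # one value per store
```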


