-
Notifications
You must be signed in to change notification settings - Fork 27
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
163 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,163 @@ | ||
--- | ||
title: "Assignment 04 - HPC and SQL" | ||
output: tufte::tufte_html | ||
date: 2024-11-08 | ||
--- | ||
|
||
## Due Date | ||
|
||
This assignment is due by 11:59pm Pacific Time, November 22nd, 2024. | ||
|
||
The learning objectives are to write faster code for computational task requiring a loop and to implement some queries and basic data wrangling in SQL. | ||
|
||
## HPC | ||
|
||
### Make things run faster | ||
Rewrite the following R functions to make them faster. It is OK (and recommended) to take a look at StackOverflow and Google. | ||
|
||
|
||
``` r | ||
# Total row sums | ||
fun1 <- function(mat) { | ||
n <- nrow(mat) | ||
ans <- double(n) | ||
for (i in 1:n) { | ||
ans[i] <- sum(mat[i, ]) | ||
} | ||
ans | ||
} | ||
|
||
fun1alt <- function(mat) { | ||
# YOUR CODE HERE | ||
} | ||
|
||
# Cumulative sum by row | ||
fun2 <- function(mat) { | ||
n <- nrow(mat) | ||
k <- ncol(mat) | ||
ans <- mat | ||
for (i in 1:n) { | ||
for (j in 2:k) { | ||
ans[i,j] <- mat[i, j] + ans[i, j - 1] | ||
} | ||
} | ||
ans | ||
} | ||
|
||
fun2alt <- function(mat) { | ||
# YOUR CODE HERE | ||
} | ||
``` | ||
|
||
### Question 1 | ||
Using the dataset generated below (`dat`), check that both of your new functions produce the same outputs as the corresponding original functions. | ||
|
||
|
||
``` r | ||
# Use the data with this code | ||
set.seed(2315) | ||
dat <- matrix(rnorm(200 * 100), nrow = 200) | ||
``` | ||
|
||
Then use `microbenchmark` to check that your version is actually faster. How much faster is it? | ||
|
||
|
||
``` r | ||
# Test for the first | ||
microbenchmark::microbenchmark( | ||
fun1(dat), | ||
fun1alt(dat), unit = "relative" | ||
) | ||
|
||
# Test for the second | ||
microbenchmark::microbenchmark( | ||
fun2(dat), | ||
fun2alt(dat), unit = "relative" | ||
) | ||
``` | ||
|
||
|
||
### Make things run faster with parallel computing | ||
|
||
The following function allows simulating pi: | ||
|
||
|
||
``` r | ||
sim_pi <- function(n = 1000, i = NULL) { | ||
p <- matrix(runif(n*2), ncol = 2) | ||
mean(rowSums(p^2) < 1) * 4 | ||
} | ||
|
||
# Here is an example of the run | ||
set.seed(156) | ||
sim_pi(1000) # 3.132 | ||
``` | ||
|
||
In order to get accurate estimates, we can run this function multiple times, with the following code: | ||
|
||
|
||
``` r | ||
# This runs the simulation a 4,000 times, each with 10,000 points | ||
set.seed(1231) | ||
system.time({ | ||
ans <- unlist(lapply(1:4000, sim_pi, n = 10000)) | ||
print(mean(ans)) | ||
}) | ||
``` | ||
|
||
### Question 2 | ||
Rewrite the previous code using `parLapply()` (or your parallelization method of choice) to parallelize it. Run the code once, using `system.time()`, to show that your version is faster. | ||
|
||
|
||
``` r | ||
# YOUR CODE HERE | ||
system.time({ | ||
# YOUR CODE HERE | ||
ans <- # YOUR CODE HERE | ||
print(mean(ans)) | ||
# YOUR CODE HERE | ||
}) | ||
``` | ||
|
||
## SQL | ||
|
||
Setup a temporary database by running the following chunk | ||
|
||
|
||
``` r | ||
# install.packages(c("RSQLite", "DBI")) | ||
|
||
library(RSQLite) | ||
library(DBI) | ||
|
||
# Initialize a temporary in memory database | ||
con <- dbConnect(SQLite(), ":memory:") | ||
|
||
# Download tables | ||
film <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/film.csv") | ||
film_category <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/film_category.csv") | ||
category <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/category.csv") | ||
|
||
# Copy data.frames to database | ||
dbWriteTable(con, "film", film) | ||
dbWriteTable(con, "film_category", film_category) | ||
dbWriteTable(con, "category", category) | ||
``` | ||
|
||
When you write a new chunk, remember to replace the `r` with `sql, connection=con`. Some of these questions will require you to use an inner join. Read more about them here https://www.w3schools.com/sql/sql_join_inner.asp | ||
|
||
## Question 3 | ||
|
||
How many many movies are available in each `rating` category? | ||
|
||
## Question 4 | ||
|
||
What is the average replacement cost and rental rate for each `rating` category? | ||
|
||
## Question 5 | ||
|
||
Use table `film_category` together with `film` to find how many films there are with each category ID. | ||
|
||
## Question 6 | ||
|
||
Incorporate the `category` table into the answer to the previous question to find the name of the most popular category. |