rebuilt website

USCbiostats · Dec 9, 2024 · 1e6ae37 · 1e6ae37
1 parent 41ad891
commit 1e6ae37
Showing 1 changed file with 163 additions and 0 deletions.
diff --git a/website/content/assignment/11-hw4.md b/website/content/assignment/11-hw4.md
@@ -0,0 +1,163 @@
+---
+title: "Assignment 04 - HPC and SQL"
+output: tufte::tufte_html
+date: 2024-11-08
+---
+
+## Due Date
+
+This assignment is due by 11:59pm Pacific Time, November 22nd, 2024.
+
+The learning objectives are to write faster code for computational task requiring a loop and to implement some queries and basic data wrangling in SQL.
+
+## HPC
+
+### Make things run faster
+Rewrite the following R functions to make them faster. It is OK (and recommended) to take a look at StackOverflow and Google.
+
+
+``` r
+# Total row sums
+fun1 <- function(mat) {
+  n <- nrow(mat)
+  ans <- double(n) 
+  for (i in 1:n) {
+    ans[i] <- sum(mat[i, ])
+  }
+  ans
+}
+
+fun1alt <- function(mat) {
+  # YOUR CODE HERE
+}
+
+# Cumulative sum by row
+fun2 <- function(mat) {
+  n <- nrow(mat)
+  k <- ncol(mat)
+  ans <- mat
+  for (i in 1:n) {
+    for (j in 2:k) {
+      ans[i,j] <- mat[i, j] + ans[i, j - 1]
+    }
+  }
+  ans
+}
+
+fun2alt <- function(mat) {
+  # YOUR CODE HERE
+}
+```
+
+### Question 1
+Using the dataset generated below (`dat`), check that both of your new functions produce the same outputs as the corresponding original functions.
+
+
+``` r
+# Use the data with this code
+set.seed(2315)
+dat <- matrix(rnorm(200 * 100), nrow = 200)
+```
+
+Then use `microbenchmark` to check that your version is actually faster. How much faster is it?
+
+
+``` r
+# Test for the first
+microbenchmark::microbenchmark(
+  fun1(dat),
+  fun1alt(dat), unit = "relative"
+)
+
+# Test for the second
+microbenchmark::microbenchmark(
+  fun2(dat),
+  fun2alt(dat), unit = "relative"
+)
+```
+
+
+### Make things run faster with parallel computing
+
+The following function allows simulating pi:
+
+
+``` r
+sim_pi <- function(n = 1000, i = NULL) {
+  p <- matrix(runif(n*2), ncol = 2)
+  mean(rowSums(p^2) < 1) * 4
+}
+
+# Here is an example of the run
+set.seed(156)
+sim_pi(1000) # 3.132
+```
+
+In order to get accurate estimates, we can run this function multiple times, with the following code:
+
+
+``` r
+# This runs the simulation a 4,000 times, each with 10,000 points
+set.seed(1231)
+system.time({
+  ans <- unlist(lapply(1:4000, sim_pi, n = 10000))
+  print(mean(ans))
+})
+```
+
+### Question 2
+Rewrite the previous code using `parLapply()` (or your parallelization method of choice) to parallelize it. Run the code once, using `system.time()`, to show that your version is faster.
+
+
+``` r
+# YOUR CODE HERE
+system.time({
+  # YOUR CODE HERE
+  ans <- # YOUR CODE HERE
+  print(mean(ans))
+  # YOUR CODE HERE
+})
+```
+
+## SQL
+
+Setup a temporary database by running the following chunk
+
+
+``` r
+# install.packages(c("RSQLite", "DBI"))
+
+library(RSQLite)
+library(DBI)
+
+# Initialize a temporary in memory database
+con <- dbConnect(SQLite(), ":memory:")
+
+# Download tables
+film <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/film.csv")
+film_category <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/film_category.csv")
+category <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/category.csv")
+
+# Copy data.frames to database
+dbWriteTable(con, "film", film)
+dbWriteTable(con, "film_category", film_category)
+dbWriteTable(con, "category", category)
+```
+
+When you write a new chunk, remember to replace the `r` with `sql, connection=con`. Some of these questions will require you to use an inner join. Read more about them here https://www.w3schools.com/sql/sql_join_inner.asp
+
+## Question 3
+
+How many many movies are available in each `rating` category?
+
+## Question 4
+
+What is the average replacement cost and rental rate for each `rating` category?
+
+## Question 5
+
+Use table `film_category` together with `film` to find how many films there are with each category ID.
+
+## Question 6
+
+Incorporate the `category` table into the answer to the previous question to find the name of the most popular category.