---
title: "Packages & Data"
execute:
  eval: false
---
Welcome to the `Big Data in R with Arrow` workshop! On this page you will find information about the software, packages and data we will be using during the workshop. Please try your very best to arrive on the day ready---software & packages installed and data downloaded on to your laptop.
# Software
You will need a laptop with [R](https://cloud.r-project.org/) and the [RStudio Desktop IDE](https://posit.co/download/rstudio-desktop/) installed, and with sufficient disk space for the workshop data sets and exercises---we recommend a minimum of \~80GB to work with the "larger-than-memory" data option or \~2-3GB to work with the smaller example---or "tiny"---data option. The workshop will also have a dedicated Posit Cloud work space ready with the packages and the tiny sample data sets loaded for those who need or prefer to participate through a browser.
# Packages
To install the required core packages for the day, run the following:
```{r}
#| label: install-core-packages
#| message: false
#| eval: false
install.packages(c(
  "here", "arrow", "dplyr", "duckdb", "dbplyr", "stringr", "lubridate", "tictoc"
))
```
And to load them:
```{r}
#| label: load-core-packages
#| message: false
library(here)
library(arrow)
library(dplyr)
library(duckdb)
library(dbplyr)
library(stringr)
library(lubridate)
library(tictoc)
```
Please come with the *latest CRAN versions* of the core packages installed on your laptops. Note that the latest version of the `arrow` package---**`arrow 13.0.0`**---was released on 30 August 2023. A small amount of the code in the workshop relies on functions introduced in `arrow 13.0.0`; let us know if for any reason you are unable to install this version, and we'll flag up any modifications you'll need to make the workshop code run with earlier versions of the `arrow` package.
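
One way to confirm what is installed before the day is sketched below. It uses only base R; the helper name `has_min_version()` is hypothetical and not part of the workshop materials.

```{r}
#| label: check-package-versions
#| eval: false
# Hypothetical helper (not part of the workshop code): returns TRUE when `pkg`
# is installed at `min_version` or newer, and FALSE otherwise
has_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}
# The core packages listed above
core_pkgs <- c(
  "here", "arrow", "dplyr", "duckdb", "dbplyr", "stringr", "lubridate", "tictoc"
)
# Names of any core packages that are not installed yet
missing_pkgs <- setdiff(core_pkgs, rownames(installed.packages()))
# TRUE if arrow 13.0.0 or later is installed
arrow_ok <- has_min_version("arrow", "13.0.0")
```

If `missing_pkgs` is empty and `arrow_ok` is `TRUE`, the core packages should be in place.
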
While the core workshop doesn't focus on spatial data, the "wrapping up" section near the end of the day results in a ggplot map. If you want to run the "wrapping up" workflow code you will need to install the following spatial and plotting packages:
```{r}
#| label: install-plot-packages
#| message: false
#| eval: false
install.packages(c("ggplot2", "ggrepel", "sf", "scales", "janitor"))
```
And to load them:
```{r}
#| label: load-plot-packages
#| message: false
library(ggplot2)
library(ggrepel)
library(sf)
library(scales)
library(janitor)
```
# Data
During the 1-day workshop, you will need the following data sets:
1. *NYC Yellow Taxi Trip Record Data*: Partitioned parquet files released as open data from the [NYC Taxi & Limousine Commission (TLC)](https://www.nyc.gov/site/tlc/about/raw-data.page) with a pre-tidied subset (\~40GB) downloaded with either `arrow` or via https from an AWS S3 bucket
2. *Seattle Public Library Checkouts by Title*: A single CSV file (9GB) from the [Seattle Open Data portal](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6) downloaded via https from an AWS S3 bucket
3. *Taxi Zone Lookup CSV Table & Taxi Zone Shapefile*: two NYC Taxi trip ancillary data files from the [TLC open platform](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) downloaded via https directly from this repository
## Larger-Than-Memory Data Option
### 1. NYC Taxi Data
This is the main data set we will need for the day. It's pretty hefty---*40GB* in total---and there are a couple of options for how to acquire it, depending on your internet connection speed.
#### Option 1---the simplest option---for those with a good internet connection and happy to let things run
If you have a solid internet connection, and especially if you're in the US/Canada, this option is the simplest. You can use `arrow` itself to download the data. Note that there are no progress bars displayed during download, and so your session will appear to hang, but you can check progress by inspecting the contents of the download directory. When we tested this with Steph's laptop and a fast internet connection, it took 67 minutes, though results will likely vary widely.
This method requires `arrow` to have been built with S3 support enabled. This is on by default for most macOS and Windows users, but if you're on Linux, take a look at the instructions [here](https://arrow.apache.org/docs/r/articles/install.html#libraries).
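
A quick way to check is `arrow_with_s3()`, which reports whether the installed `arrow` build includes S3 support. The sketch below is guarded so that it also runs if `arrow` is not installed yet; the variable name `s3_ok` is only illustrative.

```{r}
#| label: check-s3-support
#| eval: false
# TRUE only if arrow is installed and was built with S3 support
s3_ok <- requireNamespace("arrow", quietly = TRUE) && arrow::arrow_with_s3()
if (!s3_ok) {
  message("arrow is missing or lacks S3 support; consider the https download instead")
}
```
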
After installing arrow, run the following code:
```{r}
#| label: download-taxi-arrow
#| message: false
#| eval: false
library(arrow)
library(dplyr)
data_path <- here::here("data/nyc-taxi") # Or set your own preferred path
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
  filter(year %in% 2012:2021) |>
  write_dataset(data_path, partitioning = c("year", "month"))
```
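
Because no progress bar is shown, one way to watch the download is to total the size of the files written so far, for example from a second R session. This is a minimal sketch using base R only; the helper name `downloaded_gb()` is hypothetical.

```{r}
#| label: download-progress
#| eval: false
# Hypothetical helper: total size, in GB, of all files found under `dir`
# (searched recursively, so every year=/month= partition folder is included)
downloaded_gb <- function(dir) {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^3
}
```

Calling `downloaded_gb(data_path)` every so often should show the total growing towards the roughly 40GB mentioned above.
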
Once this has completed, you can check everything has downloaded correctly by calling:
```{r}
#| label: download-check
#| message: false
#| eval: false
open_dataset(data_path) |>
  nrow()
```
It might take a moment to run (the data has over a billion rows!), but you should expect to see:
```
[1] 1150352666
```
If you get an error message, your download may have been interrupted at some point. The error message will name the file which could not be read. Manually delete this file and run the `nrow()` code snippet again until you successfully load the remaining data. You can then download any missing files individually using option 2.
#### Option 2---one file at a time via https
If you have a slower internet connection or are further away from the data S3 bucket location, it's probably going to be simpler to download the data file-by-file. Or, if you had any interruptions to your download process in the previous step, you can either try instead with this method, or delete the files which weren't downloaded properly, and use this method to just download the files you need.
We've created a script for you which downloads the data one file at a time via https. The script also checks for previously downloaded data, so if you encounter problems downloading any files, just delete the partially downloaded file and run again---the script will only download files which are missing.
```{r}
#| label: download-taxi-curl
#| message: false
#| eval: false
download_via_https <- function(data_dir, years = 2012:2021){
  # Set this option as we'll be downloading large files and R has a default
  # timeout of 60 seconds, so we've updated this to 30 mins
  options(timeout = 1800)
  # The S3 bucket where the data is stored
  bucket <- "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com"
  # Collect any errors raised during the download process
  problems <- c()
  # Download the data from S3 - loops through the data files, downloading 1 file at a time
  for (year in years) {
    # We only have 2 months for 2022 data
    if (year == 2022) {
      months <- 1:2
    } else {
      months <- 1:12
    }
    for (month in months) {
      # Work out where we're going to be saving the data
      partition_dir <- paste0("year=", year, "/month=", month)
      dest_dir <- file.path(data_dir, partition_dir)
      dest_file_path <- file.path(dest_dir, "part-0.parquet")
      # If the file doesn't exist
      if (!file.exists(dest_file_path)) {
        # Create the partition to store the data in
        if (!dir.exists(dest_dir)) {
          dir.create(dest_dir, recursive = TRUE)
        }
        # Work out where we are going to be retrieving the data from
        source_path <- file.path(bucket, "nyc-taxi", partition_dir, "part-0.parquet")
        # Download the data - save any error messages that occur
        tryCatch(
          download.file(source_path, dest_file_path, mode = "wb"),
          error = function(e){
            # Use <<- so the message is added to the `problems` vector of the
            # enclosing function, not to a copy local to this handler
            problems <<- c(problems, e$message)
          }
        )
      }
    }
  }
  print("Downloads complete")
  if (length(problems) > 0) {
    warning(call. = FALSE, "The following errors occurred during download:\n", paste(problems, collapse = "\n"))
  }
}
data_path <- here::here("data/nyc-taxi") # Or set your own preferred path
download_via_https(data_path)
```
Once this has completed, you can check everything has downloaded correctly by calling:
```{r}
#| label: download-check-again
#| message: false
#| eval: false
open_dataset(data_path) |>
  nrow()
```
It might take a moment to run (the data has over a billion rows), but you should expect to see:
```
[1] 1150352666
```
If you get an error message, your download may have been interrupted at some point. The error message will name the file which could not be read. Manually delete this file and run the `nrow()` code snippet again until you successfully load the data. You can then download any missing files by re-running `download_via_https(data_path)`.
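
To see which files are still missing without opening the dataset, the expected paths can be compared against what is on disk. This is a minimal sketch, assuming the `year=<year>/month=<month>/part-0.parquet` layout written by `download_via_https()`; the helper name `missing_partitions()` is hypothetical.

```{r}
#| label: find-missing-partitions
#| eval: false
# Hypothetical helper: paths of expected partition files that are not on disk.
# Assumes the year=/month=/part-0.parquet layout used by download_via_https()
missing_partitions <- function(data_dir, years = 2012:2021) {
  # One row per year/month combination
  grid <- expand.grid(month = 1:12, year = years)
  expected <- file.path(
    data_dir,
    paste0("year=", grid$year, "/month=", grid$month),
    "part-0.parquet"
  )
  expected[!file.exists(expected)]
}
```

With the default years there are 120 expected files, so `missing_partitions(data_path)` should return an empty character vector once everything has downloaded. Note that this only checks that each file exists, not that it downloaded completely.
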
### 2. Seattle Checkouts by Title Data
This is the data we will use to explore some data storage and engineering options. It's a good-sized, single CSV file---*9GB* on-disk in total---which can be downloaded from an AWS S3 bucket via https:
```{r}
#| label: get-seattle-csv
#| eval: false
options(timeout = 1800)
download.file(
  url = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  destfile = here::here("data/seattle-library-checkouts.csv")
)
```
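
`download.file()` does not create missing folders, so the call above will fail if the `data/` folder does not exist yet (for example, if you have not downloaded the taxi data first). A minimal sketch of a guard, using a hypothetical helper name `ensure_dir()`:

```{r}
#| label: ensure-data-dir
#| eval: false
# Hypothetical helper: create `path` (and any parent folders) if it is missing,
# then return the path invisibly
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}
```

For example, run `ensure_dir(here::here("data"))` before any of the `download.file()` calls on this page.
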
## Tiny Data Option
If you don't have time or disk space to download the larger-than-memory data sets (and still have disk space to do the exercises), you can run the code and exercises in the course with "tiny" versions of these data. Although the focus in this course is working with larger-than-memory data, you can still learn about the concepts and workflows with smaller data---although note you may not see the same performance improvements that you would get when working with larger data.
### 1. Tiny NYC Taxi Data
We've created a "tiny" NYC Taxi data set which contains only 1 in 1000 rows from the original data set. So instead of working with 1.15 billion rows of data and about 40GB of files, the tiny taxi data set is 1.15 million rows and about 50MB of files. You can download the tiny NYC Taxi data directly from this repo via https:
```{r}
#| label: download-tiny-taxi
#| message: false
#| eval: false
options(timeout = 1800)
download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip",
  destfile = here::here("data/nyc-taxi-tiny.zip")
)
# Extract the partitioned parquet files from the zip folder:
unzip(
  zipfile = here::here("data/nyc-taxi-tiny.zip"),
  exdir = here::here("data/")
)
```
### 2. Tiny Seattle Checkouts by Title Data
We've created a "tiny" Seattle Checkouts by Title data set which contains only 1 in 100 rows from the original data set. So instead of working with \~41 million rows of data in a 9GB file, the tiny Seattle checkouts data set is \~410 thousand rows in a 90MB file. You can download the tiny Seattle Checkouts by Title data directly from this repo via https:
```{r}
#| label: download-tiny-seattle
#| message: false
#| eval: false
options(timeout = 1800)
download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv",
  destfile = here::here("data/seattle-library-checkouts-tiny.csv")
)
```
## Both Data Options / Everyone
### 3. Taxi Zone Lookup CSV Table & Taxi Zone Shapefile
You can download the two NYC Taxi trip ancillary data files directly from this repo via https:
```{r}
#| label: get-taxi-shape-again
#| eval: false
options(timeout = 1800)
download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zone_lookup.csv",
  destfile = here::here("data/taxi_zone_lookup.csv")
)
download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/taxi_zones.zip",
  destfile = here::here("data/taxi_zones.zip")
)
# Extract the spatial files from the zip folder:
unzip(
  zipfile = here::here("data/taxi_zones.zip"),
  exdir = here::here("data/taxi_zones")
)
```
## Data on The Day Of
While we ask that everyone do their best to arrive on the day with software & packages installed and the data downloaded on to their laptops, we recognize that life happens. We will have 5 USB flash drives (and a few USB-C to USB adapters) on hand the morning of the workshop with copies of ***all the larger-than-memory and tiny versions of the data sets***. So if you have trouble downloading any of the data beforehand, we should be able to get you sorted before the day starts or as soon as possible as the day begins. And if disk space or laptop permissions are blockers, the workshop will have a dedicated Posit Cloud work space ready for participants.