Added slides for parquet + {arrow}
mthomas-ketchbrook committed Sep 17, 2023
1 parent e662919 commit 64b8a60
Showing 1 changed file with 42 additions and 1 deletion.
43 changes: 42 additions & 1 deletion materials/d1-03-performance/index.qmd
@@ -1,7 +1,6 @@
---
title: "Performance"
subtitle: "posit::conf(2023) <br> Shiny in Production: Tools & Techniques"
author: "TBD"
footer: "[{{< var workshop_short_url >}}]({{< var workshop_full_url >}})"
format:
revealjs:
@@ -14,6 +13,14 @@ format:
history: false
---

## Performance Agenda

* Profiling your Shiny app with {profvis}
* Lightning-quick data loading/querying with {arrow} & *.parquet* files
* Asynchronous processes with {crew}

# {profvis}: an R package for profiling R code <br>(including Shiny apps)

## What is {profvis}?

An R package for visualizing how your R code runs, and how fast or slow each step is
@@ -72,3 +79,37 @@ You need to wrap the `run_app()` function in `print()` before passing it to `profvis()`
- Describe the flamegraph, change the filters to only show events that took time
- Navigate to the 'data' tab and discuss what took the most time
:::
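
Putting the note above into code: a minimal sketch of profiling the app with {profvis}, assuming the deck's golem-style `run_app()` launcher (the `print()` wrapper is what forces the returned app object to actually launch inside `profvis()`):

```r
library(profvis)

# Profile a golem-style app: wrap run_app() in print() so the
# shiny.appobj it returns is actually launched inside profvis()
profvis({
  print(run_app())
})

# For a plain app directory, runApp() can be profiled the same way:
# profvis({ shiny::runApp("app") })
```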

# {arrow} & the *.parquet* file format

## What are *.parquet* files?

* *.parquet* is a *columnar* storage format
* *.parquet* files not only store data, but also metadata about your data (e.g., data types for each column, number of rows in the file)
* Smaller files than the equivalent CSV
* Faster read speeds (see the quick sketch after these notes)

::: {.notes}
- HOT TAKE INCOMING: parquet is the new csv
- parquet files are typically the storage layer behind projects like the open-source Delta Lake
- faster across pretty much all benchmarks
:::
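
As a rough illustration of the smaller-files / faster-reads claims, here is a sketch that writes the same data frame to CSV and to parquet and compares the results; the `nycflights13::flights` data and the file paths are illustrative assumptions, not part of the workshop materials:

```r
library(arrow)

flights <- nycflights13::flights   # any reasonably large data frame works

# Write the same data to both formats
write.csv(flights, "flights.csv", row.names = FALSE)
write_parquet(flights, "flights.parquet")

# The parquet file is typically a fraction of the CSV's size...
file.size("flights.csv")
file.size("flights.parquet")

# ...and reads back faster, with column types restored from the metadata
flights_pq <- read_parquet("flights.parquet")
```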

## What is the {arrow} R package?

* Part of the larger Apache Arrow project
* Connect to your data with {arrow}...
* ... and query it with {dplyr}

. . .

[Apache Arrow Homepage](https://arrow.apache.org/)

[Shiny + Arrow Article](https://posit.co/blog/shiny-and-arrow/)

::: {.notes}
- "multi-language toolbox for accelerated data interchange and in-memory processing"
- I.e., a set of data manipulation standards (particularly against parquet files) that has been implemented in a bunch of languages including R, Python, Rust, Go, and more
- {arrow} lets you use {dplyr} verbs against a single parquet file (or, perhaps more importantly, a *set* of parquet files) to query the data in those files -- a quick sketch follows these notes
- When it comes to building Shiny apps, we should look for easy places to gain efficiency & speed and improve the user experience -- you don't want users waiting 20 seconds for your data prep logic to run against a single massive csv. It's very likely that the combination of .parquet + {arrow} + {dplyr} can meet your app's performance needs; it does for at least 95% of my use cases, and there are very few cases where I have to go beyond that and look into other engines for faster data manipulation
:::
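
A minimal sketch of that {arrow} + {dplyr} workflow, assuming a hypothetical `data/flights/` folder of parquet files with nycflights13-style columns:

```r
library(arrow)
library(dplyr)

# open_dataset() just points at the files; nothing is read into memory yet
flights_ds <- open_dataset("data/flights")

flights_ds |>
  filter(origin == "EWR", month == 1) |>
  group_by(carrier) |>
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()   # only the small aggregated result is pulled into R
```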
