Reorganise repo structure and store input data as parquet #50

Merged
merged 41 commits into from
Sep 4, 2024
Changes from 40 commits
d76c5a0
Move raw data to `data-raw/` and scripts to `scripts/`
milanmlft Aug 30, 2024
da302a2
Add master script to generate dummy data
milanmlft Aug 30, 2024
f83ccbf
Move envvar declarations to master script
milanmlft Aug 30, 2024
6223faf
Put all scripts in top-level
milanmlft Aug 30, 2024
ea5bc14
Rename scripts
milanmlft Aug 30, 2024
ba7e245
Use correct name in `here::i_am()`
milanmlft Aug 30, 2024
5ddac93
Rename `inst/test_data` to `inst/dev_data`
milanmlft Aug 30, 2024
e846e87
Small script updates
milanmlft Aug 30, 2024
282b7bd
Enable markdown support for roxygen
milanmlft Aug 30, 2024
8cbc2df
Move common functions into the package
milanmlft Aug 30, 2024
391f380
Roxygenise
milanmlft Aug 30, 2024
87d7530
Fix paths to `dev_data`
milanmlft Aug 30, 2024
51f2110
Rename
milanmlft Aug 30, 2024
d799335
Extract `calculate_monthly_counts` helper and move into package
milanmlft Aug 30, 2024
a5a6fa8
Add simple tests for monthly counts
milanmlft Aug 30, 2024
8bc9f86
Roxygenise
milanmlft Aug 30, 2024
454999c
Fix: don't generate duplicate rows for monthly counts
milanmlft Aug 30, 2024
cf94247
Test a sligthly more complex example
milanmlft Aug 30, 2024
860826b
Test that `calculate_monthly_counts` can handle database-stored tables
milanmlft Aug 30, 2024
1aec829
Reduce imports
milanmlft Aug 30, 2024
6a12dce
Use correct function
milanmlft Aug 30, 2024
47a688c
Add `dbplyr` as dev dep for tests
milanmlft Aug 30, 2024
3cdee64
Move summary stat helpers into package
milanmlft Aug 30, 2024
77c7f68
Roxygenise
milanmlft Aug 30, 2024
a91a173
Update script
milanmlft Aug 30, 2024
bc84152
Add tests for summary stats
milanmlft Aug 30, 2024
6f21f5d
Update README
milanmlft Aug 30, 2024
3c423b5
Add `RSQLite` as dev dependency
milanmlft Sep 2, 2024
21a63fe
Write summary tables to parquet instead of adding to database
milanmlft Sep 2, 2024
abf0c01
Rename script, avoid confusion between dev and test data
milanmlft Sep 2, 2024
303dfb0
Use consistent filenames for tables
milanmlft Sep 2, 2024
c9e3794
Read summary tables from parquet instead of db
milanmlft Sep 2, 2024
84e9a78
No more need for SQL script
milanmlft Sep 2, 2024
054300b
Resolve `R CMD check` notes and make linter happy
milanmlft Sep 2, 2024
60a40e2
Change data getters to read in parquet files
milanmlft Sep 2, 2024
eb206c0
Clean up envvars
milanmlft Sep 2, 2024
13d3a8c
Add placeholder dir for production data
milanmlft Sep 2, 2024
38dca9a
Add test docker compose
milanmlft Sep 2, 2024
ad239e6
Update package build
milanmlft Sep 2, 2024
8d7f7cb
Describe file structure of the repo
milanmlft Sep 2, 2024
0e023d0
Merge branch 'main' into milanmlft/reorganise-repo
milanmlft Sep 4, 2024
6 changes: 4 additions & 2 deletions .Rbuildignore
@@ -2,13 +2,15 @@
^renv\.lock$
^.*\.Rproj$
^\.Rproj\.user$
^data-raw$
dev_history.R
^dev$
$run_dev.*
^.here$
^LICENSE\.md$
^\.github$
^\.lintr$
^\.renvignore$
^data$
^data-raw$
^deploy$
^dev$
^scripts$
11 changes: 0 additions & 11 deletions .Rprofile
@@ -12,14 +12,3 @@ if (interactive()) {
}

source("renv/activate.R")

# Path to download Eunomia datasets
Sys.setenv(EUNOMIA_DATA_FOLDER = file.path("dev/test_db/eunomia"))
# Name of the synthetic dataset to use
Sys.setenv(TEST_DB_NAME = "synthea-allergies-10k")
# OMOP CDM version
Sys.setenv(TEST_DB_OMOP_VERSION = "5.3")
# Schema name for data
Sys.setenv(TEST_DB_CDM_SCHEMA = "main")
# Schema name for results
Sys.setenv(TEST_DB_RESULTS_SCHEMA = "main")
3 changes: 2 additions & 1 deletion .lintr
@@ -1,5 +1,6 @@
linters: linters_with_defaults(
line_length_linter(120),
object_name_linter(styles = c("snake_case", "symbols", "camelCase"))
object_name_linter(styles = c("snake_case", "symbols", "camelCase")),
object_length_linter(NULL)
)
encoding: "UTF-8"
8 changes: 6 additions & 2 deletions DESCRIPTION
@@ -20,7 +20,8 @@ Imports:
readr,
lubridate,
dplyr,
cli
cli,
nanoparquet
Suggests:
devtools,
usethis,
@@ -29,9 +30,12 @@ Suggests:
spelling,
here,
CDMConnector,
lintr
lintr,
dbplyr,
RSQLite
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
Config/testthat/edition: 3
Language: en-US
Roxygen: list(markdown = TRUE)
21 changes: 21 additions & 0 deletions NAMESPACE
@@ -1,8 +1,28 @@
# Generated by roxygen2: do not edit by hand

export(calculate_monthly_counts)
export(calculate_summary_stats)
export(connect_to_db)
export(read_parquet_sorted)
export(run_app)
export(write_table)
import(bslib)
import(shiny)
importFrom(dplyr,across)
importFrom(dplyr,all_of)
importFrom(dplyr,arrange)
importFrom(dplyr,bind_rows)
importFrom(dplyr,collect)
importFrom(dplyr,count)
importFrom(dplyr,everything)
importFrom(dplyr,filter)
importFrom(dplyr,group_by)
importFrom(dplyr,mutate)
importFrom(dplyr,n)
importFrom(dplyr,n_distinct)
importFrom(dplyr,rename)
importFrom(dplyr,select)
importFrom(dplyr,summarise)
importFrom(ggplot2,aes)
importFrom(ggplot2,geom_bar)
importFrom(ggplot2,geom_boxplot)
@@ -20,3 +40,4 @@ importFrom(golem,with_golem_options)
importFrom(shiny,NS)
importFrom(shiny,shinyApp)
importFrom(shiny,tagList)
importFrom(stats,sd)
2 changes: 1 addition & 1 deletion R/run_app.R
@@ -33,7 +33,7 @@ run_app <- function(
}

.check_env <- function() {
required <- c("CALYPSO_DATA_PATH", "CALYPSO_DB_NAME", "CALYPSO_DB_OMOP_VERSION")
required <- "CALYPSO_DATA_PATH"
missing <- required[!required %in% names(Sys.getenv())]
if (length(missing) > 0) {
cli::cli_abort("The following environment variables are missing: {.envvar {missing}}")
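The environment check above can be tried in isolation. The following is a minimal sketch in base R, assuming a hypothetical helper name `missing_envvars()` (not part of the PR) that returns the missing names instead of aborting:

```r
# Hypothetical helper mirroring the lookup used in `.check_env()`:
# returns the required environment variables that are not currently set.
missing_envvars <- function(required) {
  required[!required %in% names(Sys.getenv())]
}

# With the variable set, nothing is reported as missing
Sys.setenv(CALYPSO_DATA_PATH = "data/test_data")
missing_envvars("CALYPSO_DATA_PATH")

# After unsetting it, the name is reported
Sys.unsetenv("CALYPSO_DATA_PATH")
missing_envvars("CALYPSO_DATA_PATH")
```

Note that this lookup only tests whether the name is present. On some platforms a variable that is set but empty may still pass, which is the case the `data_dir == ""` guard in `.read_parquet_table()` covers.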
57 changes: 57 additions & 0 deletions R/utils-preprocessing-db.R
@@ -0,0 +1,57 @@
#' Connect to duckdb database
#'
#' @param db_path path to the duckdb database file
#' @param ... unused
#' @param .envir passed on to [`withr::defer()`]
#'
#' @return A [`DBI::DBIConnection-class`] object
#' @export
connect_to_db <- function(db_path, ..., .envir = parent.frame()) {
if (!file.exists(db_path)) {
cli::cli_abort("Database file {.file {db_path}} not found")
}

# Connect to the duckdb test database
con <- DBI::dbConnect(
duckdb::duckdb(dbdir = db_path)
)
withr::defer(DBI::dbDisconnect(con), envir = .envir)
con
}


#' Write data to a table in the database
#'
#' @param data data.frame, data to be written to the table
#' @param con A [`DBI::DBIConnection-class`] object
#' @param table character, name of the table to write to
#' @param schema character, name of the schema to be used
#'
#' @return `TRUE`, invisibly, if the operation was successful
#' @export
write_table <- function(data, con, table, schema) {
DBI::dbWriteTable(
conn = con,
name = DBI::Id(schema = schema, table = table),
value = data,
overwrite = TRUE
)
}


#' Read a parquet table and sort the results
#'
#' @param path path to the parquet file to be read
#' @inheritParams nanoparquet::read_parquet
#'
#' @return A `data.frame` with the results sorted by all columns
#' @export
#' @importFrom dplyr arrange across everything
read_parquet_sorted <- function(path, options = nanoparquet::parquet_options()) {
if (!file.exists(path)) {
cli::cli_abort("File {.file {path}} not found")
}

nanoparquet::read_parquet(path, options) |>
arrange(across(everything()))
}
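A possible round trip for the sorted parquet reader, as a sketch only: the file path and toy data are made up for illustration, and the last step inlines what `read_parquet_sorted()` does rather than calling the package function. Sorting by all columns gives a deterministic row order, which is convenient when comparing tables in tests.

```r
library(dplyr)
library(nanoparquet)

# Illustrative data, written to a temporary parquet file
path <- tempfile(fileext = ".parquet")
unsorted <- data.frame(
  concept_id = c(2L, 1L, 1L),
  date_month = c(1L, 3L, 2L)
)
write_parquet(unsorted, path)

# Read the file back and sort by every column, left to right
sorted <- read_parquet(path) |>
  arrange(across(everything()))
sorted
```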
110 changes: 110 additions & 0 deletions R/utils-preprocessing-summarise.R
@@ -0,0 +1,110 @@
#' Calculate monthly statistics for an OMOP concept
#'
#' @param omop_table A table from the OMOP CDM
#' @param concept The name of the concept column to calculate statistics for
#' @param date The name of the date column to calculate statistics for
#'
#' @return A `data.frame` with the following columns:
#' - `concept_id`: The concept ID
#' - `date_year`: The year of the date
#' - `date_month`: The month of the date
#' - `person_count`: The number of unique patients per concept for each month
#' - `records_per_person`: The average number of records per person per concept for each month
#' @export
#' @importFrom dplyr mutate group_by summarise select n n_distinct collect
calculate_monthly_counts <- function(omop_table, concept, date) {
# Extract year and month from date column
omop_table <- mutate(omop_table,
concept_id = {{ concept }},
date_year = lubridate::year({{ date }}),
date_month = lubridate::month({{ date }})
)

date_year <- date_month <- concept_id <- person_id <- person_count <- records_per_person <- NULL
omop_table |>
group_by(date_year, date_month, concept_id) |>
summarise(
person_count = n_distinct(person_id),
records_per_person = n() / n_distinct(person_id)
) |>
select(
concept_id,
date_year,
date_month,
person_count,
records_per_person
) |>
## Collect in case we're dealing with a database-stored table
collect()
}
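To make the aggregation concrete, here is a small sketch on an in-memory data frame. The column names follow the OMOP CDM `measurement` table, but the values are invented, and the pipeline inlines the logic of `calculate_monthly_counts()` rather than calling it. The `.groups = "drop"` argument is an addition in this sketch to return an ungrouped result.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for an OMOP table (values are illustrative only)
measurement <- data.frame(
  person_id = c(1, 1, 2, 3),
  measurement_concept_id = c(10, 10, 10, 20),
  measurement_date = as.Date(c("2024-01-05", "2024-01-20", "2024-01-07", "2024-02-01"))
)

monthly_counts <- measurement |>
  mutate(
    concept_id = measurement_concept_id,
    date_year = year(measurement_date),
    date_month = month(measurement_date)
  ) |>
  group_by(date_year, date_month, concept_id) |>
  summarise(
    # Unique patients per concept and month
    person_count = n_distinct(person_id),
    # Average number of records per patient per concept and month
    records_per_person = n() / n_distinct(person_id),
    .groups = "drop"
  )
monthly_counts
```

For concept 10 in January 2024 there are three records from two patients, so `person_count` is 2 and `records_per_person` is 1.5. With the package loaded, the equivalent call would presumably be `calculate_monthly_counts(measurement, measurement_concept_id, measurement_date)`.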

#' Calculate summary statistics for an OMOP table
#'
#' Calculates the mean and standard deviation for numeric concepts and the
#' frequency for categorical concepts.
#'
#' @param omop_table A table from the OMOP CDM
#' @param concept_name The name of the concept ID column
#'
#' @return A `data.frame` with the following columns:
#' - `concept_id`: The concept ID
#' - `summary_attribute`: The summary attribute (e.g. "mean", "sd", "frequency")
#' - `value_as_number`: The value of the summary attribute
#' - `value_as_concept_id`: In case of a categorical concept, the concept ID for each category
#' @export
#' @importFrom dplyr all_of rename filter collect bind_rows
calculate_summary_stats <- function(omop_table, concept_name) {
stopifnot(is.character(concept_name))

omop_table <- rename(omop_table, concept_id = all_of(concept_name))

## Avoid "no visible binding" notes
value_as_number <- value_as_concept_id <- NULL

numeric_concepts <- filter(omop_table, !is.na(value_as_number))
# beware CDM docs: NULL=no categorical result, 0=categorical result but no mapping
categorical_concepts <- filter(omop_table, !is.na(value_as_concept_id) & value_as_concept_id != 0)

numeric_stats <- .summarise_numeric_concepts(numeric_concepts) |> collect()
categorical_stats <- .summarise_categorical_concepts(categorical_concepts) |> collect()
bind_rows(numeric_stats, categorical_stats)
}

#' @importFrom dplyr group_by summarise
#' @importFrom stats sd
.summarise_numeric_concepts <- function(omop_table) {
value_as_number <- concept_id <- NULL

# Calculate mean and sd
stats <- omop_table |>
group_by(concept_id) |>
summarise(mean = mean(value_as_number, na.rm = TRUE), sd = sd(value_as_number, na.rm = TRUE))

# Wrangle output to expected format
stats |>
tidyr::pivot_longer(
cols = c(mean, sd),
names_to = "summary_attribute",
values_to = "value_as_number"
)
}

#' @importFrom dplyr count mutate select
.summarise_categorical_concepts <- function(omop_table) {
concept_id <- value_as_concept_id <- summary_attribute <- NULL

# Calculate frequencies
frequencies <- omop_table |>
count(concept_id, value_as_concept_id)

# Wrangle output into the expected format
frequencies |>
mutate(summary_attribute = "frequency") |>
select(
concept_id,
summary_attribute,
value_as_number = n,
value_as_concept_id
)
}
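The output shape of the summary statistics can be illustrated with a sketch on toy data. The table below is invented (column names follow the OMOP CDM `observation` table), and the steps inline the logic of the helpers above instead of calling the package functions.

```r
library(dplyr)
library(tidyr)

# Toy data: concept 1 has numeric results, concept 2 has categorical results
observation <- data.frame(
  observation_concept_id = c(1, 1, 1, 2, 2, 2),
  value_as_number = c(1, 2, 3, NA, NA, NA),
  value_as_concept_id = c(NA, NA, NA, 100, 100, 200)
)
omop <- rename(observation, concept_id = all_of("observation_concept_id"))

# Mean and sd per numeric concept, in long format
numeric_stats <- omop |>
  filter(!is.na(value_as_number)) |>
  group_by(concept_id) |>
  summarise(mean = mean(value_as_number), sd = sd(value_as_number)) |>
  pivot_longer(
    cols = c(mean, sd),
    names_to = "summary_attribute",
    values_to = "value_as_number"
  )

# Frequency of each category per categorical concept
categorical_stats <- omop |>
  filter(!is.na(value_as_concept_id), value_as_concept_id != 0) |>
  count(concept_id, value_as_concept_id) |>
  mutate(summary_attribute = "frequency") |>
  select(concept_id, summary_attribute, value_as_number = n, value_as_concept_id)

summary_stats <- bind_rows(numeric_stats, categorical_stats)
summary_stats
```

The combined table has one row per summary attribute: `mean` and `sd` rows for concept 1 (with `value_as_concept_id` missing) and one `frequency` row per category for concept 2.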
35 changes: 14 additions & 21 deletions R/utils_get_data.R
@@ -6,46 +6,39 @@
get_concepts_table <- function() {
if (golem::app_dev()) {
return(
readr::read_csv(app_sys("test_data", "calypso_concepts.csv"), show_col_types = FALSE)
readr::read_csv(app_sys("dev_data", "calypso_concepts.csv"), show_col_types = FALSE)
)
}
.read_db_table("calypso_concepts")
.read_parquet_table("calypso_concepts")
}

get_monthly_counts <- function() {
if (golem::app_dev()) {
return(
readr::read_csv(app_sys("test_data", "calypso_monthly_counts.csv"), show_col_types = FALSE)
readr::read_csv(app_sys("dev_data", "calypso_monthly_counts.csv"), show_col_types = FALSE)
)
}
.read_db_table("calypso_monthly_counts")
.read_parquet_table("calypso_monthly_counts")
}

get_summary_stats <- function() {
if (golem::app_dev()) {
return(
readr::read_csv(app_sys("test_data", "calypso_summary_stats.csv"), show_col_types = FALSE)
readr::read_csv(app_sys("dev_data", "calypso_summary_stats.csv"), show_col_types = FALSE)
)
}
.read_db_table("calypso_summary_stats")
.read_parquet_table("calypso_summary_stats")
}

.connect_to_db <- function() {
dir <- Sys.getenv("CALYPSO_DATA_PATH")
name <- Sys.getenv("CALYPSO_DB_NAME")
version <- Sys.getenv("CALYPSO_DB_OMOP_VERSION")

db_file <- glue::glue("{dir}/{name}_{version}_1.0.duckdb")
if (!file.exists(db_file)) {
cli::cli_abort("Database file {.file {db_file}} does not exist.")
.read_parquet_table <- function(table_name) {
data_dir <- Sys.getenv("CALYPSO_DATA_PATH")
if (data_dir == "") {
cli::cli_abort("Environment variable {.envvar CALYPSO_DATA_PATH} not set")
}
if (!dir.exists(data_dir)) {
cli::cli_abort("Data directory {.file {data_dir}} not found")
}

# Connect to the duckdb database
DBI::dbConnect(duckdb::duckdb(dbdir = db_file))
}

.read_db_table <- function(table_name) {
con <- .connect_to_db()
withr::defer(DBI::dbDisconnect(con))
DBI::dbReadTable(con, table_name)
nanoparquet::read_parquet(glue::glue("{data_dir}/{table_name}.parquet"))
}
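The path resolution in `.read_parquet_table()` can be exercised without any parquet files. This is a base R sketch with a hypothetical helper name, `resolve_parquet_path()`, which is not part of the PR and uses `stop()` in place of `cli::cli_abort()`:

```r
# Hypothetical helper: resolve the parquet file for a table from
# CALYPSO_DATA_PATH, failing early if the variable or directory is missing.
resolve_parquet_path <- function(table_name) {
  data_dir <- Sys.getenv("CALYPSO_DATA_PATH")
  if (data_dir == "") {
    stop("Environment variable CALYPSO_DATA_PATH not set")
  }
  if (!dir.exists(data_dir)) {
    stop("Data directory not found: ", data_dir)
  }
  file.path(data_dir, paste0(table_name, ".parquet"))
}

# Point the app at a directory that exists (a temp dir, for illustration)
Sys.setenv(CALYPSO_DATA_PATH = tempdir())
resolve_parquet_path("calypso_concepts")
```

With this layout, swapping between test and production data only requires changing `CALYPSO_DATA_PATH`, since the table file names stay the same.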
34 changes: 23 additions & 11 deletions README.md
@@ -45,11 +45,10 @@ as it has good support for R package development and Shiny.
install.packages("renv")
renv::restore()
```
3. Create the [duckdb](https://github.com/duckdb/duckdb) test database and run the analyses by running from an R console in the project directory (test dataset properties can be updated in the [`.Rprofile`](https://github.com/SAFEHR-data/omop-data-catalogue/blob/main/.Rprofile) file):
3. Create the [duckdb](https://github.com/duckdb/duckdb) test database and run the analyses by running from an R console in the project directory:

```r
source(here::here("dev/test_db/setup_test_db.R"))
source(here::here("dev/omop_analyses/analyse_omop_cdm.R"))
source(here::here("scripts/create_dev_data.R"))
```

4. To preview the app locally, run the following from an R console within the project directory:
@@ -60,16 +59,29 @@ as it has good support for R package development and Shiny.

The `dev/02_dev.R` script contains a few helper functions to get you started.

Calypso test data can be found in [`inst/test_data`](https://github.com/SAFEHR-data/omop-data-catalogue/tree/main/inst/data). These data have been generated by using the synthetic dataset '[synthea-allergies-10k](https://darwin-eu.github.io/CDMConnector/reference/eunomiaDir.html)', and adding some [dummy data](https://github.com/SAFEHR-data/omop-data-catalogue/tree/main/dev/test_db/dummy) for the MEASUREMENT and OBSERVATION tables (to have some records in the 'calypso-summary-stats' table).
The test data can be found in [`inst/dev_data`](https://github.com/SAFEHR-data/omop-data-catalogue/tree/main/inst/data). These data have been generated by using the synthetic dataset '[synthea-allergies-10k](https://darwin-eu.github.io/CDMConnector/reference/eunomiaDir.html)', and adding some [dummy data](https://github.com/SAFEHR-data/omop-data-catalogue/tree/main/dev/test_db/dummy) for the MEASUREMENT and OBSERVATION tables (to have some records in the 'calypso-summary-stats' table).

If you want to recreate a test dataset, you can run the following R scripts:

```r
source(here::here("dev/test_db/setup_test_db.R"))
source(here::here("dev/test_db/insert_dummy_tables.R"))
source(here::here("dev/omop_analyses/analyse_omop_cdm.R"))
source(here::here("dev/test_db/produce_test_data.R"))
```
### File structure

This repo is organised as an R package with a few additional directories used for deployment of the
Shiny app:

- `R/`: contains the R source code for the package
- `inst/`: configuration files and dummy data for the app
- `dev_data/`: dummy data for the app to use during development
- `app/www`: static files (e.g. CSS, JavaScript) for the app
- `man/`: documentation files for the package, generated by `{roxygen2}`
- `tests/`: unit tests for the package, written with `{testthat}`

The following directories are _not_ included in the package (i.e. they are listed in `.Rbuildignore`) but are used for deployment and data pre-processing:

- `data-raw/test_db`: the source data for generating the test data
- `data/test_data`: test data parquet files mimicking what real data would look like to run the app in production
- `dev/`: contains scripts and helper functions for development
- `deploy/`: contains Docker files and scripts for deployment
- `renv/`: contains the `renv` library, managed by `{renv}`
- `scripts/`: contains scripts for data pre-processing and generating the test and dev data

### Updating the `renv` lockfile

File renamed without changes.
1 change: 1 addition & 0 deletions data/prod_data/.gitignore
@@ -0,0 +1 @@
*.parquet
Binary file added data/test_data/calypso_concepts.parquet
Binary file not shown.
Binary file added data/test_data/calypso_monthly_counts.parquet
Binary file not shown.
Binary file added data/test_data/calypso_summary_stats.parquet
Binary file not shown.
5 changes: 1 addition & 4 deletions deploy/.env.sample
@@ -1,4 +1 @@
GOLEM_CONFIG_ACTIVE=production # production or dev
CALYPSO_DATA_PATH=dev/test_db/eunomia
CALYPSO_DB_NAME=synthea-allergies-10k
CALYPSO_DB_OMOP_VERSION=5.3
CALYPSO_DATA_PATH=data/test_data
Binary file modified deploy/calypso_0.0.0.9000.tar.gz
Binary file not shown.