Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge changes in Rc/v2.1.0 to main branch #7

Open
wants to merge 37 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
532aaf5
Refactor code in the R/ folder
mingstat Oct 17, 2024
ff5d202
Update examples
mingstat Oct 18, 2024
432e835
Print the directory path and file names
mingstat Oct 18, 2024
9ec6f69
Add an arugment to print file paths if requested
mingstat Oct 21, 2024
bda40b4
Do not export get_base_dir()
mingstat Oct 21, 2024
b155b1a
Add get_nfs_path() function back
mingstat Oct 21, 2024
f7601bf
Keep the metadata format unchanged
mingstat Oct 21, 2024
6570461
Update README
mingstat Oct 22, 2024
488d679
Update package vignettes
mingstat Oct 22, 2024
1f256bc
Add sample data for docs and tests
mingstat Oct 22, 2024
088feb5
Update Rd files
mingstat Oct 23, 2024
30619ab
Update tests
mingstat Oct 23, 2024
8298ec2
Update test setup file
mingstat Oct 23, 2024
9cb51ac
Update integration guide
mingstat Oct 23, 2024
01de803
Make get_base_dir internal function
mingstat Oct 23, 2024
b111f9a
Export get_file_paths and load_data_files
mingstat Oct 23, 2024
a480124
Update DESCRIPTION file
mingstat Oct 23, 2024
8722c94
Update changelog
mingstat Oct 23, 2024
8bcefd3
Update roxygen examples
mingstat Oct 23, 2024
5b719dd
Fix lintr issues
mingstat Oct 23, 2024
62545e9
Fix styler issues
mingstat Oct 23, 2024
2bbf397
Try to identify issue from tests
mingstat Oct 23, 2024
a02950e
Use pattern matching to find files when no file extension is provided
mingstat Oct 23, 2024
51ba3ce
Fix styler issues
mingstat Oct 23, 2024
6e4a18a
Save demo data to a temp dir
mingstat Nov 1, 2024
89413a7
Remove pharmaverseadam data from package
mingstat Nov 1, 2024
4665a23
Remove comments inside functions
mingstat Nov 6, 2024
19ccc3d
Remvoe get_base_dir() and keep get_nfs_path()
mingstat Nov 6, 2024
d164c10
Update function docs
mingstat Nov 6, 2024
912957f
Update examples in README
mingstat Nov 6, 2024
52d058b
Update example in vignettes
mingstat Nov 6, 2024
c06a26a
Update R document files
mingstat Nov 6, 2024
b8d3bc3
Fix styler issues
mingstat Nov 6, 2024
abe10d6
Check argument print_file_paths
mingstat Nov 13, 2024
cf48a94
Remove unused code
mingstat Nov 13, 2024
6dc6e0e
Updated changelog
mingstat Nov 13, 2024
a14c500
Update Rd file
mingstat Nov 13, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 6 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Package: dv.loader
Type: Package
Title: Data loading module
Version: 2.0.0
Version: 2.1.0
Authors@R: c(
person( "Boehringer-Ingelheim Pharma GmbH & Co.KG", role = c("cph", "fnd")),
person( given = "Ming", family = "Yang", role = c("aut", "cre"), email = "[email protected]"),
Expand All @@ -13,10 +13,13 @@ License: Apache License (>= 2)
Encoding: UTF-8
LazyData: true
Depends: R (>= 3.5.0)
Imports: haven
Imports:
checkmate,
haven
Suggests:
testthat,
testthat (>= 3.0.0),
knitr,
rmarkdown
RoxygenNote: 7.3.0
VignetteBuilder: knitr
Config/testthat/edition: 3
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# Generated by roxygen2: do not edit by hand

export(get_cre_path)
export(get_file_paths)
export(get_nfs_path)
export(load_data)
export(load_data_files)
8 changes: 8 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,11 @@
# dv.loader 2.1.0

- Refactored code to improve readability and maintainability.

- Fixed issue of partial matching when the `file_names` argument contains no file extensions.
ml-ebs-ext marked this conversation as resolved.
Show resolved Hide resolved

- Added arguments `env_var` and `print_file_paths` in `load_data()` function to provide more flexibility and control.

# dv.loader 2.0.0

- GitHub release with QC report
Expand Down
115 changes: 73 additions & 42 deletions R/dvloader.R
Original file line number Diff line number Diff line change
@@ -1,61 +1,92 @@
#' gets the NFS base path from an env var
#' It assumes there is an env var
#' called RXD_DATA which holds the path suffix.
#' @return the NFS base path
#' Get Base Path from an Environment Variable
#'
#' This function assumes that there is an environment variable called `RXD_DATA`
#' which is set to the base path of the data directory.
#'
#' @return [character(1)] The normalized base path.
#'
#' @export
get_nfs_path <- function() {
base_path <- Sys.getenv("RXD_DATA")
# check that RXD_DATA is set

if (base_path == "") {
stop("Usage: get_nfs_path: RXD_DATA must be set")
stop("Environment variable RXD_DATA must be set")
}
return(base_path)

checkmate::assert_directory_exists(base_path)

return(normalizePath(base_path))
}

#' gets the NFS base path from an env var
#' alias for get_nfs_path to maintain backwards compatibility
#' Get Base Path from an Environment Variable
#'
#' This function is an alias for `get_nfs_path()` to maintain backwards compatibility.
#'
#' @return [character(1)] The normalized base path.
#'
#' @export
get_cre_path <- get_nfs_path

#' Loads data into memory based on study directory and one or more file_names.
#' @param sub_dir A relative directory/folder that will be appended to a base path defined by `Sys.getenv("RXD_DATA")`.
#' If the argument is left as NULL, the function will load data from the working directory `getwd()`.
#' @param file_names Study file or file_names name(s) - can be a vector of strings.
#' This is the only required argument.
#' @param use_wd for "use working directory" - a flag used when importing local files
#' not on NFS - default value is FALSE
#' @param prefer_sas if set to TRUE, imports sas7bdat files first before looking for
#' RDS files (the opposite of default behavior)
#' @return a list of dataframes
#' @export

#' Load Data Files
#'
#' This function loads data files from a specified directory or the current working directory.
#' It supports loading both RDS and SAS7BDAT files.
#'
#' @param sub_dir [character(1)] Optional character string specifying a subdirectory. Default is NULL.
#' @param file_names [character(1+)] Character vector of file names to load (without extension).
#' @param use_wd [logical(1)] Logical indicating whether to use the current working directory. Default is FALSE.
#' @param prefer_sas [logical(1)] Logical indicating whether to prefer SAS7BDAT files over RDS. Default is FALSE.
#' @param print_file_paths [logical(1)] Logical indicating whether to print the directory path and file names.
#' Default is FALSE.
#'
#' @return A named list of data frames, where each name corresponds to a loaded file.
#'
#' @examples
#' \dontrun{
#' test_data_path <- "../inst/extdata/"
#' data_list <- load_data(
#' sub_dir = test_data_path,
#' file_names = "dummyads2",
#' use_wd = TRUE
#' )
#' }
load_data <- function(sub_dir = NULL, file_names, use_wd = FALSE, prefer_sas = FALSE) {
if (is.null(file_names)) {
stop("Usage: load_data: file_names: Must supply at least one file name")
#' # Get the current value of the RXD_DATA environment variable
#' base_dir <- Sys.getenv("RXD_DATA")
#'
#' # Set the RXD_DATA environment variable to the path of the haven package
#' Sys.setenv(RXD_DATA = find.package("haven"))
#'
#' data_list <- load_data(sub_dir = "examples", file_names = c("iris.sas7bdat"))
#' str(data_list)
#'
#' # Reset the RXD_DATA environment variable to its original value
#' Sys.setenv(RXD_DATA = base_dir)
#'
#' @export
load_data <- function(
sub_dir = NULL,
file_names,
use_wd = FALSE,
prefer_sas = FALSE,
print_file_paths = FALSE) {
checkmate::assert_character(sub_dir, len = 1, null.ok = TRUE)
checkmate::assert_character(file_names, min.len = 1)
checkmate::assert_logical(use_wd, len = 1)
checkmate::assert_logical(prefer_sas, len = 1)

if (use_wd) {
base_dir <- getwd()
} else {
base_dir <- get_nfs_path()
}

study_path <- "" # will be built using args
dir_path <- if (is.null(sub_dir)) base_dir else file.path(base_dir, sub_dir)

if (is.null(sub_dir)) {
study_path <- getwd()
} else {
if (use_wd) {
study_path <- file.path(getwd(), sub_dir)
} else {
study_path <- file.path(get_cre_path(), sub_dir)
}
file_ext <- if (prefer_sas) "sas7bdat" else "rds"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is not used anywhere. I think it can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I've removed the unused file_ext variable since it's now handled in the get_file_paths() function.

file_paths <- get_file_paths(dir_path = dir_path, file_names = file_names, prefer_sas = prefer_sas)

if (isTRUE(print_file_paths)) {
cat("Loading data from", dir_path, "\n")
cat("Loading data file(s):", basename(file_paths), "\n")
}

# create the output
data_list <- create_data_list(study_path, file_names, prefer_sas) # nolint
data_list <- load_data_files(file_paths)

names(data_list) <- file_names

return(data_list)
}
181 changes: 115 additions & 66 deletions R/utils.R
Original file line number Diff line number Diff line change
@@ -1,81 +1,130 @@
#' For each file name provided, reads in the first matching file and its meta data/attributes.
#' Preference is given to RDS because its faster
#' @param file_path the folder where the files are
#' @param file_names CDISC names for the files
#' @param prefer_sas if TRUE, imports .sas7bdat files first instead of .RDS files
#' @return returns a list of dataframes with metadata as an attribute on each dataframe
create_data_list <- function(file_path, file_names, prefer_sas) {
data_list <- lapply(file_names, function(x) {
extensions <- c("", ".rds", ".sas7bdat")
if (prefer_sas) {
extensions <- c("", ".sas7bdat", ".rds")
}

file_name_to_load <- NULL

candidates <- list.files(file_path)
uppercase_candidates <- Map(toupper, candidates)

for (ext in extensions) {
# Case insensitive file name match
uppercase_file_name <- toupper(paste0(x, ext))

match_count <- sum(uppercase_candidates == uppercase_file_name)
if (match_count > 1) {
stop(paste("create_data_list(): More than one case-insensitive file name match for", file_path, x))
#' Get File Paths
#'
#' This function constructs file paths for given file names, handling both RDS and SAS7BDAT files.
#' It can prioritize SAS files over RDS files based on the `prefer_sas` parameter.
#'
#' @param dir_path [character(1)] The directory path where the files are located.
#' @param file_names [character(1+)] A vector of file names to process.
#' @param prefer_sas [logical(1)] Whether to prefer SAS files over RDS files. Default is FALSE.
#'
#' @return [character] A vector of normalized file paths.
#'
#' @examples
#' \dontrun{
#' temp_dir <- tempdir()
#'
#' file_names <- c("adsl", "adae")
#'
#' file.create(file.path(temp_dir, paste0(file_names, ".rds")))
#' file.create(file.path(temp_dir, paste0(file_names, ".sas7bdat")))
#'
#' list.files(temp_dir)
#'
#' get_file_paths(dir_path = temp_dir, file_names = file_names)
#' get_file_paths(dir_path = temp_dir, file_names = file_names, prefer_sas = TRUE)
#'
#' unlink(temp_dir, recursive = TRUE)
#' }
#'
#' @export
get_file_paths <- function(dir_path, file_names, prefer_sas = FALSE) {
checkmate::assert_character(dir_path, len = 1)
checkmate::assert_character(file_names, min.len = 1)
checkmate::assert_logical(prefer_sas, len = 1)

file_paths <- lapply(file_names, function(file_name) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of new code here and I haven't reviewed it closely. The difference that strikes me the most is that the old toupper case-insensitive match behavior is gone. I imagine this can have an impact under Windows. Since load_data has been rewritten to list files through this function, we need a good reason to deviate from the old behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code follows the case-sensitive behavior of readRDS() and haven::read_sas() to avoid ambiguity and risk of matching the wrong file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code belongs to the create_data_list currently on the main branch:

# Case insensitive file name match
uppercase_file_name <- toupper(paste0(x, ext))

match_count <- sum(uppercase_candidates == uppercase_file_name)
if (match_count > 1) {
  stop(paste("create_data_list(): More than one case-insensitive file name match for", file_path, x))
}

It is there to warn against an edge-case scenario in which a folder contains two files that share the same name but differ in case. That is not a problem under linux, but we still want to warn users against that situation, because running the same code with the same data files under case-insensitive windows file systems could lead to the loading of different files.

This check is no longer in the rewritten dv.loader and it should be, unless the team decides otherwise.

My suggestion here would be to take the original logic of create_data_list and adapt it minimally to follow the old filename-matching logic, so that we don't throw away useful behavior on a rewrite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for raising this important point about case-sensitivity. I agree we should discuss with the team to determine if we want to to make any changes to the current behavior.

file_path <- file.path(dir_path, file_name)
file_ext <- tools::file_ext(file_name)

if (file_ext == "") {
candidates <- basename(list.files(dir_path))

rds_match <- grep(
pattern = paste0("^", file_name, "\\.rds$"),
x = candidates,
ignore.case = TRUE,
value = TRUE
)

sas_match <- grep(
pattern = paste0("^", file_name, "\\.sas7bdat$"),
x = candidates,
ignore.case = TRUE,
value = TRUE
)

if (isTRUE(prefer_sas)) {
if (length(sas_match) > 0) {
return(file.path(dir_path, sas_match[1]))
} else if (length(rds_match) > 0) {
return(file.path(dir_path, rds_match[1]))
} else {
stop(dir_path, " does not contain SAS or RDS file: ", file_name)
}
} else if (isFALSE(prefer_sas)) {
if (length(rds_match) > 0) {
return(file.path(dir_path, rds_match[1]))
} else if (length(sas_match) > 0) {
return(file.path(dir_path, sas_match[1]))
} else {
stop(dir_path, " does not contain RDS or SAS file: ", file_name)
}
}

index <- match(uppercase_file_name, uppercase_candidates)
if (!is.na(index)) {
file_name_to_load <- candidates[[index]]
break
} else {
if (file.exists(file_path)) {
return(file_path)
} else {
stop(dir_path, " does not contain: ", file_name)
}
}

if (is.null(file_name_to_load)) {
stop(paste("create_data_list(): No RDS or SAS files found for", file_path, x))
}

output <- read_file(file_path, file_name_to_load)

return(output)
})

names(data_list) <- file_names

return(data_list)
return(normalizePath(unlist(file_paths)))
}


#' Reads RDS/SAS file and metadatas from first 6 items from file.info() its file path
#' @param file_path a path to a file
#' @param file_name name of a file
#' @return a data object with an extra attribute of metadata
read_file <- function(file_path, file_name) {
ext <- tools::file_ext(file_name)

if (!(toupper(ext) %in% c("RDS", "SAS7BDAT"))) {
stop("Usage error: read_file: file_name: file must either be RDS or SAS7BDAT.")
}
#' Load Data Files
#'
#' This function reads data from multiple file paths and returns a list of data frames.
#' It supports reading RDS and SAS7BDAT files.
#'
#' @param file_paths [character(1+)] A vector of file paths to read.
#'
#' @return [list] A named list of data frames, where each name is the basename of the corresponding file path.
#'
#' @examples
#' path <- system.file("examples", "iris.sas7bdat", package = "haven")
#' data_list <- load_data_files(file_paths = path)
#' str(data_list)
#'
#' @export
load_data_files <- function(file_paths) {
checkmate::assert_character(file_paths, min.len = 1)
checkmate::assert_file_exists(file_paths)

data_list <- lapply(file_paths, function(file_path) {
extension <- tools::file_ext(file_path)

if (tolower(extension) == "rds") {
data <- readRDS(file_path)
} else if (tolower(extension) == "sas7bdat") {
data <- haven::read_sas(file_path)
} else {
stop("Unsupported file extension: ", extension)
}

is_rds <- toupper(ext) == "RDS"
meta <- file.info(file_path, extra_cols = FALSE)
meta[["path"]] <- file_path
meta[["file_name"]] <- basename(file_path)

file <- file.path(file_path, file_name)
file_name <- tools::file_path_sans_ext(file_name)
rownames(data) <- NULL
attr(data, "meta") <- meta

# grab file info
meta <- file.info(file)[1L:6L]
meta[["path"]] <- row.names(meta)
meta[["file_name"]] <- file_name
meta <- data.frame(meta, stringsAsFactors = FALSE)
row.names(meta) <- NULL
return(data)
})

if (is_rds) {
out <- readRDS(file)
} else {
out <- haven::read_sas(file)
}
attr(out, "meta") <- meta
names(data_list) <- basename(file_paths)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whole paths would be better, because file_paths can point to different folders with files that share the same name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of base file names as list names is maintained to ensure backward compatibility with existing code that relies on the legacy load_data() function. For users who need the full file paths, this information is stored in the metadata attributes of each data frame in the returned list.

Copy link

@ml-ebs-ext ml-ebs-ext Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users of legacy load_data() retain the old behavior because of this statement that happens at the end of that function:

names(data_list) <- file_names

Users of load_data_files() would benefit from seeing the exact path they provided as names of the output list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See discussion in internal chat about possibility of removing both the paths and the extensions as well as preventing repeat entries in the resulting list. The

names(data_list) <- file_names

at the end of load_data() should make that function immune to implementing this change.

return(out)
return(data_list)
}
Loading
Loading