[FEATURE] extract_tibble should allow users to join specified tables #111

Open
5 tasks done
rsh52 opened this issue Dec 12, 2022 · 3 comments
Labels: backlog (not to be worked on now), enhancement (new feature or request)

Comments

rsh52 (Collaborator) commented Dec 12, 2022

Feature Request Description

In addition to extracting selected tibbles, extract_tibbles should give users the option to join them into a single tibble instead of returning a list. In recent projects, the next logical step after calling extract_tibbles has often been a join.

Proposed Solution

Prototyped logic is available in our internal Prodigy Reporter. The new argument (suggested: join_tibbles = TRUE/FALSE) should kick off the join operations. Since we abstract some column names, e.g. form_status_complete, we need to account for column names that are duplicated across the tibbles.

# Load Libraries ===============================================================
library(REDCapTidieR)
library(tidyverse)
library(tidyselect)
library(rlang)

# tibble List Selection Function ===============================================
tibble_list_select <- function(supertibble, tbls) {
  tbls <- eval_select(data = supertibble, expr = enquo(tbls))
  supertibble[tbls]
}

# Join Operation ===============================================================
join_tibbles <- function(extracted_tibbles, record_id) {
  # First: compile all names related to tibbles
  # Second: Identify names that exist in multiple tibbles (not record_id)
  # Third: Append identified names with name of the tibble they belong to
  
  duplicate_colnames <- extracted_tibbles %>%
    map(names) %>%
    unlist() %>%
    tibble(name = .) %>%
    count(name) %>%
    # Don't append the table name to the primary key (the column named by record_id)
    filter(n > 1 & name != record_id) %>% # <-- Need to functionally call out record_id in case of name change -->
    pull(name)
  
  extracted_tibbles <- map2(
    extracted_tibbles,
    names(extracted_tibbles),
    .f = function(df, df_name) {
      # [duplicate_col] -> [duplicate_col].[table_name]
      rename_with(
        df,
        .cols = any_of(duplicate_colnames),
        .fn = function(col) paste0(col, ".", df_name)
      )
    }
  )
  
  # Chain left joins across all tibbles with reduce(), keyed on record_id
  out <- reduce(
    extracted_tibbles,
    dplyr::left_join,
    by = record_id # <-- Need to functionally update this -->
  )
  
  out
}

Here's how I envision this being used. The helper functions above are external here, but imagine them as internal to extract_tibbles instead:

# Example ======================================================================
redcap_uri <- Sys.getenv("REDCAP_URI")
token <- Sys.getenv("REDCAPTIDIER_CLASSIC_API")

supertibble <- read_redcap(redcap_uri, token)

extracted_tibbles <- supertibble |>
  extract_tibbles() 

extracted_tibbles |> 
  tibble_list_select(tbls = c(contains("nonrepeat"), repeated)) |>
  join_tibbles(record_id = "record_id")

You should be able to copy and paste all of this into a script and use REDCapTidieR 0.2.0 to view the proposed output. Open to suggestions on naming conventions for identified duplicate columns (currently [duplicate_col].[table_name]).
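For anyone without a REDCap token handy, here is a minimal sketch of the same renaming and joining logic on toy data. The table names (demographics, labs) and their columns are made up for illustration and only stand in for what extract_tibbles() might return. The sketch is a condensed variant of the prototype above, not the prototype itself.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy stand-ins for extract_tibbles() output (hypothetical tables and columns)
tbls <- list(
  demographics = tibble(
    record_id = 1:2,
    age = c(34, 51),
    form_status_complete = c("Complete", "Incomplete")
  ),
  labs = tibble(
    record_id = 1:2,
    hgb = c(13.1, 11.8),
    form_status_complete = c("Complete", "Complete")
  )
)

# Column names that occur in more than one tibble, excluding the join key
all_names <- unlist(map(tbls, names), use.names = FALSE)
dup_names <- setdiff(unique(all_names[duplicated(all_names)]), "record_id")

# Suffix each duplicated column with its table name, then chain left joins
joined <- tbls |>
  imap(\(df, nm) rename_with(df, \(col) paste0(col, ".", nm), .cols = any_of(dup_names))) |>
  reduce(left_join, by = "record_id")

names(joined)
```

With this input, the shared form_status_complete column ends up as form_status_complete.demographics and form_status_complete.labs, while record_id, age, and hgb keep their names.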

Checklist

  • The issue is atomic
  • The issue description is documented
  • The issue title describes the problem succinctly
  • Developers are assigned to the issue
  • Labels are assigned to the issue
@rsh52 rsh52 added the enhancement New feature or request label Dec 12, 2022
@rsh52 rsh52 self-assigned this Dec 12, 2022
rsh52 (Collaborator, Author) commented Dec 12, 2022

@skadauke @ezraporter tagged for posterity (and critiques) ✨

@rsh52 rsh52 added this to the 0.3 milestone Dec 12, 2022
@rsh52 rsh52 added the backlog not to be worked on now label Dec 13, 2022
skadauke (Collaborator) commented Dec 14, 2022

Discussed an alternative, higher-level API for this using the existing extract_tibble() function. The following would return a single tibble with demographics and disease_response instruments joined together appropriately.

supertbl |>
  extract_tibble(demographics, disease_response)

One question is what "appropriately" means. Another is how to make this syntax concise and expressive without limiting flexibility. We will see use cases for table joins during development of the Prodigy Reporter and aim to implement a solution with 0.3 in a few months.
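One way to see why "appropriately" is an open question is a toy sketch, assuming a made-up instance column that identifies each repetition of a repeating instrument (the tables and column names below are hypothetical, not REDCapTidieR output). Joining a nonrepeating table to a repeating one by record_id is unambiguous, but joining two repeating tables by record_id alone pairs every instance of one with every instance of the other.

```r
library(dplyr)
library(tibble)

# Hypothetical nonrepeating instrument: one row per record
demographics <- tibble(record_id = c(1, 2), age = c(34, 51))

# Hypothetical repeating instruments: one row per record per instance
visits <- tibble(record_id = c(1, 1, 2), instance = c(1, 2, 1), weight = c(70, 72, 85))
meds   <- tibble(record_id = c(1, 1), instance = c(1, 2), drug = c("A", "B"))

# Nonrepeating + repeating: demographics values repeat on each visit row
demo_visits <- left_join(demographics, visits, by = "record_id")

# Repeating + repeating by record_id only: record 1 gets 2 x 2 rows.
# Recent dplyr versions warn about the many-to-many match; suppressed here.
visits_meds <- suppressWarnings(left_join(visits, meds, by = "record_id"))

nrow(demo_visits)
nrow(visits_meds)
```

Whether that second result is what a user wants, or whether instances should be matched up (or the join refused), is the kind of decision a higher-level extract_tibble() would have to make.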

skadauke (Collaborator) commented Dec 16, 2022

I had one more thought, not sure if it's possible or even a good idea. What if

supertbl |>
  extract_tibble(everything())

returns a tibble that's (mostly) the same as the block matrix? The use case here might be that people could make changes inside the supertibble and then send those changes back to the REDCap instance. I know I said we don't want to touch writing, but it's a thought. It could also guide how we plan the structure of a table in which nonrepeating and repeating instruments are combined.
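As a rough sketch of what that shape could look like, assuming "block matrix" is read as a sparse layout with one row per record or instrument instance and NA wherever a field belongs to a different instrument: stacking toy tibbles with bind_rows() gives something similar. The tables and the instance column are hypothetical, and this is not a claim about how extract_tibble() would do it.

```r
library(dplyr)
library(tibble)

# Hypothetical instruments: one nonrepeating, one repeating
tbls <- list(
  demographics = tibble(record_id = c(1, 2), age = c(34, 51)),
  visits = tibble(record_id = c(1, 1, 2), instance = c(1, 2, 1), weight = c(70, 72, 85))
)

# Stack the tables; fields from other instruments are filled with NA,
# and the .id column records which instrument each row came from
block <- bind_rows(tbls, .id = "instrument") |>
  arrange(record_id, instrument, instance)

block
```

Each demographics row has NA for instance and weight, and each visits row has NA for age, which is the block-like pattern. A real implementation would also need to decide how to handle columns that share a name across instruments.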

@skadauke skadauke changed the title from "[FEATURE] extract_tibbles should allow users to join specified tables" to "[FEATURE] extract_tibble should allow users to join specified tables" on Dec 16, 2022
@rsh52 rsh52 modified the milestones: 0.3, 0.4 Feb 20, 2023
@rsh52 rsh52 removed this from the 0.4 milestone May 23, 2023