From 75346ccf84343814753430be37e5a692f35284e3 Mon Sep 17 00:00:00 2001 From: Anders Aasted Isaksen <67263135+Aastedet@users.noreply.github.com> Date: Thu, 19 Dec 2024 15:53:58 +0100 Subject: [PATCH] docs: :memo: expand on inclusions and exclusions (#133) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes #130 Closes #140 --------- Co-authored-by: Anders Aasted Isaksen Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com> Co-authored-by: Luke W. Johnston Co-authored-by: Luke W. Johnston --- vignettes/articles/function-flow.Rmd | 470 ++++++++++++++++++++++----- 1 file changed, 384 insertions(+), 86 deletions(-) diff --git a/vignettes/articles/function-flow.Rmd b/vignettes/articles/function-flow.Rmd index cdd1028..b2565dc 100644 --- a/vignettes/articles/function-flow.Rmd +++ b/vignettes/articles/function-flow.Rmd @@ -43,113 +43,318 @@ library(dplyr) library(osdc) ``` -#### HbA1c tests above the diagnosis cut-off value (48 mmol/mol or 6.5%) - -The function `include_hba1c()` uses `lab_forsker` as the input data to -extract all events of HbA1c tests above the diagnosis cut-off value. - -Since the HbA1c diagnosis cut-off value depends on the kind of test that -is used, the inclusion event is defined as follows: - -- For HbA1c IFCC (NPU03835), we include values \>= 6.5 %. -- For HbA1c DCCT (NPU27300), we include values \>= 48 mmol/mol. - -```{r, echo=FALSE} -algorithm |> - filter(name == "hba1c") |> - knitr::kable(caption = "Algorithm used in the implementation for including HbA1c.") -``` - -#### Hospital diagnosis of diabetes +#### Hospital diagnoses + +#### Joining LPR2 and LPR3 data + +The helper functions `join_lpr2()` and `join_lpr3()` join records of +diagnoses to administrative information in LPR2-formatted and +LPR3-formatted data, respectively. 
+ +`join_lpr2()` takes `lpr_diag` and `lpr_adm` as inputs, filters to the +necessary diagnoses (`c_diag` starting with "DO0[0-6]", "DO8[0-4]", +"DZ3[37]", "DE1[0-4]", "249", or "250"), joins the required information +by record number (`recnum`), and outputs a `data.frame` with the +following variables: + +- `pnr`: identifier variable +- `date`: date of the recorded diagnosis (renamed from `d_inddto`) +- `specialty`: department specialty (renamed from `c_spec`) +- `diagnosis_code`: diagnosis code (renamed from `c_diag`) +- `diagnosis_type`: diagnosis type (renamed from `c_diagtype`) + +`join_lpr3()` takes `diagnoser` and `kontakter` as inputs, filters to +the necessary diagnoses (`diagnosekode` starting with "DO0[0-6]", +"DO8[0-4]", "DZ3[37]" or "DE1[0-4]"), joins the required information by +record number (`dw_ek_kontakt`), and outputs a `data.frame` with the +following variables: + +- `pnr`: identifier variable (renamed from `cpr`) +- `date`: date of the recorded diagnosis (renamed from `dato_start`) +- `specialty`: department specialty (renamed from `hovedspeciale_ans`) +- `diagnosis_code`: diagnosis code (renamed from `diagnosekode`) +- `diagnosis_type`: diagnosis type (renamed from `diagnosetype`) +- `diagnosis_retracted`: if the diagnosis was later retracted (renamed + from `senere_afkraeftet`) + +These outputs are passed to `include_diabetes_diagnoses()` (and to +`get_pregnancy_dates()`, see exclusion events) for further processing +below. + +#### Processing of diabetes diagnoses The function `include_diabetes_diagnoses()` uses the hospital contacts -from LPR2 and 3 to include all dates of diabetes diagnoses. Diabetes -diagnoses from both ICD 8 and ICD 10 are included. - -This function contains two helper functions: - -- `keep_diabetes_icd10()` -- `keep_diabetes_icd8()` - - - - +from LPR2 and LPR3 to include all dates of diabetes diagnoses to use for +inclusion, as well as additional information needed to classify diabetes +type. 
Diabetes diagnoses from both ICD-8 and ICD-10 are included. + +The function takes the outputs of `join_lpr2()` and `join_lpr3()` as +inputs and processes each input separately to generate the following +internal variables: + +- From `join_lpr2()`: + - `pnr`: identifier variable + - `date`: dates of all included diabetes diagnoses: + - registered as primary (A) or secondary (B) diagnoses, regardless + of type or department: + - Keep rows where `diagnosis` starts with "DE1[0-4]", "249" or + "250", and `diagnosis_type` is either "A" or "B" + - `is_primary`: Define whether the diagnosis was a primary + diagnosis (`diagnosis_type` == "A") + - `is_t1d`: Define whether the diagnosis was T1D-specific + (`diagnosis` starts with "DE10" or "249") + - `is_t2d`: Define whether the diagnosis was T2D-specific + (`diagnosis` starts with "DE11" or "250") + - `department`: Define whether the diagnosis was made by an + endocrinological (if `specialty` == 8 then `department` == + "endocrinology") or other medical department (if `specialty` \< + 8 or 9-30 then `department` == "other medical") +- From `join_lpr3()`: + - `pnr`: identifier variable + - `date`: dates of all included diabetes diagnoses: + - registered as primary (A) or secondary (B) diagnoses, regardless + of type or department, but exclude retracted diagnoses: + - Keep rows where `diagnosis` starts with "DE1[0-4]", + `diagnosis_type` is either "A" or "B" and + `diagnosis_retracted` == "Nej" + - `is_primary`: Define whether the diagnosis was a primary + diagnosis (`diagnosis_type` == "A") + - `is_t1d`: Define whether the diagnosis was T1D-specific + (`diagnosis` starts with "DE10") + - `is_t2d`: Define whether the diagnosis was T2D-specific + (`diagnosis` starts with "DE11") + - `department`: Define whether the diagnosis was made by an + endocrinological department (if `specialty` == "medicinsk + endokrinologi" then `department` == "endocrinology") or other + medical department (if `specialty` is any of "Blandet 
medicin og + kirurgi", "Intern medicin", "Geriatri", "Hepatologi", + "Hæmatologi", "Infektionsmedicin", "Kardiologi", "Medicinsk + allergologi", "Medicinsk gastroenterologi", "Medicinsk + lungesygdomme", "Nefrologi", "Reumatologi", "Palliativ medicin", + "Akut medicin", "Dermato-venerologi", "Neurologi", "Onkologi", + "Fysiurgi", or "Tropemedicin" then `department` == "other + medical") + +Internally, these intermediate results are combined and processed +together. And ultimately, `include_diabetes_diagnoses()` outputs a +single `data.frame` with the following variables (up to two rows per +individual): + +- `pnr`: identifier variable +- `dates`: dates of the first and second hospital diabetes diagnosis +- `n_t1d_endocrinology`: number of type 1 diabetes-specific primary + diagnosis codes from endocrinological departments +- `n_t2d_endocrinology`: number of type 2 diabetes-specific primary + diagnosis codes from endocrinological departments +- `n_t1d_medical`: number of type 1 diabetes-specific primary + diagnosis codes from medical departments +- `n_t2d_medical`: number of type 2 diabetes-specific primary + diagnosis codes from medical departments + +This output is passed to the `join_inclusions()` function, where the +`dates` variable is used for the final step of the inclusion process. +The variables of counts of diabetes type-specific primary diagnoses (the four columns prefixed `n_` above) are +carried over for the subsequent classification of diabetes type, +initially as inputs to the `get_t1d_primary_diagnosis()` and +`get_majority_of_t1d_diagnoses()` functions. #### Diabetes-specific podiatrist services The function `include_podiatrist_services()` uses `sysi` or `sssy` as input to extract the dates of all diabetes-specific podiatrist services. 
- +These dates are extracted by filtering values beginning with "54" in the +`speciale` variable of the `sssy` and `sysi` registers by default +(alternatively, the function can take the `spec2` variable as input +instead, if that is the data available to the user). In addition, +services provided to a child of the individual (`barnmak` != 0) are +excluded using the `barnmak` variable. An internal helper function +`get_unique_honuge_dates()` is applied to generate a proper date +variable based on the year-week (wwyy-formatted) variable (`honuge`) +found in the raw data, and de-duplicates multiple services registered on +the same date. -#### GLD purchases +`include_podiatrist_services()` outputs a 2-column data frame with up to +two rows for each individual, containing the following variables: -The function `include_gld_purchases()` uses `lmdb` to extract the dates -of all GLD purchases (from 1997 onwards). +- `pnr`: identifier variable +- `date`: the dates of the first and second diabetes-specific + podiatrist record - +The output is passed to the `join_inclusions()` function for the final +step of the inclusion process. - +#### HbA1c tests above the diagnosis cut-off value (48 mmol/mol or 6.5%) -### Exclusion events +The function `include_hba1c()` uses `lab_forsker` as the input data to +extract the dates of all elevated HbA1c test results, using the +appropriate cut-offs: -#### HbA1c tests and GLD purchases during pregnancy +- IFCC units: `analysiscode` NPU27300, any `value` $\geq$ 48 mmol/mol +- DCCT units: `analysiscode` NPU03835: any `value` $\geq$ 6.5% . + +```{r, echo=FALSE} +algorithm |> + filter(name == "hba1c") |> + knitr::kable(caption = "Algorithm used in the implementation for including HbA1c.") +``` -The function `exclude_pregnancy()` uses diagnoses from LPR2 or LPR3 as -input and is used to exclude both HbA1c tests and GLD purchases during -pregnancy. 
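The HbA1c cut-offs listed above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the package's implementation: the toy `lab_forsker` data frame and the `samplingdate` column name are assumptions for the example, and same-day duplicates within an individual are collapsed with `distinct()`.

```r
library(dplyr)

# Hypothetical toy data; `samplingdate` is an assumed column name.
lab_forsker <- data.frame(
  pnr = c("1", "1", "1", "2", "2"),
  samplingdate = as.Date(c(
    "2020-01-01", "2020-01-01", "2020-06-01", "2020-02-01", "2020-03-01"
  )),
  analysiscode = c("NPU27300", "NPU03835", "NPU27300", "NPU03835", "NPU27300"),
  value = c(50, 6.7, 40, 6.4, 48)
)

elevated_hba1c <- lab_forsker |>
  # Keep results at or above the cut-off for the unit of each test code.
  filter(
    (analysiscode == "NPU27300" & value >= 48) |  # IFCC, mmol/mol
      (analysiscode == "NPU03835" & value >= 6.5) # DCCT, %
  ) |>
  select(pnr, dates = samplingdate) |>
  # The same result is often reported once per unit; keep one row per day.
  distinct()

elevated_hba1c
```

In this toy data, the two same-day results for individual "1" collapse to a single row, and the values of 40 mmol/mol and 6.4% are dropped.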
+Multiple elevated results on the same day within each individual are +deduplicated, to account for the same test result often being reported +twice (one for IFCC, one for DCCT units). -Internally, this relies on the function `get_pregnancy_dates()` that -contains the following three helper functions: +`include_hba1c()` outputs a 2-column data frame containing the following +variables: -- `calculate_pregnancy_index_date_for_mc_visits_wo_end_date()` (this - might be removed with the inclusion of the birth register) -- `get_pregnancy_end_dates()`: Keep maternal care visits with an end - date and drop visits between 40 weeks before end date and 12 weeks - after end date. -- `get_maternal_care_visit_dates_without_end_date()`: Uses the output - from `get_pregnancy_end_dates()` which identifies maternal care - visits *with* end dates to derive maternal care visits *without* end - dates. below. +- `pnr`: identifier variable +- `dates`: the dates of all elevated HbA1c test results - +The output is passed to the `exclude_pregnancy()` function for censoring +of elevated results due to potential gestational diabetes (see below). - +#### GLD purchases -#### Glucose-lowering brand drugs for weight loss +The function `include_gld_purchases()` uses `lmdb` to extract the dates +of all GLD purchases. + +These dates are extracted by including all values beginning with "A10" +in the `atc` variable of the `lmdb` register, except for +glucose-lowering drugs that may be used for other conditions than +diabetes: GLP-RAs (`atc` start with "A10BJ") or +dapagliflozin/empagliflozin (`atc` = "A10BK01" or "A10BK03"). + +Since the diagnosis code data on pregnancies (see below) is insufficient +to perform censoring prior to 1997, `include_gld_purchases()` only +extracts dates from 1997 onward by default (if Medical Birth Register +data is available to use for censoring, the extraction window can be +extended). 
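As a rough sketch of the GLD filtering rules described above (and not the package's actual code), the ATC-based selection might look like this. The toy `lmdb` data and the exact start date of `1997-01-01` are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data using the `lmdb` column names mentioned above.
lmdb <- data.frame(
  pnr = c("1", "1", "2", "2", "3", "3"),
  eksd = as.Date(c(
    "1996-05-01", "1998-05-01", "2019-01-01",
    "2019-02-01", "2020-01-01", "2020-02-01"
  )),
  atc = c("A10BA02", "A10BA02", "A10BJ02", "A10AB01", "A10BK01", "C10AA01")
)

gld_purchases <- lmdb |>
  filter(
    startsWith(atc, "A10"),            # all glucose-lowering drugs
    !startsWith(atc, "A10BJ"),         # except GLP-RAs
    !atc %in% c("A10BK01", "A10BK03"), # except dapagliflozin/empagliflozin
    eksd >= as.Date("1997-01-01")      # assumed boundary for "1997 onward"
  ) |>
  rename(date = eksd)

gld_purchases
```

Only the 1998 metformin purchase and the insulin purchase remain in this example; the pre-1997 purchase, the GLP-RA, dapagliflozin, and the non-A10 drug are filtered out.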
+ +This function outputs a long `data.frame` (since all dates of purchases +must be kept for later use in classifying diabetes type) with the +following variables needed later in the classification part of the +function flow: + +- `pnr`: identifier variable +- `date`: dates of all purchases of GLD (renamed from `eksd`) +- `atc`: type of drug +- `contained_doses`: amount purchased, in number of defined daily + doses (DDD). Calculated as `volume` (doses contained in the + purchased package) times `apk` (number of packages purchased) +- `indication_code`: indication code of the prescription (renamed from + `indo`) + +These events are then passed to a chain of exclusion functions: +`exclude_potential_pcos()` and `exclude_pregnancy()` described in the +sections below. -The function `exclude_wld_purchases()` uses lmdb as input and excludes -the brand drugs Saxenda and Wegovy. +### Exclusion events - +#### Metformin purchases potentially for the treatment of polycystic ovary syndrome -#### Metformin purchases for women below age 40 +The function `exclude_potential_pcos()` takes the output from +`include_gld_purchases()` and `bef` (information on sex and date of +birth) as inputs and censors (filters out) all purchases of metformin in +women below age 40 at the date of purchase (`atc` = "A10BA02" & `sex` = +"woman" & age at purchase (`date`-`date_of_birth`) \< 40 years) or an +indication code suggesting the prescription was made for treatment of +polycystic ovary syndrome (`atc` = "A10BA02" & `sex` = "woman" & +`indication_code` either of "0000092", "0000276" or "0000781"). -The function `exclude_potential_pcos()` as input to exclude all -purchases of metformin by women below age 40 (i.e., \<= 39 years old) at -the date of purchase. It relies on `bef` as input. +This function only performs a filtering operation, and output retains +the same structure and variables as the input passed from +`include_gld_purchases()`. 
After these exclusions are made, the output +is passed to `exclude_pregnancy()` for further censoring, described +below. -This function contains two helper functions: +#### HbA1c tests and GLD purchases during pregnancy -- `keep_women()` -- `drop_age_40_below()` +The function `exclude_pregnancy()` takes the combined outputs from +`join_lpr2()`, `join_lpr3()`, `include_hba1c()`, and +`exclude_potential_pcos()` and uses diagnoses from LPR2 or LPR3 to +exclude both elevated HbA1c tests and GLD purchases during pregnancy, as +these may be due to gestational diabetes, rather than type 1 or type 2 +diabetes. - +Internally, this relies on the function `get_pregnancy_dates()`, which +uses diagnoses registered in LPR2 and LPR3 to extract +the dates of all recorded pregnancy endings (live births and +miscarriages). These are identified by `diagnosis` values beginning with +"DO0[0-6]", "DO8[0-4]" or "DZ3[37]". The dates output by +`get_pregnancy_dates()` are used to exclude all inclusion events +registered between 40 weeks before and 12 weeks after a pregnancy +ending. + +After these exclusion functions have been applied, the output serves as +input to two sets of functions: + +1. The censored HbA1c and GLD data are passed to the + `join_inclusions()` function for the final step of the inclusion + process. +2. The censored GLD data is passed to the + `get_only_insulin_purchases()`, + `get_insulin_purchases_within_180_days()`, and + `get_insulin_is_two_thirds_of_gld_doses()` helper functions for the + classification of diabetes type. + +### Join inclusion events + +The function `join_inclusions()` appends/row-binds the dates output from +the functions that process the four types of inclusion events, by `pnr`. 
Thus, +it takes as input the following variables, output from the following +functions: + +- From `include_diabetes_diagnoses()`: + - `pnr`: identifier variable + - `dates`: dates of the first and second hospital diabetes + diagnosis +- From `include_podiatrist_services()`: + - `pnr`: identifier variable + - `dates`: the dates of the first and second diabetes-specific + podiatrist record +- From `exclude_pregnancy()`: + - `pnr`: identifier variable + - `dates`: the dates of the first and second elevated HbA1c test + results (after censoring) +- From `exclude_pregnancy()`: + - `pnr`: identifier variable + - `date`: dates of all purchases of GLD + - The dates of the first and second purchase of GLD of each + individual are extracted from these and appended as two rows + to the `dates` variable. + +The output from the function is a `data.frame` containing two variables +(`pnr` and `dates`) and 1 to 8 rows per `pnr`. This output is passed to +`get_diagnosis_date()`. ### Get diagnosis date -The function `get_diagnosis_date()` combines the outputs from the -inclusion and exclusion functions to get the final diagnosis date. -Initially, it drops the first inclusion and exclusion events from the -function outputs with the helper `drop_first_event()`, so that only -those with two or more events are kept. This is then used to assign an -initial diagnosis according to OSDC. Then, all the outputs are joined -together with `join_diagnosis_dates()`. - -Finally, the dates outside of the data coverage period are dropped with -`drop_diagnosis_dates_outside_coverage()` to end with a final diagnosis -date. For details on this censoring based on periods with insufficient -data coverage, see the `vignette("design")`. +The function `get_diagnosis_date()` takes the output from +`join_inclusions()` and defines the final diagnosis date based on all +the inclusion event types. 
+ +First, the inputs are sorted by `dates` within each level of `pnr`, then +the earliest value of `dates` is dropped, so that only those with two or +more events are included. The date of inclusion, `raw_inclusion_date`, +is then defined as the earliest value of `dates` in the remaining rows +for each individual (effectively the date of the second recorded +inclusion event). A third variable, `stable_inclusion_date`, is defined +based on `raw_inclusion_date`: if `raw_inclusion_date` is before the stable +inclusion threshold (one year after medication data starts to contribute +to inclusions; default "31-12-1997"), `stable_inclusion_date` is +set to `NA`, otherwise it is set to `raw_inclusion_date`. This variable +serves to limit the included cohort to only individuals with a valid date +of inclusion (and thereby valid age at inclusion & duration of +diabetes). + +`get_diagnosis_date()` outputs a `data.frame` with the following +variables: + +- `pnr`: identifier variable +- `raw_inclusion_date`: date of inclusion +- `stable_inclusion_date`: date of inclusion of valid incident cases + +This output is passed to the `get_diabetes_type()` function and used to +classify the diabetes type as described below. ### Classifying the diabetes type @@ -158,13 +363,106 @@ extracted diabetes population as having either T1D or T2D. As described in the `vignette("design")`, individuals not classified as T1D cases are classified as T2D cases. -The output is a `data.frame` that includes one row per individual in the -diabetes population: one column with their PNR, two columns with -inclusion dates (one "stable" date and one "raw" date - see the -`vignette("design")` for an elaboration on what that entails), and one -column with the diabetes type. 
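For reference, the inclusion-date rule described under "Get diagnosis date" (drop the earliest event, take the next one, then blank out dates before the stable threshold) can be sketched as below. This is a hedged illustration on invented data; the `inclusions` data frame and the `stable_threshold` object are assumptions for the example and not part of the package's API.

```r
library(dplyr)

# Hypothetical joined inclusion events (one row per event).
inclusions <- data.frame(
  pnr = c("1", "1", "1", "2", "3", "3"),
  dates = as.Date(c(
    "1996-03-01", "1995-01-01", "2001-01-01",
    "2005-01-01", "2010-06-01", "2010-01-01"
  ))
)

# Default threshold quoted in the text ("31-12-1997").
stable_threshold <- as.Date("1997-12-31")

inclusion_dates <- inclusions |>
  arrange(pnr, dates) |>
  group_by(pnr) |>
  # Drop the earliest event; individuals with one event disappear here.
  filter(row_number() > 1) |>
  # The earliest remaining date is the second recorded inclusion event.
  summarise(raw_inclusion_date = min(dates), .groups = "drop") |>
  mutate(
    stable_inclusion_date = if_else(
      raw_inclusion_date < stable_threshold,
      as.Date(NA),
      raw_inclusion_date
    )
  )

inclusion_dates
```

Here individual "2" has only one event and is not included, individual "1" gets a raw date of 1996-03-01 with a missing stable date, and individual "3" gets 2010-06-01 for both.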
- - +As the diabetes type classification incorporates an evaluation of the +time from diagnosis/inclusion to first subsequent purchase of insulin, +the `get_diabetes_type()` function has to take the date of diagnosis and +all purchases of GLD drugs (after censoring) as inputs. In addition, +information on diabetes type-specific primary diagnoses from hospitals +is also a requirement. + +Thus, the function takes the following inputs from +`get_diagnosis_date()`, `exclude_pregnancy()`, and +`include_diabetes_diagnoses()`: + +- From `get_diagnosis_date()`: Information on date of diagnosis of + diabetes + - `pnr` + - `raw_inclusion_date` + - `stable_inclusion_date` +- From `exclude_pregnancy()`: Information on historic GLD purchases: + - `pnr`: identifier variable + - `date`: dates of all purchases of GLD. + - `atc`: type of drug + - `contained_doses`: defined daily doses of drug contained in + purchase +- From `include_diabetes_diagnoses()`: Information on diabetes + type-specific primary diagnoses from hospitals: + - `pnr`: identifier variable + - `n_t1d_endocrinology`: number of type 1 diabetes-specific + primary diagnosis codes from endocrinological departments + - `n_t2d_endocrinology`: number of type 2 diabetes-specific + primary diagnosis codes from endocrinological departments + - `n_t1d_medical`: number of type 1 diabetes-specific primary + diagnosis codes from medical departments + - `n_t2d_medical`: number of type 2 diabetes-specific primary + diagnosis codes from medical departments + +For each `pnr` number, several helper functions are applied to these +inputs to extract additional information from the censored GLD data and +diagnoses to use for classification of diabetes type. 
All of these +return a single value (`TRUE`, otherwise `FALSE`) for each individual: + +- `get_only_insulin_purchases()`: + - Inputs passed from `exclude_pregnancy()`: + - `atc` + - Outputs: + - `only_insulin_purchases` = `TRUE` if all purchases have an `atc` + starting with "A10A" (no non-insulin GLD purchases are present) +- `get_insulin_purchases_within_180_days()` + - Inputs passed from `exclude_pregnancy()`: + - `date` & `atc` + - Inputs passed from `get_diagnosis_date()`: + - `raw_inclusion_date` + - Outputs: `TRUE` if any purchases with `atc` starting with "A10A" + have a `date` between 0 and 180 days after + `raw_inclusion_date` +- `get_insulin_is_two_thirds_of_gld_doses()` + - Inputs passed from `exclude_pregnancy()`: + - `contained_doses` & `atc` + - Outputs: `TRUE` if the sum of `contained_doses` of rows of `atc` + starting with "A10A" (except "A10AE5") is at least twice the sum + of `contained_doses` of rows of `atc` starting with "A10B" or + "A10AE5" +- `get_any_t1d_primary_diagnoses()`: + - Inputs passed from `include_diabetes_diagnoses()`: + - `n_t1d_endocrinology` & `n_t1d_medical` + - Outputs: `TRUE` if the combined sum of the inputs is 1 or above. +- `get_type_diagnoses_from_endocrinology()`: + - Inputs passed from `include_diabetes_diagnoses()`: + - `n_t1d_endocrinology`, `n_t2d_endocrinology` + - Outputs: `type_diagnoses_from_endocrinology` = `TRUE` if the + combined sum of the inputs is 1 or above +- `get_type_diagnosis_majority()`: + - Inputs passed from `include_diabetes_diagnoses()`: + - `n_t1d_endocrinology`, `n_t2d_endocrinology`, + `n_t1d_medical` & `n_t2d_medical` + - Inputs passed from `get_type_diagnoses_from_endocrinology()`: + - `type_diagnoses_from_endocrinology` + - Outputs: `TRUE` if `type_diagnoses_from_endocrinology` == `TRUE` + and `n_t1d_endocrinology` is above `n_t2d_endocrinology`. 
Also + `TRUE` if `type_diagnoses_from_endocrinology` = `FALSE` and + `n_t1d_medical` is above `n_t2d_medical` + +`get_diabetes_type()` evaluates all the outputs from the helper +functions to define diabetes type for each individual. Diabetes type is +classified as "T1D" if: + +- `only_insulin_purchases` == `TRUE` & `any_t1d_primary_diagnoses` == + `TRUE` +- Or `only_insulin_purchases` == `FALSE` & `any_t1d_primary_diagnoses` + == `TRUE` & `type_diagnosis_majority` == `TRUE` & + `insulin_is_two_thirds_of_gld_doses` == `TRUE` & + `insulin_purchases_within_180_days` == `TRUE` + +`get_diabetes_type()` returns a `data.frame` with one row per `pnr` +number and four columns: `pnr`, `stable_inclusion_date`, +`raw_inclusion_date` & `diabetes_type`. This is the final product of the +OSDC algorithm. See the `vignette("design")` for an more detail on the +two inclusion dates and their intended use-cases. + + + + ![Flow of functions for classifying diabetes status using the `osdc` package.](images/function-flow-classification.svg) @@ -179,8 +477,8 @@ OSDC algorithm includes the following criteria: diagnoses extracted from `lpr_diag` (LPR2) and `diagnoser` (LPR3) in the previous steps. 2. `get_only_insulin_purchases()` which relies on the GLD purchases - from Lægemiddelsdatabasen to get patients where all GLD purchases - are insulin only. + from Lægemiddeldatabasen to get patients where all GLD purchases are + insulin only. 3. `get_majority_of_t1d_diagnoses()` (as compared to T2D diagnoses) which again relies on primary hospital diagnoses from LPR. 4. `get_insulin_purchase_within_180_days()` which relies on both