Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NOTES] Experiences with dbt-synthea for Big Patient Datasets #91

Open
TheCedarPrince opened this issue Oct 24, 2024 · 32 comments
Open

Comments

@TheCedarPrince
Copy link
Collaborator

Hey @katy-sadowski,

I ended up opening an issue as Discussions is not enabled for this repo.

Wanted to share some notes on my experiences with you on using this solution for 1 million patients each with 5 year retrospective look back:

I am going to reproduce these errors tomorrow and actually generate 1 million patients from scratch using Synthea and re-run the pipeline to find all the tables that fail to be built.

Hope this helps folks!

~ tcp 🌳

@katy-sadowski
Copy link
Collaborator

Fantastic, thanks @TheCedarPrince ! I think I'll try it on my end as well so we have 2x OS & laptop worth of data on performance :)

@lawrenceadams
Copy link
Collaborator

Nice work @TheCedarPrince ! I was playing around with this in Snowflake over the weekend and generated a million patients and started running it on their to see how it would benchmark compared to duckdb - keen to hear how you get on!

@TheCedarPrince
Copy link
Collaborator Author

Question for folks:

Can I run dbt-synthea on "chunks" of data I have manually split up (Meaning, I have folders like csvs_1, csvs_2, etc. and for each folder, have about 200K patients)? And then run dbt-synthea against each of these folders without rebuilding the prior DuckDB database? Was wondering as then I can chunk out the ETL simply this way.

@TheCedarPrince
Copy link
Collaborator Author

Also, @katy-sadowski , I was going to tinker a little later with also getting hyperfine to work with this process for benchmarking purposes. It's considered the state of the art for benchmarking and is used a ton in numerical or HPC environments -- especially like high energy physics areas. Ran into some problems yesterday when I tried to naively run it so there might be some system environment quirks.

@lawrenceadams
Copy link
Collaborator

lawrenceadams commented Oct 24, 2024

Can I run dbt-synthea on "chunks" of data I have manually split up (Meaning, I have folders like csvs_1, csvs_2, etc. and for each folder, have about 200K patients)? And then run dbt-synthea against each of these folders without rebuilding the prior DuckDB database? Was wondering as then I can chunk out the ETL simply this way.

Yeah you definitely can with duckdb with next to no effort - you'd have to change the sources to glob like this!

@TheCedarPrince
Copy link
Collaborator Author

Interesting -- what files would need to be changed to accommodate this in dbt-synthea? Because I can load vocab now fine in #90 and mostly the loading of data is fine, but the problem I have, as mentioned:

[ETLing] this data in the step dbt run fails on large tables in the OMOP Schema -- especially those tables which have a large lineage.

So I would want to dbt run multiple times versus loading. Or do you see the load step as helping with the ETL step @lawrenceadams ?

@lawrenceadams
Copy link
Collaborator

lawrenceadams commented Oct 24, 2024

So there are two approaches:

  1. we could either modify the loader script to glob files , or...
  2. more simply - given duckdb doesn't need files loaded before it can use them (@katy-sadowski might correct me - but my understanding as to why we have a step where we actively load files into the pipeline is to keep the source.yml files identical between database systems e.g. Postgres/Duckdb/etc can use the same code), you could modify a sources.yml file to incorporate something like the below:
image

I would modify the sources file below

version: 2
sources:
- name: synthea
schema: "{% if var('seed_source') %}{{ target.schema ~ '_synthea_seeds' }}{% else %}{{ target.schema ~ '_synthea' }}{% endif %}"
tables:
- name: allergies
- name: careplans
- name: claims_transactions
- name: claims
- name: conditions
- name: devices
- name: encounters
- name: imaging_studies
- name: immunizations
- name: medications
- name: observations
- name: organizations
- name: patients
- name: payer_transitions
- name: payers
- name: procedures
- name: providers
- name: supplies

to:

version: 2

sources:
  - name: synthea
    meta:
      external_location: "/path/to/synthea/output/{name}/*.csv"
    tables:
      - name: allergies
      - name: careplans
      - name: claims_transactions
      - name: claims
      - name: conditions
      - name: devices
      - name: encounters
      - name: imaging_studies
      - name: immunizations
      - name: medications
      - name: observations
      - name: organizations
      - name: patients
      - name: payer_transitions
      - name: payers
      - name: procedures
      - name: providers
      - name: supplies 

and it should work without having to have an explicit load step 🚀

Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier 😄

@lawrenceadams
Copy link
Collaborator

I tried loading in the chonkiest of tables (claims transactions) with success - although this took a fair amount of time on a high end Mac...

image

@TheCedarPrince
Copy link
Collaborator Author

Latest updates @lawrenceadams and @katy-sadowski :

I generated about one million synthetic patients (with some dead patients included) each with 1 year lookback using Synthea (see my synthea.properties file below):

Click to expand to see `synthea.properties`
# Starting with a properties file because it requires no additional dependencies

exporter.baseDirectory = ./output/
exporter.use_uuid_filenames = false
exporter.subfolders_by_id_substring = false
# number of years of history to keep in exported records, anything older than this may be filtered out
# set years_of_history = 0 to skip filtering altogether and keep the entire history
exporter.years_of_history = 1
# split records allows patients to have one record per provider organization
exporter.split_records = false
exporter.split_records.duplicate_data = false
exporter.metadata.export = true
exporter.ccda.export = false
exporter.fhir.export = false
exporter.fhir_stu3.export = false
exporter.fhir_dstu2.export = false
exporter.fhir.use_shr_extensions = false
exporter.fhir.use_us_core_ig = true
exporter.fhir.transaction_bundle = true
exporter.fhir.bulk_data = false
exporter.groups.fhir.export = false
exporter.hospital.fhir.export = true
exporter.hospital.fhir_stu3.export = false
exporter.hospital.fhir_dstu2.export = false
exporter.practitioner.fhir.export = true
exporter.practitioner.fhir_stu3.export = false
exporter.practitioner.fhir_dstu2.export = false
exporter.encoding = UTF-8
exporter.csv.export = true
# if exporter.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh 
exporter.csv.append_mode = false
# if exporter.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.csv.folder_per_run = false
# included_files and excluded_files list out the files to include/exclude in the csv exporter
# only one of these may be set at a time, if both are set then both will be ignored
# if neither is set, then all files will be included
# see list of files at: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
# include filenames separated with a comma, ex: patients.csv,procedures.csv,medications.csv
# NOTE: the csv exporter does not actively delete files, so if Run 1 you included a file, then Run 2 you exclude that file, the version from Run 1 will still be present
exporter.csv.included_files = 
exporter.csv.excluded_files = 

exporter.cpcds.export = false
exporter.cpcds.append_mode = false
exporter.cpcds.folder_per_run = false
exporter.cpcds.single_payer = false
exporter.cdw.export = false
exporter.text.export = false
exporter.text.per_encounter_export = false
exporter.clinical_note.export = false

# parameters for symptoms export
exporter.symptoms.csv.export = false
# selection mode of conditions or symptom export: 0 = conditions according to  exporter.years_of_history. other values = all conditions (entire history)
exporter.symptoms.mode = 0
# if exporter.symptoms.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.symptoms.csv.append_mode = false
# if exporter.symptoms.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.symptoms.csv.folder_per_run = false
exporter.symptoms.text.export = false

# the number of patients to generate, by default
# this can be overridden by passing a different value to the Generator constructor
generate.default_population = 1

generate.log_patients.detail = simple
# options are "none", "simple", or "detailed" (without quotes). defaults to simple if another value is used
# none = print nothing to the console during generation
# simple = print patient names once they are generated.
# detailed = print patient names, atributes, vital signs, etc..  May slow down processing

generate.timestep = 604800000
# time is in ms
# 1000 * 60 * 60 * 24 * 7 = 604800000

# default demographics is every city in the US
generate.demographics.default_file = geography/demographics.csv
generate.geography.zipcodes.default_file = geography/zipcodes.csv
generate.geography.country_code = US
generate.geography.timezones.default_file = geography/timezones.csv
generate.geography.foreign.birthplace.default_file = geography/foreign_birthplace.json
generate.geography.sdoh.default_file = geography/sdoh.csv

# Lookup Table Folder location
generate.lookup_tables = modules/lookup_tables/

# Set to true if you want every patient to be dead.
generate.only_dead_patients = false
# Set to true if you want every patient to be alive.
generate.only_alive_patients = false
# If both only_dead_patients and only_alive_patients are set to true,
# It they will both default back to false

# if criteria are provided, (for example, only_dead_patients, only_alive_patients, or a "patient keep module" with -k flag)
# this is the maximum number of times synthea will loop over a single slot attempting to produce a matching patient.
# after this many failed attempts, it will throw an exception.
# set this to 0 to allow for unlimited attempts (but be aware of the possibility that it will never complete!)
generate.max_attempts_to_keep_patient = 1000

# if true, tracks and prints out details of transition tables for each module upon completion
# note that this may significantly slow down processing, and is intended primarily for debugging
generate.track_detailed_transition_metrics = false

# If true, person names have numbers appended to them to make them more obviously fake
generate.append_numbers_to_person_names = true

# if true, the entire population will use veteran prevalence data
generate.veteran_population_override = false

# these should add up to 1.0
# weighting and categories are inspired by the following but there are no specific hard numbers to point to
# http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1694190/pdf/amjph00543-0042.pdf
# http://www.ncbi.nlm.nih.gov/pubmed/8122813
generate.demographics.socioeconomic.weights.income = 0.2
generate.demographics.socioeconomic.weights.education = 0.7
generate.demographics.socioeconomic.weights.occupation = 0.1

generate.demographics.socioeconomic.score.low = 0.0
generate.demographics.socioeconomic.score.middle = 0.25
generate.demographics.socioeconomic.score.high = 0.66

generate.demographics.socioeconomic.education.less_than_hs.min = 0.0
generate.demographics.socioeconomic.education.less_than_hs.max = 0.5
generate.demographics.socioeconomic.education.hs_degree.min = 0.1
generate.demographics.socioeconomic.education.hs_degree.max = 0.75
generate.demographics.socioeconomic.education.some_college.min = 0.3
generate.demographics.socioeconomic.education.some_college.max = 0.85
generate.demographics.socioeconomic.education.bs_degree.min = 0.5
generate.demographics.socioeconomic.education.bs_degree.max = 1.0

generate.demographics.socioeconomic.income.poverty = 11000
generate.demographics.socioeconomic.income.high = 75000

generate.birthweights.default_file = birthweights.csv
generate.birthweights.logging = false

# in Massachusetts, the individual insurance mandate became law in 2006
# in the US, the Affordable Care Act become law in 2010,
# and individual and employer mandates took effect in 2014.
# mandate.year will determine when individuals with an occupation score above mandate.occupation
# receive employer mandated insurance (aka "private" insurance).
# prior to mandate.year, anyone with income greater than the annual cost of an insurance plan
# will purchase the insurance.
generate.insurance.mandate.year = 2006
generate.insurance.mandate.occupation = 0.2

# Default Costs, to be used for pricing something that we don't have a specific price for
# -- $500 for procedures is completely invented
generate.costs.default_procedure_cost = 1.23
# -- $255 for medications - also invented
generate.costs.default_medication_cost = 1.23
# -- Encounters billed using avg prices from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096340/
# -- Adjustments for initial or subsequent hospital visit and level/complexity/time of encounter
# -- not included. Assume initial, low complexity encounter (Tables 4 & 6)
generate.costs.default_encounter_cost = 125.00
# -- https://www.nytimes.com/2014/07/03/health/Vaccine-Costs-Soaring-Paying-Till-It-Hurts.html
# -- currently all vaccines cost $136.
generate.costs.default_immunization_cost = 136.00

# Providers
generate.providers.hospitals.default_file = providers/hospitals.csv
generate.providers.longterm.default_file = providers/longterm.csv
generate.providers.nursing.default_file = providers/nursing.csv
generate.providers.rehab.default_file = providers/rehab.csv
generate.providers.hospice.default_file = providers/hospice.csv
generate.providers.dialysis.default_file = providers/dialysis.csv
generate.providers.homehealth.default_file = providers/home_health_agencies.csv
generate.providers.veterans.default_file = providers/va_facilities.csv
generate.providers.urgentcare.default_file = providers/urgent_care_facilities.csv
generate.providers.primarycare.default_file = providers/primary_care_facilities.csv
generate.providers.ihs.hospitals.default_file = providers/ihs_facilities.csv
generate.providers.ihs.primarycare.default_file = providers/ihs_centers.csv

# Provider selection behavior
# How patients select a provider organization:
#  nearest - select the closest provider. See generate.providers.maximum_search_distance
#  quality - select the best provider if quality is known. Otherwise nearest.
#  random  - select randomly.
#  network - select the nearest provider in your insurance network. same as random except it changes every time the patient switches insurance provider.
generate.providers.selection_behavior = nearest

# if a provider cannot be found for a certain type of service,
# this will default to the nearest hospital.
generate.providers.default_to_hospital_on_failure = true

# minimum number of providers linked per patient
# if this number is not met it re-runs the simulation
generate.providers.minimum = 1

# maximum distance to look for a provider for a given patient, in km
# set to 10 degrees lat/lon to support the model that veterans only seek care at VA facilities
generate.providers.maximum_search_distance = 32

# Payers
generate.payers.insurance_companies.default_file = payers/insurance_companies.csv
generate.payers.insurance_companies.medicare = Medicare
generate.payers.insurance_companies.medicaid = Medicaid
generate.payers.insurance_companies.dual_eligible = Dual Eligible

# Payer selection behavior
# How patients select a payer:
#  best_rates - select insurance with best rates for person's existing conditions and medical needs
#  random  - select randomly.
generate.payers.selection_behavior = random

# Payer adjustment behavior
# How payers adjust claims:
#  none - the payer reimburses each claim by the full amount.
#  fixed - the payer adjusts each claim by a fixed rate (set by adjustment_rate)
#  random  - the payer adjusts each claim by a random rate (between zero and adjustment_rate).
generate.payers.adjustment_behavior = none
# Payer adjustment rate should be between zero and one (0.00 - 1.00), where 0.05 is 5%.
generate.payers.adjustment_rate = 0.10

# Experimental feature. Patients will miss care if true, but side-effects of missing that care
# are not handled. Additionally, the path the disease module might take may no longer make sense.
# It might assume things occurred that haven't actually happened it. Use with care.
generate.payers.loss_of_care = false

# Add a FHIR terminology service URL to enable the use of ValueSet URIs within code definitions.
# generate.terminology_service_url = https://r4.ontoserver.csiro.au/fhir

# Quit Smoking
lifecycle.quit_smoking.baseline = 0.01
lifecycle.quit_smoking.timestep_delta = -0.01
lifecycle.quit_smoking.smoking_duration_factor_per_year = 1.0

# Quit Alcoholism
lifecycle.quit_alcoholism.baseline = 0.001
lifecycle.quit_alcoholism.timestep_delta = -0.001
lifecycle.quit_alcoholism.alcoholism_duration_factor_per_year = 1.0

# Adherence
lifecycle._direct_transition field
physiology.state.enabled = false

# set to true to introduce errors in height, weight and BMI observations for people
# under 20 years old
growtherrors = false
```adherence.baseline = 0.05

# set this to true to enable randomized "death by natural causes"
# highly recommended if "only_dead_patients" is true
lifecycle.death_by_natural_causes = false

# set this to enable "death by loss of care" or missed care,
# e.g. not covered by insurance or otherwise unaffordable.
# only functional if "generate.payers.loss_of_care" is also true.
lifecycle.death_by_loss_of_care = false

# Use physiology simulations to generate some VitalSigns
physiology.generators.enabled = false

# Allow physiology module states to be executed
# If false, all Physiology state objects will immediately redirect to the state defined in
# the alt_direct_transition field
physiology.state.enabled = false

# set to true to introduce errors in height, weight and BMI observations for people
# under 20 years old
growtherrors = false

And then I did the following:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt deps
14:58:36  Running with dbt=1.8.7
14:58:37  Installing dbt-labs/dbt_utils
14:58:37  Installed from version 1.3.0
14:58:37  Up to date!
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: false}"
14:59:23  Running with dbt=1.8.7
14:59:23  Registered adapter: duckdb=1.8.0
14:59:24  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros

Following this, I loaded the vocab files:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
15:12:50  Running with dbt=1.8.7
15:12:51  Registered adapter: duckdb=1.8.0
15:12:51  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt seed --select states omop
15:13:48  Running with dbt=1.8.7
15:13:48  Registered adapter: duckdb=1.8.0
15:13:49  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:49
15:13:51  Concurrency: 1 threads (target='dev')
15:13:51
15:13:51  1 of 1 START seed file dbt_synthea_dev_map_seeds.states ........................ [RUN]
15:13:52  1 of 1 OK loaded seed file dbt_synthea_dev_map_seeds.states .................... [INSERT 51 in 1.71s]
15:13:52
15:13:52  Finished running 1 seed in 0 hours 0 minutes and 3.67 seconds (3.67s).
15:13:53
15:13:53  Completed successfully
15:13:53
15:13:53  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

And then I ran dbt run but had some tables fail -- not sure what to make of them as well as the suggestions at the end of the run process:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run
15:13:57  Running with dbt=1.8.7
15:13:57  Registered adapter: duckdb=1.8.0
15:13:58  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:58
15:13:58  Concurrency: 1 threads (target='dev')
15:13:58
15:13:58  1 of 85 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
15:14:00  1 of 85 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.78s]
15:14:00  2 of 85 START sql table model dbt_synthea_dev.episode .......................... [RUN]
15:14:01  2 of 85 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.66s]
15:14:01  3 of 85 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
15:14:03  3 of 85 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.50s]
15:14:03  4 of 85 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
15:14:04  4 of 85 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.23s]
15:14:04  5 of 85 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
15:14:05  5 of 85 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.21s]
15:14:05  6 of 85 START sql table model dbt_synthea_dev.note ............................. [RUN]
15:14:06  6 of 85 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.29s]
15:14:06  7 of 85 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
15:14:08  7 of 85 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.25s]
15:14:08  8 of 85 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
15:14:09  8 of 85 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.26s]
15:14:09  9 of 85 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
15:14:10  9 of 85 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.23s]
15:14:10  10 of 85 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
15:14:12  10 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 1.62s]
15:14:12  11 of 85 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
15:14:14  11 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 2.10s]
15:14:14  12 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
15:15:24  12 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 69.77s]
15:15:24  13 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [RUN]
15:18:07  13 of 85 ERROR creating sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [ERROR in 163.53s]
15:18:07  14 of 85 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
15:18:15  14 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 8.14s]
15:18:15  15 of 85 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
15:18:17  15 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 1.63s]
15:18:17  16 of 85 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
15:18:33  16 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 16.23s]
15:18:33  17 of 85 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
15:18:50  17 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies  [OK in 16.53s]
15:18:50  18 of 85 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
15:18:52  18 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 2.35s]
15:18:52  19 of 85 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
15:19:12  19 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 19.96s]
15:19:12  20 of 85 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
15:19:33  20 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 21.33s]
15:19:33  21 of 85 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
15:19:35  21 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 1.71s]
15:19:35  22 of 85 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
15:19:38  22 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 2.91s]
15:19:38  23 of 85 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
15:19:54  23 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions  [OK in 15.85s]
15:19:54  24 of 85 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
15:19:56  24 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 1.95s]
15:19:56  25 of 85 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
15:20:00  25 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 4.16s]
15:20:00  26 of 85 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
15:20:02  26 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 1.95s]
15:20:02  27 of 85 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
15:20:04  27 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 2.19s]
15:20:04  28 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
15:20:08  28 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept .... [OK in 3.58s]
15:20:08  29 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [RUN]
15:20:11  29 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [OK in 3.40s]
15:20:11  30 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_class ... [RUN]
15:20:13  30 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_class  [OK in 1.85s]
15:20:13  31 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship  [RUN]
15:20:18  31 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship  [OK in 4.74s]
15:20:18  32 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym . [RUN]
15:20:20  32 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym  [OK in 2.36s]
15:20:20  33 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__domain .......... [RUN]
15:20:22  33 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__domain ..... [OK in 1.77s]
15:20:22  34 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__drug_strength ... [RUN]
15:20:24  34 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__drug_strength  [OK in 2.13s]
15:20:24  35 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__relationship .... [RUN]
15:20:26  35 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__relationship  [OK in 1.80s]
15:20:26  36 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map  [RUN]
15:20:26  36 of 85 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map  [ERROR in 0.06s]
15:20:26  37 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__vocabulary ...... [RUN]
15:20:28  37 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__vocabulary . [OK in 1.75s]
15:20:28  38 of 85 START sql table model dbt_synthea_dev.int__er_visits .................. [RUN]
15:20:31  38 of 85 OK created sql table model dbt_synthea_dev.int__er_visits ............. [OK in 3.11s]
15:20:31  39 of 85 START sql table model dbt_synthea_dev.int__ip_visits .................. [RUN]
15:20:35  39 of 85 OK created sql table model dbt_synthea_dev.int__ip_visits ............. [OK in 4.15s]
15:20:35  40 of 85 START sql table model dbt_synthea_dev.int__op_visits .................. [RUN]
15:20:57  40 of 85 OK created sql table model dbt_synthea_dev.int__op_visits ............. [OK in 22.01s]
15:20:57  41 of 85 START sql table model dbt_synthea_dev.care_site ....................... [RUN]
15:20:59  41 of 85 OK created sql table model dbt_synthea_dev.care_site .................. [OK in 1.77s]
15:20:59  42 of 85 START sql table model dbt_synthea_dev.int__person ..................... [RUN]
15:21:01  42 of 85 OK created sql table model dbt_synthea_dev.int__person ................ [OK in 2.49s]
15:21:01  43 of 85 START sql table model dbt_synthea_dev.location ........................ [RUN]
15:21:04  43 of 85 OK created sql table model dbt_synthea_dev.location ................... [OK in 2.64s]
15:21:04  44 of 85 START sql table model dbt_synthea_dev.provider ........................ [RUN]
15:21:06  44 of 85 OK created sql table model dbt_synthea_dev.provider ................... [OK in 1.96s]
15:21:06  45 of 85 START sql table model dbt_synthea_dev.concept ......................... [RUN]
15:21:11  45 of 85 OK created sql table model dbt_synthea_dev.concept .................... [OK in 4.84s]
15:21:11  46 of 85 START sql table model dbt_synthea_dev.int__source_to_source_vocab_map . [RUN]
15:21:16  46 of 85 OK created sql table model dbt_synthea_dev.int__source_to_source_vocab_map  [OK in 5.19s]
15:21:16  47 of 85 START sql table model dbt_synthea_dev.concept_ancestor ................ [RUN]
15:21:21  47 of 85 OK created sql table model dbt_synthea_dev.concept_ancestor ........... [OK in 4.67s]
15:21:21  48 of 85 START sql table model dbt_synthea_dev.concept_class ................... [RUN]
15:21:23  48 of 85 OK created sql table model dbt_synthea_dev.concept_class .............. [OK in 1.81s]
15:21:23  49 of 85 START sql table model dbt_synthea_dev.concept_relationship ............ [RUN]
15:21:28  49 of 85 OK created sql table model dbt_synthea_dev.concept_relationship ....... [OK in 5.81s]
15:21:28  50 of 85 START sql table model dbt_synthea_dev.int__source_to_standard_vocab_map  [RUN]
15:21:34  50 of 85 OK created sql table model dbt_synthea_dev.int__source_to_standard_vocab_map  [OK in 5.33s]
15:21:34  51 of 85 START sql table model dbt_synthea_dev.concept_synonym ................. [RUN]
15:21:37  51 of 85 OK created sql table model dbt_synthea_dev.concept_synonym ............ [OK in 3.53s]
15:21:37  52 of 85 START sql table model dbt_synthea_dev.domain .......................... [RUN]
15:21:39  52 of 85 OK created sql table model dbt_synthea_dev.domain ..................... [OK in 1.79s]
15:21:39  53 of 85 START sql table model dbt_synthea_dev.drug_strength ................... [RUN]
15:21:43  53 of 85 OK created sql table model dbt_synthea_dev.drug_strength .............. [OK in 3.32s]
15:21:43  54 of 85 START sql table model dbt_synthea_dev.relationship .................... [RUN]
15:21:45  54 of 85 OK created sql table model dbt_synthea_dev.relationship ............... [OK in 1.92s]
15:21:45  55 of 85 SKIP relation dbt_synthea_dev.source_to_concept_map ................... [SKIP]
15:21:45  56 of 85 START sql table model dbt_synthea_dev.cdm_source ...................... [RUN]
15:21:46  56 of 85 OK created sql table model dbt_synthea_dev.cdm_source ................. [OK in 1.84s]
15:21:46  57 of 85 START sql table model dbt_synthea_dev.vocabulary ...................... [RUN]
15:21:48  57 of 85 OK created sql table model dbt_synthea_dev.vocabulary ................. [OK in 1.79s]
15:21:48  58 of 85 START sql table model dbt_synthea_dev.int__all_visits ................. [RUN]
15:22:16  58 of 85 OK created sql table model dbt_synthea_dev.int__all_visits ............ [OK in 28.17s]
15:22:16  59 of 85 START sql table model dbt_synthea_dev.person .......................... [RUN]
15:22:19  59 of 85 OK created sql table model dbt_synthea_dev.person ..................... [OK in 2.42s]
15:22:19  60 of 85 START sql table model dbt_synthea_dev.int__encounter_provider ......... [RUN]
15:22:26  60 of 85 OK created sql table model dbt_synthea_dev.int__encounter_provider .... [OK in 6.78s]
15:22:26  61 of 85 START sql table model dbt_synthea_dev.int__drug_immunisations ......... [RUN]
15:22:27  61 of 85 OK created sql table model dbt_synthea_dev.int__drug_immunisations .... [OK in 1.86s]
15:22:27  62 of 85 START sql table model dbt_synthea_dev.int__drug_medications ........... [RUN]
15:22:43  62 of 85 OK created sql table model dbt_synthea_dev.int__drug_medications ...... [OK in 15.90s]
15:22:43  63 of 85 START sql table model dbt_synthea_dev.int__observation_allergies ...... [RUN]
15:22:46  63 of 85 OK created sql table model dbt_synthea_dev.int__observation_allergies . [OK in 2.41s]
15:22:46  64 of 85 START sql table model dbt_synthea_dev.int__observation_conditions ..... [RUN]
15:22:51  64 of 85 OK created sql table model dbt_synthea_dev.int__observation_conditions  [OK in 5.41s]
15:22:51  65 of 85 START sql table model dbt_synthea_dev.int__observation_observations ... [RUN]
15:22:57  65 of 85 OK created sql table model dbt_synthea_dev.int__observation_observations  [OK in 6.05s]
15:22:57  66 of 85 START sql table model dbt_synthea_dev.int__assign_all_visit_ids ....... [RUN]
15:23:14  66 of 85 OK created sql table model dbt_synthea_dev.int__assign_all_visit_ids .. [OK in 17.05s]
15:23:14  67 of 85 START sql table model dbt_synthea_dev.death ........................... [RUN]
15:23:17  67 of 85 OK created sql table model dbt_synthea_dev.death ...................... [OK in 2.77s]
15:23:17  68 of 85 START sql table model dbt_synthea_dev.observation_period .............. [RUN]
15:23:20  68 of 85 OK created sql table model dbt_synthea_dev.observation_period ......... [OK in 2.63s]
15:23:20  69 of 85 START sql table model dbt_synthea_dev.payer_plan_period ............... [RUN]
15:23:52  69 of 85 OK created sql table model dbt_synthea_dev.payer_plan_period .......... [OK in 32.26s]
15:23:52  70 of 85 START sql table model dbt_synthea_dev.int__final_visit_ids ............ [RUN]
15:24:03  70 of 85 OK created sql table model dbt_synthea_dev.int__final_visit_ids ....... [OK in 10.63s]
15:24:03  71 of 85 START sql table model dbt_synthea_dev.condition_occurrence ............ [RUN]
15:24:15  71 of 85 OK created sql table model dbt_synthea_dev.condition_occurrence ....... [OK in 12.92s]
15:24:15  72 of 85 START sql table model dbt_synthea_dev.device_exposure ................. [RUN]
15:24:21  72 of 85 OK created sql table model dbt_synthea_dev.device_exposure ............ [OK in 5.36s]
15:24:21  73 of 85 START sql table model dbt_synthea_dev.drug_exposure ................... [RUN]
15:25:27  73 of 85 OK created sql table model dbt_synthea_dev.drug_exposure .............. [OK in 66.43s]
15:25:27  74 of 85 START sql table model dbt_synthea_dev.measurement ..................... [RUN]
15:26:11  74 of 85 OK created sql table model dbt_synthea_dev.measurement ................ [OK in 43.41s]
15:26:11  75 of 85 START sql table model dbt_synthea_dev.observation ..................... [RUN]
15:26:38  75 of 85 OK created sql table model dbt_synthea_dev.observation ................ [OK in 27.04s]
15:26:38  76 of 85 START sql table model dbt_synthea_dev.procedure_occurrence ............ [RUN]
15:26:47  76 of 85 OK created sql table model dbt_synthea_dev.procedure_occurrence ....... [OK in 8.86s]
15:26:47  77 of 85 START sql table model dbt_synthea_dev.visit_detail .................... [RUN]
15:27:10  77 of 85 OK created sql table model dbt_synthea_dev.visit_detail ............... [OK in 23.57s]
15:27:10  78 of 85 START sql table model dbt_synthea_dev.visit_occurrence ................ [RUN]
15:27:33  78 of 85 OK created sql table model dbt_synthea_dev.visit_occurrence ........... [OK in 23.03s]
15:27:33  79 of 85 START sql table model dbt_synthea_dev.condition_era ................... [RUN]
15:27:40  79 of 85 OK created sql table model dbt_synthea_dev.condition_era .............. [OK in 7.18s]
15:27:40  80 of 85 START sql table model dbt_synthea_dev.drug_era ........................ [RUN]
15:28:43  80 of 85 OK created sql table model dbt_synthea_dev.drug_era ................... [OK in 62.78s]
15:28:43  81 of 85 SKIP relation dbt_synthea_dev.int__cost_condition ..................... [SKIP]
15:28:43  82 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_1 ....... [RUN]
15:28:48  82 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_1 .. [OK in 4.84s]
15:28:48  83 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_2 ....... [RUN]
15:29:21  83 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_2 .. [OK in 32.66s]
15:29:21  84 of 85 START sql table model dbt_synthea_dev.int__cost_procedure ............. [RUN]
15:29:29  84 of 85 OK created sql table model dbt_synthea_dev.int__cost_procedure ........ [OK in 7.95s]
15:29:29  85 of 85 SKIP relation dbt_synthea_dev.cost .................................... [SKIP]
15:29:29
15:29:29  Finished running 85 table models in 0 hours 15 minutes and 31.08 seconds (931.08s).
15:29:29
15:29:29  Completed with 2 errors and 0 warnings:
15:29:29
15:29:29    Runtime Error in model stg_synthea__claims_transactions (models/staging/synthea/stg_synthea__claims_transactions.sql)
  Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).
  This limit was set by the 'max_temp_directory_size' setting.
  By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
  You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
15:29:29
15:29:29    Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql)
  Parser Error: SELECT clause without selection list
15:29:29
15:29:29  Done. PASS=80 WARN=0 ERROR=2 SKIP=3 TOTAL=85

For reference, here are the size of the files I used:

146M allergies.csv
354M careplans.csv
 28G claims.csv
136G claims_transactions.csv
3.4G conditions.csv
 19M devices.csv
9.8G encounters.csv
 11G imaging_studies.csv
225M immunizations.csv
 11G medications.csv
 11G observations.csv
1.3M organizations.csv
317M patients.csv
2.5K payers.csv
8.3G payer_transitions.csv
1.7G procedures.csv
 11M providers.csv
183M supplies.csv

---

total 218G

@TheCedarPrince
Copy link
Collaborator Author

@lawrenceadams

So there are two approaches:

  1. we could either modify the loader script to glob files , or...
  2. more simply - given duckdb doesn't need files loaded before it can use them (@katy-sadowski might correct me - but my understanding as to why we have a step where we actively load files into the pipeline is to keep the source.yml files identical between database systems e.g. Postgres/Duckdb/etc can use the same code), you could modify a sources.yml file to incorporate something like the below:

[...]

Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier 😄

It does because, looking at my above comment, I now realize the problem is not so much the loading of data but in the ETL step -- I was unclear where the error was coming from but I have narrowed it above where you can see ETL'ing fails.

So, what I would like is to load multiple patient files chunked however I want (by like 100K patients per data generation) and then ETL those patients per data generation into a final duckdb file that continues to grow as I continue adding in new patients. As it stands right now, my understanding is that it is only one shot. I cannot add new patients to an old duckdb database made with dbt-synthea after I run dbt run initially.

Is that helping to frame more my problem?

@lawrenceadams
Copy link
Collaborator

Yeah this makes sense - I think a problem is that in this case the duckdb query planner fails and cannot figure out what is going on anymore:

Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).

Looks like the query planner has gone berserk there - unsure what is going on... Could you share what platform you're on and how much RAM you have?

A fun problem 😆

I'm going to try tomorrow - let me know if you have any breakthroughs!

@TheCedarPrince
Copy link
Collaborator Author

Could you share what platform you're on and how much RAM you have?

Fedora 39 (Desktop)
RAM: 64GB Available

Looks like the query planner has gone berserk there - unsure what is going on...

Out of curiosity, is my intuition here that chunking up how many patients I am ETL'ing each time a good thought? Or is this problem unrelated to what solution I wanted sketched out above?

@lawrenceadams
Copy link
Collaborator

I'd have thought that would be enough! Interesting... 🤔

Yes that's definitely a way of doing it! Two ways of doing this spring to my mind:

  1. Export each OMOP (final models) at the end of each run to file e.g. parquet; and then stitch them together at the end / leave as partitioned files per se. [This is very janky - but I suppose one appraoch]
  2. The more logical approach would be to use incremental models and dbt run 10 times with the sources being different each time

Not sure if either is super ideal but it is possible!

@lawrenceadams
Copy link
Collaborator

One thing I hadn't appreciated is that every model gets materialised as a table - in reality - especially when using huge source sizes like we are now - it might be best to set the staging models (which do nothing but project ) to materialize as views:

models:
synthea_omop_etl:
+materialized: table

Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)!

@lawrenceadams
Copy link
Collaborator

@TheCedarPrince I got a bit further and then realised I had forgot to seed in the vocabulary 🙃

It did manage to get stg_synthea__claims_transactions to work though - unsure how best to fix!

dbt run --fail-fast
17:53:10  Running with dbt=1.8.7
17:53:11  Registered adapter: duckdb=1.8.0
17:53:11  Found 86 models, 29 seeds, 425 data tests, 29 sources, 538 macros
17:53:11  
17:53:11  Concurrency: 1 threads (target='dev')
17:53:11  
17:53:11  1 of 86 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
17:53:13  1 of 86 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.73s]
17:53:13  2 of 86 START sql table model dbt_synthea_dev.episode .......................... [RUN]
17:53:15  2 of 86 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.80s]
17:53:15  3 of 86 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
17:53:16  3 of 86 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.63s]
17:53:16  4 of 86 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
17:53:18  4 of 86 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.61s]
17:53:18  5 of 86 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
17:53:20  5 of 86 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.81s]
17:53:20  6 of 86 START sql table model dbt_synthea_dev.note ............................. [RUN]
17:53:21  6 of 86 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.63s]
17:53:21  7 of 86 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
17:53:23  7 of 86 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.64s]
17:53:23  8 of 86 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
17:53:25  8 of 86 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.78s]
17:53:25  9 of 86 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
17:53:26  9 of 86 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.64s]
17:53:26  10 of 86 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
17:53:30  10 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 3.08s]
17:53:30  11 of 86 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
17:53:33  11 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 3.35s]
17:53:33  12 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
17:54:25  12 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 52.08s]
17:54:25  13 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [RUN]
17:58:35  13 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [OK in 249.68s]
17:58:35  14 of 86 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
17:58:40  14 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 5.35s]
17:58:40  15 of 86 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
17:58:43  15 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 2.92s]
17:58:43  16 of 86 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
17:58:52  16 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 8.92s]
17:58:52  17 of 86 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
17:59:12  17 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies  [OK in 19.78s]
17:59:12  18 of 86 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
17:59:15  18 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 3.68s]
17:59:15  19 of 86 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
17:59:23  19 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 7.71s]
17:59:23  20 of 86 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
18:00:08  20 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 44.97s]
18:00:08  21 of 86 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
18:00:11  21 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 3.00s]
18:00:11  22 of 86 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
18:00:15  22 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 3.84s]
18:00:15  23 of 86 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
18:00:23  23 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions  [OK in 7.10s]
18:00:23  24 of 86 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
18:00:26  24 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 2.98s]
18:00:26  25 of 86 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
18:00:34  25 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 8.31s]
18:00:34  26 of 86 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
18:00:37  26 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 3.17s]
18:00:37  27 of 86 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
18:00:40  27 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 3.35s]
18:00:40  28 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
18:00:40  28 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept  [ERROR in 0.06s]
18:00:40  29 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [RUN]
18:00:41  CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41  29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [ERROR in 0.04s]
18:00:41  
18:00:41    Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41  
18:00:41  Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41  
18:00:41  Completed with 59 errors and 0 warnings:
18:00:41  
18:00:41    Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41  

@TheCedarPrince
Copy link
Collaborator Author

  1. Export each OMOP (final models) at the end of each run to file e.g. parquet; and then stitch them together at the end / leave as partitioned files per se. [This is very janky - but I suppose one appraoch]

This actually is really quite reasonable for a naive and quick approach. I'll try that at some point probably tomorrow. As you know, it would have drawbacks of duplicating lots of information -- I immediately think of the vocab tables.

The more logical approach would be to use incremental models and dbt run 10 times with the sources being different each time

Hunh! Not sure what is involved in that approach but seems interesting!

Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)!

AH! That's why my persistent DuckDB database file keeps getting so large! Makes sense.

18:00:41 CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41 29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [ERROR in 0.04s]
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
Parser Error: SELECT clause without selection list
18:00:41
18:00:41 Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41
18:00:41 Completed with 59 errors and 0 warnings:
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
Parser Error: SELECT clause without selection list
18:00:41

Yea, I am unsure what this means as well...

@katy-sadowski
Copy link
Collaborator

Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql)
Parser Error: SELECT clause without selection list

@TheCedarPrince can you check the compiled SQL for this?

Regarding running dbt in batches, we need to be careful because certain tables are "shared" across patients like care_site, location, and provider. I'm guessing dbt's incremental mode would handle this sort of thing? But I haven't looked into it much. My overall take is I'd prefer to avoid these sorts of approaches if at all possible and rather understand:

  • if there are things going on in the SQL queries as written that are consuming lots of memory
  • what is even reasonable to expect for a local run on large data

As I understand it, incremental mode was designed for huge datasets with the need for very frequent updates, which is not the typical use case for OMOP CDMs.

Finally, regarding the duplication of data across stage/int/marts - that's what I hope to address in #38 . Keeping all those tables around is only useful for debugging purposes while developing the ETL.

@lawrenceadams
Copy link
Collaborator

Re: incremental, you're right: it's really made for updates every n hours/days etc.

If duckdb is unable to cope then we'd probably need a different database engine in this insistence, but this does seem like an extreme stress test/case.

I'd test on Postgres but it'd almost definitely run out of space

@lawrenceadams
Copy link
Collaborator

lawrenceadams commented Oct 26, 2024

I have (with some tuning) eventually managed to get every table build... except observations which runs out of memory sadly for me.

observations.sql has a lot of complex joins using big tables and this is what can upset duckdb looking through the docs.

Here is the query plan for it for reference:

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       observation_id      │
│         person_id         │
│   observation_concept_id  │
│      observation_date     │
│    observation_datetime   │
│observation_type_concept_id│
│            NULL           │
│            NULL           │
│    value_as_concept_id    │
│    qualifier_concept_id   │
│      unit_concept_id      │
│        provider_id        │
│    visit_occurrence_id    │
│      visit_detail_id      │
│  observation_source_value │
│observation_source_concept_│
│             id            │
│            NULL           │
│            NULL           │
│            NULL           │
│            NULL           │
│            NULL           │
│                           │
│      ~319014617 Rows      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #0            │
│             #1            │
│             #2            │
│             #3            │
│             #4            │
│             #5            │
│             #6            │
│             #7            │
│             #8            │
│             #9            │
│            #10            │
│            #11            │
│            #12            │
│            #13            │
│            #14            │
│            #15            │
│                           │
│      ~319014617 Rows      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           WINDOW          │
│    ────────────────────   │
│        Projections:       │
│  ROW_NUMBER() OVER (ORDER │
│   BY person_id ASC NULLS  │
│            LAST)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│        patient_id =       │                                                                                                                                  │
│     person_source_value   │                                                                                                                                  │
│                           │                                                                                                                                  │
│      ~319014617 Rows      │                                                                                                                                  │
└─────────────┬─────────────┘                                                                                                                                  │
┌─────────────┴─────────────┐                                                                                                                    ┌─────────────┴─────────────┐
│         PROJECTION        │                                                                                                                    │         SEQ_SCAN          │
│    ────────────────────   │                                                                                                                    │    ────────────────────   │
│             #0            │                                                                                                                    │           person          │
│             #1            │                                                                                                                    │                           │
│             #2            │                                                                                                                    │        Projections:       │
│             #3            │                                                                                                                    │    person_source_value    │
│             #4            │                                                                                                                    │         person_id         │
│             #5            │                                                                                                                    │                           │
│             #6            │                                                                                                                    │                           │
│             #7            │                                                                                                                    │                           │
│             #8            │                                                                                                                    │                           │
│             #9            │                                                                                                                    │                           │
│            #10            │                                                                                                                    │                           │
│            #11            │                                                                                                                    │                           │
│__internal_decompress_integ│                                                                                                                    │                           │
│     ral_bigint(#12, 1)    │                                                                                                                    │                           │
│                           │                                                                                                                    │                           │
│      ~253445155 Rows      │                                                                                                                    │       ~1144346 Rows       │
└─────────────┬─────────────┘                                                                                                                    └───────────────────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: LEFT      │
│                           │
│        Conditions:        ├─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│encounter_id = encounter_id│                                                                                                     │
│  patient_id = patient_id  │                                                                                                     │
│                           │                                                                                                     │
│      ~253445155 Rows      │                                                                                                     │
└─────────────┬─────────────┘                                                                                                     │
┌─────────────┴─────────────┐                                                                                       ┌─────────────┴─────────────┐
│         PROJECTION        │                                                                                       │         PROJECTION        │
│    ────────────────────   │                                                                                       │    ────────────────────   │
│             #0            │                                                                                       │             #0            │
│             #1            │                                                                                       │             #1            │
│             #2            │                                                                                       │__internal_compress_integra│
│             #3            │                                                                                       │     l_usmallint(#2, 1)    │
│             #4            │                                                                                       │                           │
│             #5            │                                                                                       │                           │
│             #6            │                                                                                       │                           │
│             #7            │                                                                                       │                           │
│             #8            │                                                                                       │                           │
│__internal_decompress_integ│                                                                                       │                           │
│     ral_bigint(#9, 1)     │                                                                                       │                           │
│                           │                                                                                       │                           │
│      ~253445155 Rows      │                                                                                       │       ~58246069 Rows      │
└─────────────┬─────────────┘                                                                                       └─────────────┬─────────────┘
┌─────────────┴─────────────┐                                                                                       ┌─────────────┴─────────────┐
│         HASH_JOIN         │                                                                                       │         SEQ_SCAN          │
│    ────────────────────   │                                                                                       │    ────────────────────   │
│      Join Type: LEFT      │                                                                                       │  int__encounter_provider  │
│                           │                                                                                       │                           │
│        Conditions:        │                                                                                       │        Projections:       │
│encounter_id = encounter_id├────────────────────────────────────────────────────────────────────────┐              │        encounter_id       │
│                           │                                                                        │              │         patient_id        │
│                           │                                                                        │              │        provider_id        │
│                           │                                                                        │              │                           │
│      ~253445155 Rows      │                                                                        │              │       ~58246069 Rows      │
└─────────────┬─────────────┘                                                                        │              └───────────────────────────┘
┌─────────────┴─────────────┐                                                          ┌─────────────┴─────────────┐
│           UNION           │                                                          │         PROJECTION        │
│                           │                                                          │    ────────────────────   │
│                           │                                                          │             #0            │
│                           ├───────────────────────────────────────────┐              │__internal_compress_integra│
│                           │                                           │              │     l_uinteger(#1, 1)     │
│                           │                                           │              │                           │
│                           │                                           │              │       ~56156413 Rows      │
└─────────────┬─────────────┘                                           │              └─────────────┬─────────────┘
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        │                             │         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   │                             │    ────────────────────   ││    ────────────────────   │
│             #0            │                             │int__observation_observatio││    int__final_visit_ids   │
│             #1            │                             │             ns            ││                           │
│             #2            │                             │                           ││        Projections:       │
│             #3            │                             │        Projections:       ││        encounter_id       │
│   CAST(#4 AS TIMESTAMP)   │                             │         patient_id        ││  visit_occurrence_id_new  │
│             #5            │                             │        encounter_id       ││                           │
│             #6            │                             │   observation_concept_id  ││                           │
│             #7            │                             │      observation_date     ││                           │
│                           │                             │    observation_datetime   ││                           │
│                           │                             │observation_type_concept_id││                           │
│                           │                             │  observation_source_value ││                           │
│                           │                             │observation_source_concept_││                           │
│                           │                             │             id            ││                           │
│                           │                             │                           ││                           │
│       ~29567613 Rows      │                             │      ~223877542 Rows      ││       ~56156413 Rows      │
└─────────────┬─────────────┘                             └───────────────────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐
│           UNION           ├──────────────┐
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│ int__observation_allergies││int__observation_conditions│
│                           ││                           │
│        Projections:       ││        Projections:       │
│         patient_id        ││         patient_id        │
│        encounter_id       ││        encounter_id       │
│   observation_concept_id  ││   observation_concept_id  │
│      observation_date     ││      observation_date     │
│    observation_datetime   ││    observation_datetime   │
│observation_type_concept_id││observation_type_concept_id│
│  observation_source_value ││  observation_source_value │
│observation_source_concept_││observation_source_concept_│
│             id            ││             id            │
│                           ││                           │
│        ~719717 Rows       ││       ~28847896 Rows      │
└───────────────────────────┘└───────────────────────────┘

It seems like DuckDB will try to spill to disk where it can, including for operations such as joins/orders/windowing it is unable to - but the docs note that if you many of these it may fail to use the disk - which is what I imagine is happening here!

@lawrenceadams
Copy link
Collaborator

I had some success running in Snowflake (XS Warehouse) - the same queries causing it to fail locally are causing large spill to local warehouse storage. In total took about 20 minutes to run the whole project (without running tests) (~0.3 credits ~ $0.80 pre-tax 💸 )

These are the queries with their metrics if of interest! Metrics (CSV)

I will try to dig in and see what the worst offenders are, but it is measurements/observations at a glance; both of which have large joins with few predicates as far as I can see!

@katy-sadowski
Copy link
Collaborator

Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql)
Parser Error: SELECT clause without selection list

I realized what this is. The vocab download doesn't come with a source_to_concept_map (duh). I'll file a ticket to handle this table in BYO vocab mode.

Also, I'm looking at the observation SQL. I'm surprised to see there's nothing crazy going on in there. This AND might not be needed, and we could potentially fold that encounters/providers mapper into the visit int table to skip a join.

@lawrenceadams , what were the runtimes for the other models that succeeded / which were the slowest?

@lawrenceadams
Copy link
Collaborator

Apologies @katy-sadowski I ran out of time to dig in to the results over the weekend!

These are the 5 slowest models, and generally the ones that needed the most out of memory processing (arguably a moot point as we could reduce the number of threads [threads=8] / use a bigger warehouse - but still interesting).

Query ID Model Total Elapsed Time (s) Bytes Scanned (GB) % Scanned from Cache Bytes Written to Result (GB) Rows Produced Rows Written to Result Bytes (GB) spilled to Local Storage Execution Time (ms)
01b7f2aa-0000-673b-0000-f5a10005ea16 observation 754,447 (~12.5 min) 4.1 GB 92.30% 3.3 GB 253,447,438 1 35.693396 753,980 ms (~12.5 min)
01b7f2aa-0000-674d-0000-f5a100063052 visit_detail 370,289 (~6.2 min) 3.4 GB 82% 2.0 GB 54,933,090 1 16.240853 369,871 ms (~6.2 min)
01b7f2aa-0000-6713-0000-f5a100062362 visit_occurrence 368,723 (~6.1 min) 3.4 GB 83.60% 1.9 GB 54,933,090 1 15.9126733 368,252 ms (~6.1 min)
01b7f2aa-0000-673b-0000-f5a10005ea1e procedure_occurrence 345,095 (~5.8 min) 3.9 GB 57.70% 1.3 GB 45,071,224 1 20.5998039 344,142 ms (~5.8 min)
01b7f2b0-0000-6714-0000-f5a1000614a6 int__cost_drug_exposure_2 319,436 (~5.3 min) 3.8 GB 79.20% 0.8 GB 53,469,529 1 10.4926577 318,674 ms (~5.3 min)

Looking at the most expensive nodes in observation:

image

This join is the most expensive, followed by the one you linked to, followed by the window function


I agree @katy-sadowski that second join predicate looks redundant - although interestingly when running that model by itself with and without the join on the patient ID results in 79 fewer rows being returned (253447517 [join on encounter_id] - 253447438 [join on both encounter and patient ID] = 79), which is interesting... I'll need to inspect why!

When running the model by itself they're much faster (Only done on observation but took 2 minutes instead of 12) - at some point I'll run on snowflake with one thread to see what happens; it will be more useful!

@katy-sadowski
Copy link
Collaborator

Interesting thanks @lawrenceadams !

This join is the most expensive, followed by the one you linked to, followed by the window function

In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.

interestingly when running that model by itself with and without the join on the patient ID results in 79 fewer rows being returned (253447517 [join on encounter_id] - 253447438 [join on both encounter and patient ID] = 79)

weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.

When running the model by itself they're much faster (Only done on observation but took 2 minutes instead of 12) - at some point I'll run on snowflake with one thread to see what happens; it will be more useful!

you mean dbt run --model observation?

@lawrenceadams
Copy link
Collaborator

lawrenceadams commented Nov 2, 2024

In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.

Quite possibly, from memory there is no filtering that happens at any point so we'd have to pay the price at some point. Maybe worth trying!

When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!

weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.

Great shout - I did this and there are 79 cases which violate this assumption! No idea how this happens as I can't imagine it's a feature of Synthea?

image

It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?

you mean dbt run --model observation?

if effect yep, I re ran it with dbt run --threads 1 just now and it was similar (measurement, observation, int__assign_all_visit_ids, int__cost_drug_exposure_2, int__cost_procedure were the top 5 offenders!)

@katy-sadowski
Copy link
Collaborator

When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!

Maybe a good discussion topic for our next meeting.

It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?

I will try. I'm guessing it's a bug in Synthea (or maybe a weird feature like this one). Once I've got my mega dataset I can start experimenting with some of these optimization ideas too :)

@TheCedarPrince
Copy link
Collaborator Author

I can try replicating! What do you want me to try doing @lawrenceadams ?

@lawrenceadams
Copy link
Collaborator

Maybe a good discussion topic for our next meeting.

Great idea!

@lawrenceadams
Copy link
Collaborator

I can try replicating! What do you want me to try doing @lawrenceadams ?

Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above?

@TheCedarPrince
Copy link
Collaborator Author

I can try replicating! What do you want me to try doing @lawrenceadams ?

Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above?

So you want me to generate a huge synthea dataset? Would 1 million patients with 3 or 5 year retrospective be enough to be "huge" for you @lawrenceadams?

I can generate that and then check out the different patient ID. :)

@katy-sadowski
Copy link
Collaborator

katy-sadowski commented Dec 3, 2024

Hi guys! I finally had some time to run dbt on my million-patient Synthea dataset (duckdb, single-threaded run). Some initial findings to report:

  • I experienced one OOM error at the stg layer (claims_transactions). However, this was easily fixed by changing stg models to materialize as views instead of tables. Doing this also greatly sped up creation of some other stg models. I think I'd like to commit this change to the repo at large (something we'd already discussed above)
  • I got an OOM in building the final observation model. But I fixed it by moving the joins down into the intermediate layer models instead!
  • drug_exposure and measurement models also failed with OOM. I'm going to look into potential optimizations for those models next
  • I also got some unique key dbt test failures but going to address those later once I see if I can get everything to run start to finish 😄

@katy-sadowski
Copy link
Collaborator

katy-sadowski commented Dec 27, 2024

Hi again guys! I was able to get all models to build on my millon-patient dataset in duckdb after making the changes in this PR: #109

There were some really easy wins in here so I'm excited to dig into more of the code to see what else can be cleaned up :)

I also filed this issue with Synthea because some of my source tables had dupe PKs: synthetichealth/synthea#1545

@katy-sadowski
Copy link
Collaborator

katy-sadowski commented Jan 25, 2025

Performance on "big Synthea" continues to improve with #111 - at least for me! Feel free to test it yourselves by checking out my branch and giving it a run :)

btw @lawrenceadams I am thinking we might want to get rid of all the order by columns in our row_number ID assignments. it's a drag on performance with little benefit since it's rarely (if ever) actually deterministic. i think there's a whole separate discussion worth having on the best way to approach ID assignment (which can extend into topics like support for incremental runs). thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants