AutoML Commit

DanielJanowicz · Nov 16, 2022 · 77b46b9
1 parent de3c08b
Showing 6 changed files with 101,864 additions and 933 deletions.
diff --git a/autoML/autoML.ipynb b/autoML/autoML.ipynb
diff --git a/competition/data/output_noshow1_balanced_2022-11-16.csv b/competition/data/output_noshow1_balanced_2022-11-16.csv
diff --git a/competition/dictionary/.keep b/competition/dictionary/.keep
diff --git a/competition/dictionary/dictionary.py b/competition/dictionary/dictionary.py
@@ -0,0 +1,53 @@
+data_dict = {
+        'id' : 'unique ID of row', 
+        'practice_id' : 'clients + practice identifier', 
+        'appointment_id': 'unique ID of appointment',
+        'patient_id': 'unique ID of patient', 
+        'appointment_date': 'date of appointment',
+        'appointment_start_time': 'start time of appointment',
+        'appointment_duration': 'duration of appointment',
+        'appointment_type_id': 'ID of appointment type',
+        'appointment_type': 'type of appointment',
+        'appointment_status': 'status of appointment',
+        'appointment_date_time': 'date and time of appointment',
+        'appointment_last_modified_date' : 'date of last modification', 
+        'appointment_scheduled_date': 'date the appointment was scheduled', 
+        'appointment_cancelled_date': 'date of cancellation', 
+        'appointment_cancelled_reason'  : 'reason for cancellation', 
+        'appointment_cancel_reason_noshow'  : 'reason for cancellation', 
+        'appointment_yosi_noshow1' : 'reason for cancellation defined by Yosi - cancelled within 24 hours of appointment',
+        'appointment_yosi_noshow2' : 'reason for cancellation defined by Yosi - cancelled within 48 hours of appointment', 
+        'patient_dob': 'date of birth', 
+        'patient_gender' : 'gender', 
+        'patient_age' : 'age of patient', 
+        'data_collect' : 'pre versus post yosi implementation',  
+        'custom_client' : 'excluding last 2 characters of client identifier', 
+        'custom_client_site' : 'last 2 characters of client identifier',
+        'client_site' : 'last 2 characters of client identifier', 
+        'patient_age_groupper' : 'age group',
+        'appointment_date_qt' : 'quarter of the year',
+        'appointment_date_month' : 'month of the year', 
+        'appointment_date_year' : 'year of appointment',
+        'appintmentWithin1DayHoliday' : 'is appointment within 1 day of a holiday', 
+        'appintmentWithin2DayHoliday' : 'is appointment within 2 days of a holiday',
+        'appintmentWithin3DayHoliday' : 'is appointment within 3 days of a holiday', 
+        'appintmentWithin5DayHoliday' : 'is appointment within 5 days of a holiday', 
+        'appintmentWithin7DayHoliday' : 'is appointment within 7 days of a holiday', 
+        'appointment_start_time_groupper' : 'time-of-day group of appointment start',
+        'appointment_start_time_hour' : 'hour of appointment', 
+        'zipcode' : 'zipcode of patient',
+        'weather_conditions' : 'enriched data: weather conditions from VISUALCROSSING', 
+        'weather_icon' : 'enriched data: weather icon from VISUALCROSSING',
+        'geocode_zip' : 'enriched data: geocode zipcode from HUD',
+        'geocode_city' : 'enriched data: geocode city from HUD', 
+        'geocode_county' : 'enriched data: geocode county from HUD',
+        'geocode_state' : 'enriched data: geocode state from HUD',
+        'geocode_stusab' : 'enriched data: geocode stusab from HUD', 
+        'geocode_latitude' : 'enriched data: geocode latitude from HUD', 
+        'geocode_longitude' : 'enriched data: geocode longitude from HUD', 
+        'geocode_lengthlife' : 'enriched data: geocode lengthlife from COUNTYHEALTHRANKINGS/ACS/CDC', 
+        'geocode_healthybehaviors': 'enriched data: geocode healthybehaviors from COUNTYHEALTHRANKINGS/ACS/CDC', 
+        'geocode_clinicalcare': 'enriched data: geocode clinicalcare from COUNTYHEALTHRANKINGS/ACS/CDC',
+        'geocode_socioeconomic': 'enriched data: geocode socioeconomic from COUNTYHEALTHRANKINGS/ACS/CDC',
+        'geocode_physicalenv': 'enriched data: geocode physicalenv from COUNTYHEALTHRANKINGS/ACS/CDC'
+    }
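
One way a dictionary like `data_dict` can be put to use is to check that every column in a data file has a documented description. The following is a minimal sketch of that idea, not part of the commit: the helper name `undocumented_columns`, the three-entry excerpt of the dictionary, and the sample CSV text are all hypothetical and exist only for illustration.

```python
import csv
import io


def undocumented_columns(csv_text, data_dict):
    """Return the header names in csv_text that have no entry in data_dict."""
    # Read only the first row (the header) of the CSV text.
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if name not in data_dict]


# Hypothetical three-entry excerpt of data_dict and a made-up CSV sample.
sample_dict = {
    'id': 'unique ID of row',
    'patient_id': 'unique ID of patient',
    'zipcode': 'zipcode of patient',
}
sample_csv = "id,patient_id,zipcode,not_in_dictionary\n1,42,11794,x\n"

missing = undocumented_columns(sample_csv, sample_dict)
print(missing)
```

Run against the real CSV header, a check of this kind would flag any column that was added during enrichment without a matching dictionary entry.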
diff --git a/competition/goals/.keep b/competition/goals/.keep
diff --git a/competition/goals/goals.md b/competition/goals/goals.md
@@ -0,0 +1,29 @@
+# Competition Rules 
+
+## Background 
+- This dataset provides information on no-shows for clinical appointments at small healthcare clinics across the country. 
+- Please create a new **PRIVATE REPO** called `ahi-competition`. The goal of this competition is to better understand NO SHOWS. In this project you will research no-shows, determine how to enhance the existing dataset with new public data, and then apply traditional and non-traditional approaches to produce one or more of the following: the factors that may predict no-shows, general no-show trends (persona groups, e.g., old, poor, with low tech literacy), or a predictive ML model that can be deployed into production.  
+- The provided dataset is a small balanced sample: 50,000 random encounters in which the patient showed up for the appointment, and 50,000 random encounters in which the patient did not show up. 
+
+### Part 0: Expected folder structure: 
+- /research 
+- /data
+- /scripts 
+- /insights 
+
+### Part 1: Research 
+Find at least 3 research articles (e.g., PubMed) that identify variables likely to be useful for predicting whether a patient shows or does not show for an appointment. 
+- What you find should be placed into a `/research` folder inside of your GitHub repo 
+
+### Part 2: Enhancement 
+Find at least one external, public dataset that can be brought in and merged with the existing dataset. The goal is to enrich the dataset with sociodemographic, clinical, or other non-traditional data that, based on your Part 1 research, you think might be a significant predictor of whether a patient shows or does not show. 
+- the enhancement dataset(s) or API references should be placed or referenced in the `/data` folder inside of your GitHub repo with appropriate naming conventions
+- the scripts that you write to merge/transform the enhanced dataset and original data file should be placed within `/scripts` folder 
+- there should be a new `master.csv` file that is created and stored in the `/data` folder 
+
+### Part 3: Visualization based on descriptives
+Create at least one .py script that utilizes Streamlit to provide visualizations of the baseline data. There should be comparisons that focus on key descriptive differences that may or may not exist between persons that showed versus did not show for their appointments (e.g., gender differences? location differences? weather differences? X differences?)
+
+### Part 4: Statistics and ML 
+Perform at least one traditional statistical test (in R, Python, Stata, or SAS) and one ML test (e.g., MLJAR).
+- place the output of these tests inside of the `/insights` folder
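
As an illustration of what a "traditional statistical test" in Part 4 could look like, the sketch below runs a chi-square test of independence on a 2x2 table (for example, a binary patient attribute versus showed / did not show). It is an assumption-laden example, not the competition's prescribed method: the function name `chi_square_2x2` and the counts are hypothetical, no continuity correction is applied, and the p-value uses the identity that a chi-square variable with 1 degree of freedom has survival function `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No Yates continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are two patient groups,
# columns are (showed, did not show).
counts = [[30, 20], [20, 30]]
chi2, p_value = chi_square_2x2(counts)
print(f"chi2={chi2:.3f}, p={p_value:.4f}")
```

In practice a library routine such as SciPy's `chi2_contingency` would usually be preferred; the hand-rolled version is shown only to keep the sketch dependency-free.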