This repository contains the implementation of a predictive modeling project for diabetes classification and risk analysis. Using machine learning techniques and advanced interpretability methods, the project aims to enhance healthcare decision-making by analyzing the CDC Diabetes Health Indicators dataset.
Diabetes is a growing public health challenge, making early identification of at-risk individuals crucial for effective intervention. This project develops machine learning models to classify individuals as healthy, pre-diabetic, or diabetic based on health, lifestyle, and demographic data.
Key objectives:
- Develop interpretable models for diabetes prediction.
- Identify significant risk factors using SHAP analysis.
- Generate personalized health reports with Large Language Models (LLMs).
The dataset includes 21 features, such as BMI, physical activity, cholesterol levels, and demographic factors, providing comprehensive insight into diabetes risks.
- Dimensionality Reduction: PCA, t-SNE, and autoencoders reduce dataset complexity while preserving critical insights.
- Clustering: K-means and stratified K-means clustering uncover distinct health profiles.
- Classification: Decision trees, Random Forests, AdaBoost, and neural networks classify individuals as healthy, pre-diabetic, or diabetic.
- Interpretability: SHAP analysis provides feature importance and transparency for predictive models.
- Generative Reports: LLM integration produces patient-specific, understandable medical reports.
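The first three steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repository's actual code: it runs on synthetic data from `make_classification` (not the CDC dataset), and the sample size, number of components, number of clusters, and Random Forest settings are all assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 21 features and 3 classes
# (0 = healthy, 1 = pre-diabetic, 2 = diabetic). This is NOT the CDC dataset.
X, y = make_classification(
    n_samples=600, n_features=21, n_informative=8, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Hold out a stratified test set so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize features; the scaler is fit on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Dimensionality reduction: project the 21 features onto 2 principal components.
pca = PCA(n_components=2).fit(X_train_s)
X_train_2d = pca.transform(X_train_s)

# Clustering: group the projected samples into 3 clusters (assumed number).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_2d)
cluster_labels = kmeans.labels_

# Classification: fit a Random Forest on the full standardized feature set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_s, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test_s))

print(f"explained variance (2 PCs): {pca.explained_variance_ratio_.sum():.2f}")
print(f"test accuracy on synthetic data: {accuracy:.2f}")
```

The accuracy printed here describes the synthetic data only and says nothing about performance on the real dataset.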
- Programming Language: Python
- Libraries: Scikit-learn, TensorFlow, SHAP, Matplotlib
- Data Source: CDC Diabetes Health Indicators Dataset
- Generative AI: Google Generative AI API
- Clone the repository:
git clone <repository-url>
cd <repository-directory>
- Install the dependencies:
pip install -r requirements.txt
- Download the dataset and place it in the data/ directory (if not already present)
- Generate an API key: visit Google AI Studio to generate your key and paste it in line 6 of helper.py
- Run the Project: use the main entry script to create personalized reports with the LLM tool
python main.py
- Explore Individual Modules: execute the scripts for different analyses
- Dimensionality Reduction:
- Clustering:
python k_means/kmeans.py
- Classification:
python tree_based_classification/main.py
- Neural Networks:
- Model Performance: Achieved ~85% accuracy for diabetes classification using neural networks.
- Key Findings: Physical health, BMI, and age were the most influential features for diabetes prediction.
- Reports: Patient-specific recommendations were successfully generated using LLMs.
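The feature ranking and the reports above both rest on per-patient feature contributions. The sketch below shows the idea behind SHAP by computing exact Shapley values through brute-force enumeration of feature coalitions, then picking the top factors that a report prompt could mention. It does not use the SHAP library or the repository's models: `toy_risk`, its coefficients, the feature names, the patient and baseline values, and `build_report_prompt` are all hypothetical, and the actual LLM call is omitted.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value.
    Cost grows exponentially with the number of features, so this only
    suits tiny examples; SHAP uses faster model-specific algorithms.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(p):
    """Hypothetical risk score with one interaction term (illustration only)."""
    return (0.03 * p["bmi"] + 0.01 * p["age"]
            + 0.02 * p["poor_physical_health_days"]
            + 0.001 * p["bmi"] * p["poor_physical_health_days"])


def top_factors(contributions, k):
    """Names of the k features with the largest absolute contribution."""
    ranked = sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)
    return ranked[:k]


def build_report_prompt(patient, contributions, k=2):
    """Assemble a plain-text prompt; sending it to an LLM is left out."""
    lines = [f"- {f} = {patient[f]} (contribution {contributions[f]:+.3f})"
             for f in top_factors(contributions, k)]
    return ("Write a short, plain-language health summary. "
            "Main risk factors for this patient:\n" + "\n".join(lines))


# Made-up patient and reference values, for illustration only.
patient = {"bmi": 34.0, "age": 58, "poor_physical_health_days": 12}
baseline = {"bmi": 25.0, "age": 45, "poor_physical_health_days": 2}

phi = shapley_values(toy_risk, patient, baseline)
print(build_report_prompt(patient, phi))
```

A useful sanity check is that the contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that SHAP explanations also satisfy.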
This project was completed as part of CS483 coursework under the guidance of Prof. Lu Cheng. Special thanks to team Spaghetti Coders:
- Eleonora Cabai
- Filippo Corna
- Davide Ettori
- Patrik Poggi