CS483 - Big Data Mining Project: Diabetes Prediction and Analysis

This repository implements a predictive modeling project for diabetes classification and risk analysis. It applies machine learning models and SHAP-based interpretability methods to the CDC Diabetes Health Indicators dataset, with the aim of supporting healthcare decision-making.

Overview

Diabetes is a growing public health challenge, making early identification of at-risk individuals crucial for effective intervention. This project develops machine learning models to classify individuals as healthy, pre-diabetic, or diabetic based on health, lifestyle, and demographic data.

Key objectives:

Develop interpretable models for diabetes prediction.
Identify significant risk factors using SHAP analysis.
Generate personalized health reports with Large Language Models (LLMs).

The dataset includes 21 features, such as BMI, physical activity, cholesterol levels, and demographic factors, providing comprehensive insight into diabetes risks.

Features

Dimensionality Reduction: PCA, t-SNE, and autoencoders project the 21 features into a lower-dimensional space while preserving as much of the data's structure as possible.
Clustering: K-means and stratified K-means clustering uncover distinct health profiles.
Classification: Decision trees, Random Forests, AdaBoost, and neural networks classify individuals as healthy, pre-diabetic, or diabetic.
Interpretability: SHAP analysis provides feature importance and transparency for predictive models.
Generative Reports: LLM integration produces patient-specific, understandable medical reports.
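
The first three stages listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: it uses synthetic data from `make_classification` as a stand-in for the real dataset (21 features, 3 classes), and every parameter value (number of components, clusters, trees) is an assumption chosen for the example. The Random Forest's built-in impurity-based importances are shown in place of SHAP values, which the project computes separately.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 21 features, 3 classes (assumption).
X, y = make_classification(
    n_samples=600, n_features=21, n_informative=8, n_redundant=2,
    n_classes=3, random_state=0,
)

# Stratified split keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Dimensionality reduction: project the scaled features onto 2 components.
pca = PCA(n_components=2, random_state=0).fit(X_train_s)
X_train_2d = pca.transform(X_train_s)

# Clustering: group the training rows into 3 clusters (unsupervised).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)
cluster_labels = kmeans.labels_

# Classification: Random Forest evaluated on the held-out part.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train_s, y_train)
accuracy = forest.score(X_test_s, y_test)

# Impurity-based importances (not SHAP): indices of the 5 top-ranked features.
top_features = np.argsort(forest.feature_importances_)[::-1][:5]

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cluster sizes:", np.bincount(cluster_labels))
print("Test accuracy:", round(accuracy, 3))
print("Top feature indices:", top_features)
```

Scaling before PCA and K-means matters here because both are distance-based and would otherwise be dominated by features with large numeric ranges.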

Technologies

Programming Language: Python
Libraries: Scikit-learn, TensorFlow, SHAP, Matplotlib
Data Source: CDC Diabetes Health Indicators Dataset
Generative AI: Google Generative AI API

Setup

Clone the repository:

git clone <repository-url>
cd <repository-directory>

Install the dependencies:
```
pip install -r requirements.txt
```
Download the dataset and place it in the data/ directory (if not already present).
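
Before running anything, it can help to confirm that the file is where the scripts expect it and has the expected shape. The following is a small sketch using only the standard library. The function name `summarize_dataset` is hypothetical, and both the file name and the target column name are passed in by the caller because this README does not state them.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_dataset(csv_path, target_column):
    """Return (number of feature columns, row count per target value)."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}")
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or target_column not in reader.fieldnames:
            raise ValueError(f"Column {target_column!r} not found in {path}")
        # Every column except the target counts as a feature.
        n_features = len(reader.fieldnames) - 1
        class_counts = Counter(row[target_column] for row in reader)
    return n_features, dict(class_counts)


# Example call (path and column name are placeholders, adjust to your file):
# n_features, counts = summarize_dataset("data/<your-file>.csv", "<target-column>")
```

For the dataset described in the Overview, the first returned value should be 21, and the second should contain three classes.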

Usage

Generate an API key: visit Google AI Studio, create a key, and paste it on line 6 of helper.py
Run the Project: use the main entry script to create personalized reports with the LLM tool
```
python main.py
```
Explore Individual Modules: execute the scripts for different analyses
- Dimensionality Reduction:
- Clustering: python k_means/kmeans.py
- Classification: python Tree_Based_Classification/main.py
- Neural Networks:
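
The report-generation step can be pictured roughly as below. This is a sketch under stated assumptions, not the contents of helper.py or main.py: the function names, the prompt wording, the `GOOGLE_API_KEY` environment variable, and the default model name are all illustrative choices. Reading the key from the environment is shown as one alternative to pasting it into a source file, which makes it harder to commit the key by accident.

```python
import os


def build_report_prompt(features, predicted_class):
    """Turn one patient's indicators and predicted class into a text prompt."""
    indicator_lines = [f"- {name}: {value}" for name, value in features.items()]
    prompt_lines = [
        "Write a short, plain-language health report for a patient.",
        f"Predicted diabetes status: {predicted_class}",
        "Health indicators:",
        *indicator_lines,
        "Explain the main risk factors and suggest lifestyle changes.",
    ]
    return "\n".join(prompt_lines)


def generate_report(prompt, model_name="gemini-1.5-flash"):
    """Send the prompt to the Google Generative AI API and return the reply text.

    The model name is an assumption; use whichever model your key can access.
    """
    # Imported lazily so the prompt builder works without the package installed.
    import google.generativeai as genai

    api_key = os.environ.get("GOOGLE_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set the GOOGLE_API_KEY environment variable first")
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel(model_name)
    return model.generate_content(prompt).text


if __name__ == "__main__":
    example = build_report_prompt({"BMI": 31, "PhysActivity": 0}, "pre-diabetic")
    print(example)  # Pass this string to generate_report() once a key is set.
```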

Results

Model Performance: Achieved ~85% accuracy for diabetes classification using neural networks.
Key Findings: Physical health, BMI, and age were the most influential features for diabetes prediction.
Reports: Patient-specific recommendations were successfully generated using LLMs.

Acknowledgments

This project was completed as part of CS483 coursework under the guidance of Prof. Lu Cheng. Special thanks to team Spaghetti Coders:

Eleonora Cabai
Filippo Corna
Davide Ettori
Patrik Poggi
