lele1001/CS483-BigDataMining-Project

CS483 - Big Data Mining Project: Diabetes Prediction and Analysis

This repository contains the implementation of a predictive modeling project for diabetes classification and risk analysis. Using machine learning models and SHAP-based interpretability, the project analyzes the CDC Diabetes Health Indicators dataset to support healthcare decision-making.

Overview

Diabetes is a growing public health challenge, making early identification of at-risk individuals crucial for effective intervention. This project develops machine learning models to classify individuals as healthy, pre-diabetic, or diabetic based on health, lifestyle, and demographic data.

Key objectives:

  • Develop interpretable models for diabetes prediction.
  • Identify significant risk factors using SHAP analysis.
  • Generate personalized health reports with Large Language Models (LLMs).

The dataset includes 21 features covering health, lifestyle, and demographic factors, such as BMI, physical activity, and cholesterol levels.


Features

  • Dimensionality Reduction: PCA, t-SNE, and autoencoders reduce dataset complexity while preserving critical insights.
  • Clustering: K-means and stratified K-means clustering uncover distinct health profiles.
  • Classification: Decision trees, Random Forests, AdaBoost, and neural networks classify individuals as healthy, pre-diabetic, or diabetic.
  • Interpretability: SHAP analysis provides feature importance and transparency for predictive models.
  • Generative Reports: LLM integration produces patient-specific, understandable medical reports.
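As a rough illustration of the clustering feature, the sketch below implements plain K-means (Lloyd's algorithm) in pure Python. It is not the code in k_means/kmeans.py: the function name `kmeans`, its parameters, and the two-column sample rows (a BMI-like value and an age-like value) are assumptions made for this example only.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical (BMI, age) rows forming two well-separated groups.
points = [(22.0, 30.0), (23.5, 35.0), (21.0, 28.0),
          (38.0, 62.0), (40.5, 66.0), (39.0, 70.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On real data the features would normally be scaled first, since K-means relies on distances and a wide-ranging column such as age would otherwise dominate.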

Technologies


Setup

  1. Clone the repository:
    git clone <repository-url>
    cd <repository-directory>
    
  2. Install the dependencies:
    pip install -r requirements.txt
    
  3. Download the dataset and place it in the data/ directory (if not already present).
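A quick way to confirm step 3 before running anything is to list what is in data/. The helper below is a minimal sketch that assumes the dataset is stored as one or more CSV files; the function name `find_dataset_files` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_dataset_files(data_dir="data"):
    """Return the CSV files found directly inside data_dir, sorted by name."""
    directory = Path(data_dir)
    if not directory.is_dir():
        return []  # directory missing: nothing has been downloaded yet
    return sorted(directory.glob("*.csv"))


files = find_dataset_files("data")
if files:
    print("Found:", ", ".join(f.name for f in files))
else:
    print("No CSV files in data/; download the dataset first.")
```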

Usage

  1. Generate an API key: visit Google AI Studio to create your key and paste it on line 6 of helper.py.
  2. Run the project: use the main entry script to create personalized reports with the LLM tool:
    python main.py
    
  3. Explore Individual Modules: execute the scripts for different analyses
    • Dimensionality Reduction:
    • Clustering: python k_means/kmeans.py
    • Classification: python tree_based_classification/main.py
    • Neural Networks:
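Step 1 pastes the API key directly into helper.py. If you would rather keep the key out of the source file, one possible alternative is to read it from an environment variable. The sketch below is only a suggestion: the variable name `GOOGLE_API_KEY` and the function `load_api_key` are assumptions, and helper.py would need to be edited by hand to call something like it.

```python
import os


def load_api_key(env_var="GOOGLE_API_KEY", environ=None):
    """Read the API key from an environment variable (the name is an assumption)."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Google AI Studio key."
        )
    return key


# Example with an explicit mapping instead of the real environment.
demo_key = load_api_key(environ={"GOOGLE_API_KEY": "  example-key  "})
print(demo_key)
```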

Results

  • Model Performance: Achieved ~85% accuracy for diabetes classification using neural networks.
  • Key Findings: Physical health, BMI, and age were the most influential features for diabetes prediction.
  • Reports: Patient-specific recommendations were successfully generated using LLMs.
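Feature rankings like the one under Key Findings are the kind of output a SHAP analysis produces. The toy sketch below computes exact Shapley values, the quantity SHAP is built on, by enumerating every feature subset. It does not use the shap library and is not the project's code: the linear `toy_risk` model, its weights, and the sample and baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to subset s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(x):
    """Hypothetical linear risk score over (bmi, age, high_bp)."""
    bmi, age, high_bp = x
    return 0.02 * bmi + 0.01 * age + 0.3 * high_bp


instance = [35.0, 60.0, 1.0]   # hypothetical individual
baseline = [25.0, 40.0, 0.0]   # hypothetical reference values
phi = shapley_values(toy_risk, instance, baseline)
print(phi)  # approximately [0.2, 0.2, 0.3], up to floating-point rounding
```

The contributions sum to the difference between the prediction for the individual and the prediction for the baseline. Exact enumeration grows exponentially with the number of features, so it is only practical for toy cases like this one.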

Acknowledgments

This project was completed as part of CS483 coursework under the guidance of Prof. Lu Cheng. Special thanks to team Spaghetti Coders:

  • Eleonora Cabai
  • Filippo Corna
  • Davide Ettori
  • Patrik Poggi
