Analyzing biological data for anomaly detection using Minimum Enclosing Ball (MEB) and variations of the Frank-Wolfe algorithm.
Anomaly detection, biological data
- About the Project
- Key Features
- Key Results
- Data Overview
- Methodology
- Screenshots and Graphs
- Technologies Used
- Setup & Installation
- Usage
- Contributing
- License
- Contact
This project explores the Minimum Enclosing Ball (MEB) problem and its applications in anomaly detection across biological datasets. We implement three Frank-Wolfe algorithm variants: Pairwise Frank-Wolfe, Blended Pairwise Conditional Gradient (BPCG), and a (1+ϵ)-approximation using Away-Steps. These methods are evaluated for computational efficiency, convergence rates, and anomaly detection performance in biological contexts.
- Anomaly Detection: Identification of anomalies in biological datasets.
- Efficient Algorithms: Implementation of three Frank-Wolfe-based algorithms, each offering unique advantages in convergence and computational performance.
- Biological Data Application: Practical application in detecting anomalies in datasets like breast cancer, gene expression, vertebral pathology, and maternity risk.
- Convergence: All three algorithms demonstrated linear convergence on the MEB problem.
- Best Performance: The (1+ϵ)-approximation (Yildirim 2008) showed superior computational performance, especially on high-dimensional datasets.
- Recall Metrics: Focused on recall as the benchmark for anomaly detection, achieving optimal recall values for most datasets.
This study uses four biological datasets focused on anomaly detection. Links to each dataset:
- Breast Cancer: Wisconsin Diagnostic Breast Cancer Dataset
- Breast Cancer Gene Expression: Gene Expression Cancer RNA-Seq
- Vertebral Column Pathology: Vertebral Column Dataset
- Maternity Risk: Maternity Risk Dataset
We apply three variants of the Frank-Wolfe algorithm to solve the MEB problem:
- Pairwise Frank-Wolfe: An efficient, projection-free algorithm.
- Blended Pairwise Conditional Gradients (BPCG): An optimized variant that minimizes swap steps.
- (1+ϵ)-Approximation with Away-Steps: An approximation method designed for efficient handling of large datasets.
Each algorithm is evaluated on convergence rates, computational time, and accuracy metrics (primarily recall) in detecting anomalies.
- MEB for Maternity Risk dataset
-
Computational Time and Iterations (Table)
Summary table comparing computational time and iterations for each algorithm across datasets.Dataset Algorithm Time (ms) Iterations Breast Cancer PFW 664.50 1,484 Breast Cancer BPCG 647.54 2,998 Breast Cancer MEB(A) 49.15 44 Gene Expression PFW 369,959.76 100,000 Gene Expression BPCG 397,067.57 100,000 Gene Expression MEB(A) 2,116.30 44
🛠️ Emphasizing the primary tools and libraries utilized.
: Main programming language.
- Optimization Algorithms: Implemented variations of the Frank-Wolfe algorithm.
- nbviewer Link: View the Jupyter Notebook
Clone the repository and install dependencies:
# Clone the repository
git clone https://github.com/username/MEBMedBio.git
# Navigate to the project directory
cd MEBMedBio
# Install dependencies
pip install -r requirements.txt
- Code:
MEB_BPCG.ipynb
- Report:
Report.pdf
- Presentation:
Presentation.pdf
The repository includes the following files:
MEB_BPCG.ipynb
: Jupyter notebook with the full workflow, from data preprocessing to algorithm implementation and evaluation.Report.pdf
: Detailed report on methodology, algorithmic analysis, and findings.Presentation.pdf
: Summary presentation of key findings and insights.
To run the project, open MEB_BPCG.ipynb
in Jupyter Notebook or view it on nbviewer.
Contributions are welcome! Please see the contributing guidelines for more details.