August 2024 / Lugano / Switzerland
This repository provides a structured approach to implementing and understanding various machine learning techniques across multiple stages of the ML pipeline. It includes essential aspects such as data preprocessing, feature extraction, classifier implementation, and evaluation metrics. Modalities are body motion or body tracking data plus Galvanic Skin Response (GSR) data. But, GSR is only used for multi-modal fusion task.
The repository is organized into seven chapters, each focusing on a different layer of the ML process, from preprocessing data to advanced cross-validation techniques. The goal is to provide a hands-on guide that covers both foundational and advanced methods used in machine learning, including:
- Data Understanding & Preprocessing: Learn how to properly clean and prepare datasets, including handling missing values and performing feature selection.
- Mid-Level Processing: Techniques like dimensionality reduction and multi-modal fusion.
- Classification Algorithms: Implement various machine learning models such as XGBoost, Random Forest, SVM, and others.
- Evaluation Metrics: In-depth coverage of metrics used for model evaluation, such as accuracy, precision, recall, F1-score, ROC-AUC, and more.
- Evaluation Plots: Visual representations of model performance, including confusion matrix heatmaps, violin plots, and more.
- Advanced Techniques: Explore cross-validation methods such as K-Fold, Leave-One-Out, and bootstrapping.
The project is broken down into the following main chapters:
-
Pre-Processing
- Data parsing, interpolation, denoising, handling NaNs, normalization, and standardization.
-
Mid-Level Processing
- Feature extraction, augmentation, selection, dimensionality reduction, and multi-modal fusion.
-
Data Understanding
- Feature distribution, correlation, feature importance, SHAP, LIME, PCA, t-SNE, and class distribution ratios.
-
Metrics
- A comprehensive breakdown of metrics, including TP, TN, FP, FN, ROC-AUC, precision-recall curves, Cohen's Kappa, MCC, specificity, and more.
-
Classifiers
- Implementation of popular classifiers such as Naive Bayes, K-Nearest Neighbor, SVM, Logistic Regression, Decision Trees, XGBoost, Random Forest, and AdaBoost.
-
Evaluation Plots
- Visual analysis of model performance with violin plots, cumulative gain charts, confusion matrix heatmaps, test accuracy plots, and box plots.
-
Advanced Techniques and Analysis
- Explore advanced validation techniques such as K-Fold Cross-Validation, Leave-One-Out, Bootstrapping, and nested cross-validation.
This project includes detailed slides that cover the concepts, techniques, and findings presented in the code. These slides are available in both PPT and PDF formats, making them easy to use for presentations or further study.
In this chapter, we focus on the essential preprocessing steps required to clean and prepare data for machine learning models. Proper preprocessing ensures that your dataset is structured, consistent, and ready for model training, which can significantly impact performance.
The chapter covers the following key steps:
-
Parsing:
- Reading and extracting relevant data from raw files, such as BVH files, and handling different formats.
-
Data Interpolation/Resizing:
- Filling in missing data points by interpolating values between known data points, ensuring uniformity across the dataset.
-
Denoising:
- Removing noise or outliers from the dataset to improve model accuracy and reduce errors.
-
Handling NaNs (Missing Values):
- Detecting and imputing missing values using various techniques (e.g., mean, median, or KNN imputation).
-
Normalization:
- Scaling features to a specific range (typically [0, 1]) to ensure consistent input values for machine learning models.
- Standardization:
- Transforming features to have a mean of 0 and a standard deviation of 1, which is crucial for algorithms like SVM and neural networks.
- Improves Model Performance: Clean and structured data leads to more accurate predictions and better generalization in machine learning models.
- Handles Missing Data: Properly imputing or interpolating missing data ensures the integrity of the dataset and prevents model bias.
- Reduces Noise: Denoising helps remove irrelevant or erroneous data points, leading to better learning during training.
- Enables Fair Feature Comparisons: Normalization and standardization ensure that features with different scales (e.g., height in meters, weight in kilograms) are comparable, preventing certain features from dominating the model.
This chapter focuses on mid-level processing techniques, which are applied after data has been pre-processed. These techniques further refine the data, extract meaningful features, and reduce the dimensionality of the dataset, allowing machine learning models to train more efficiently and accurately.
Mid-level processing includes the following steps:
-
Data Augmentation:
- Enhance the size and diversity of your dataset by applying transformations like rotation, flipping, scaling, and more. Data augmentation is especially useful in scenarios where data is limited.
-
Feature Extraction:
- Extract meaningful features from raw data to improve the representational power of your models. This includes techniques such as calculating velocity, acceleration, and range of motion for time-series data.
-
Feature Selection:
- Identify and retain the most relevant features using methods such as
SelectKBest
and chi-squared tests. Reducing the number of irrelevant or redundant features enhances model performance and reduces overfitting.
- Identify and retain the most relevant features using methods such as
-
Dimensionality Reduction:
- Reduce the number of features or variables in your dataset while retaining important information. Techniques such as Principal Component Analysis (PCA) or t-SNE help visualize high-dimensional data and speed up the model training process.
-
Multi-Modal Fusion:
- Combine data from different sources (e.g., combining video, motion, and physiological data) to enhance model robustness and accuracy. Multi-modal fusion is performed in early, late, and hybrid stages of processing.
- Early fusion used methods: Concatenation Early Stage KNN Imputation Fusion and PCA Early Stage KNN Imputation Fusion.
- Late fusion used methods: Weighted Averaging Late Stage Fusion and Majority Voting Late Stage Fusion.
- Hybrid fusion used methods: PCA and LDA Hybrid Fusion, Attention-Weighted XGBoost Hybrid Fusion, and CNN and LSTM Hybrid Fusion.
- Improves Model Accuracy: Feature extraction and selection ensure that models focus on the most important aspects of the data.
- Prevents Overfitting: By reducing dimensionality, mid-level processing helps prevent models from learning noise in the data.
- Enhances Training Efficiency: Reducing the number of features or augmenting data can speed up the training process and improve generalization.
- Leverages Multiple Data Sources: Multi-modal fusion allows models to combine different types of data, leading to more informed and accurate predictions.
In this chapter, we focus on data understanding, which is a crucial step in the machine learning pipeline. Before applying models, it’s important to explore and gain insights into the dataset. This chapter provides methods and tools for analyzing the structure and relationships within your data, ensuring that the data is ready for modeling.
Key components of data understanding include feature analysis, correlation studies, and distribution checks, which help you identify patterns, relationships, and potential biases.
- Feature Distribution Over Classes:
- Analyze how features are distributed across different classes in the dataset. This helps identify imbalances or trends that may influence the model's learning process.
- Feature Correlation:
- Examine the relationships between features using correlation matrices. Understanding how features are related can help you eliminate redundant information and improve model efficiency.
- Feature Importance:
- Utilize techniques such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to assess the importance of individual features. Feature importance provides insights into which features contribute the most to model predictions.
-
Class Distribution Ratio:
- Check the distribution of class labels to ensure the dataset is balanced. Imbalanced data can cause bias in model predictions, so understanding class distribution is critical for fair modeling.
-
PCA and t-SNE Feature Plots:
- Use Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of the dataset and visualize high-dimensional data in 2D or 3D. These visualizations provide a deeper understanding of data clustering and separability between classes.
- The Baseline and The Ground Truth:
- Establish a baseline model and compare the model’s predictions against the ground truth. This helps in evaluating the performance of the model and understanding the nature of prediction errors.
- Identifies Key Relationships: By understanding correlations and interactions between features, you can make more informed decisions about which features to use for modeling.
- Detects Data Imbalances: Analyzing class distributions ensures that your model doesn’t become biased toward overrepresented classes.
- Improves Interpretability: Techniques like SHAP and LIME provide explanations for why a model makes certain predictions, improving the transparency of your machine learning models.
- Visualizes Data Patterns: PCA and t-SNE visualizations make it easier to see how well-separated your classes are and where the model might struggle.
Chapter Four dives into the critical aspect of evaluating machine learning models by using various performance metrics. Evaluation metrics are essential for understanding how well your model performs across different datasets and tasks, ensuring it generalizes well to unseen data.
This chapter covers a wide range of metrics for both binary and multiclass classification, providing a comprehensive view of the different aspects of model performance.
- Confusion Matrix:
- A summary table that shows the correct and incorrect predictions across all classes. It provides True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), which are the building blocks for many other metrics.
-
Accuracy:
- The ratio of correctly predicted instances to the total instances. While simple, accuracy can be misleading in imbalanced datasets.
-
Precision:
- Precision calculates the ratio of correctly predicted positive observations to the total predicted positives. It's especially important in cases where false positives are costly.
-
Recall (Sensitivity):
- Recall, also known as sensitivity, measures the ratio of correctly predicted positive observations to all observations in the actual class.
-
F1-Score:
- A harmonic mean of precision and recall. It's useful when you need a balance between precision and recall, especially in cases of class imbalance.
-
ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
- The ROC-AUC score is used to evaluate the performance of classification models at various threshold settings. It provides insight into the model's ability to distinguish between classes.
- Precision-Recall AUC:
- This metric is particularly useful when dealing with imbalanced datasets. It provides a more informative measure of performance when there are many more negative instances than positive ones.
-
Balanced Accuracy:
- A metric that takes the imbalance in classes into account by averaging the recall obtained on each class.
-
Specificity:
- Also known as the True Negative Rate, specificity measures the proportion of actual negatives that were correctly identified by the model.
-
Cohen's Kappa:
- Measures the agreement between two raters who each classify items into mutually exclusive categories. It is useful for understanding how well the model performs compared to random chance.
-
Matthews Correlation Coefficient (MCC):
- A balanced measure used in binary classification that takes into account true and false positives and negatives, providing a robust metric even when classes are imbalanced.
-
Hamming Loss:
- The fraction of labels that are incorrectly predicted. This is particularly useful for multi-label classification tasks.
-
Jaccard Index (Intersection over Union):
- This metric measures the similarity between the predicted set and the ground truth set, particularly useful in multi-label classification.
- Comprehensive Performance Evaluation: Different metrics provide unique insights. For example, precision is important when false positives are costly, while recall matters more when false negatives are the bigger issue.
- Handling Class Imbalance: Metrics like ROC-AUC, Precision-Recall AUC, and Balanced Accuracy help in evaluating models on imbalanced datasets, where traditional metrics like accuracy might be misleading.
- Tailored to Specific Problems: Different tasks (binary classification, multiclass classification, multi-label classification) require different metrics to effectively measure model performance.
In this chapter, we focus on the visual representation of the model evaluation process. Plots and visualizations provide deeper insights into model performance and help interpret complex metrics. By leveraging different types of plots, we can better understand the strengths and weaknesses of machine learning models and make more informed decisions about tuning and improvements.
The chapter covers various types of evaluation plots, each offering a unique way to assess model performance, especially for classification problems.
-
Confusion Matrix Heatmap:
- A visual representation of the confusion matrix, showing the number of correct and incorrect predictions for each class. The heatmap provides a quick overview of how well the model is distinguishing between classes.
-
Violin Plot for Prediction Probabilities:
- Displays the distribution of predicted probabilities across different classes. This plot provides insights into the certainty of predictions and shows how confident the model is in its predictions for each class.
- Residual Plot:
- Used to visualize the difference between the actual and predicted values for classification problems. A residual plot can help identify patterns in prediction errors and assess whether the model is systematically over or under-predicting.
-
Test Accuracy Over Multiple Runs:
- A bar chart displaying the test accuracy across multiple runs of the model. This plot is useful for understanding the variability in model performance over different random splits of the dataset.
-
Precision-Recall Curve:
- A curve that plots precision vs. recall for different threshold settings of the classifier. This plot is especially useful in situations with imbalanced datasets, where accuracy might not be the best measure of model performance.
-
ROC Curve (Receiver Operating Characteristic Curve):
- A plot that shows the trade-off between the true positive rate (sensitivity) and the false positive rate. The area under the curve (AUC) provides a single value that summarizes the model’s ability to distinguish between classes.
-
Cumulative Gain Chart (Lift Chart):
- A plot that helps evaluate the effectiveness of a model by comparing the lift it provides compared to a random model. This chart is useful for assessing how well the model is performing at different thresholds.
- Histogram of Prediction Probabilities:
- A simple histogram showing the distribution of predicted probabilities for each class. This plot helps evaluate how confident the model is across all predictions and provides insight into whether the model is making overly confident or uncertain predictions.
-
Class Distribution Plot:
- A count plot that shows the distribution of different classes in the dataset. This plot helps ensure that the dataset is balanced and provides insight into potential class imbalances that could affect model performance.
-
Box Plot for Prediction Probabilities:
- A box plot that shows the statistical summary of predicted probabilities for each class. It includes median values, quartiles, and potential outliers, providing a clearer view of how predictions are spread across classes.
- Better Interpretability: Visualizing metrics allows us to intuitively understand model performance and identify patterns or biases in predictions.
- Identify Weaknesses: Plots like the confusion matrix heatmap and residual plots can highlight specific areas where the model is underperforming, helping to target specific improvements.
- Assess Model Confidence: Violin plots, box plots, and probability histograms help assess how confident the model is in its predictions, offering insights into whether the model is making overly aggressive or conservative predictions.
- Tune and Improve Models: Plots like the ROC and Precision-Recall curves are particularly useful for adjusting decision thresholds and improving model performance in imbalanced datasets.
In Chapter Six, we explore various classification algorithms, which are the core tools in supervised machine learning for predicting categorical outcomes. Different classifiers come with their own strengths and are suited to different types of problems. This chapter provides an overview of both basic and advanced classifiers, covering a wide range of algorithms, from simple linear models to powerful ensemble methods.
-
Naive Bayes:
- A probabilistic classifier that assumes strong independence between features. Naive Bayes is highly efficient for tasks like text classification and works well even with limited training data.
-
K-Nearest Neighbors (KNN):
- A simple and intuitive classifier that assigns the class of a data point based on the majority class of its nearest neighbors. KNN is effective for small datasets with clear distinctions between classes.
-
Support Vector Machine (SVM):
- A powerful classifier that finds the optimal hyperplane to separate different classes. SVM works well for both linearly and non-linearly separable data, especially in high-dimensional spaces.
-
Logistic Regression:
- A simple yet powerful linear model used for binary classification. Logistic regression estimates the probability of a binary outcome based on input features and is commonly used in problems with a linear decision boundary.
-
Decision Trees:
- A non-parametric method that uses a tree structure to make decisions based on the most significant features. Decision trees are highly interpretable and can handle both classification and regression tasks.
-
Random Forest:
- An ensemble method that builds multiple decision trees and aggregates their results to improve accuracy and reduce overfitting. Random Forest is robust and works well on large datasets with many features.
-
XGBoost:
- A highly efficient and scalable boosting algorithm that has become a favorite in machine learning competitions. XGBoost builds trees sequentially, with each new tree correcting the errors of the previous ones, making it one of the most powerful classifiers for structured data.
-
AdaBoost:
- A boosting algorithm that combines weak learners to form a strong classifier. AdaBoost assigns higher weights to misclassified examples and adjusts them iteratively, improving performance over multiple rounds.
- Model Selection: Choosing the right classifier is crucial for achieving optimal performance. Each classifier has its strengths depending on the nature of the data and the specific problem being solved.
- Scalability: Some algorithms, like Naive Bayes and Logistic Regression, are highly scalable for large datasets, while others like Random Forest and XGBoost are better suited for more complex datasets.
- Trade-off Between Interpretability and Power: Simple models like Decision Trees offer high interpretability, while ensemble methods like XGBoost and Random Forest are more powerful but harder to interpret.
- Handling Different Data Types: Some classifiers handle numerical and categorical data well, while others require feature engineering. Understanding which classifier works best for different types of data is critical for success.
In this final chapter, we explore advanced techniques that enhance model performance, improve generalization, and ensure robust evaluation. These techniques go beyond the basics of machine learning, allowing for deeper insights and more fine-tuned models. By incorporating these methods, you can handle more complex problems and maximize the performance of your models.
-
K-Fold Cross-Validation:
- A resampling method that involves splitting the dataset into k subsets (or folds). The model is trained on k-1 folds and tested on the remaining one. This process is repeated k times, and the average performance across all folds is taken as the final score. K-Fold cross-validation helps in getting a more reliable estimate of model performance, especially with limited data.
-
Leave-One-Out Cross-Validation (LOOCV):
- A specific case of K-Fold cross-validation where k equals the number of data points, meaning the model is trained on all data except one point and tested on that point. This method is particularly useful for small datasets but can be computationally expensive for larger datasets.
-
Bootstrapping:
- A sampling technique used to estimate the accuracy of models by repeatedly sampling from the dataset with replacement. Bootstrapping provides insights into model variability and confidence intervals, helping to better understand model stability.
-
Nested Cross-Validation:
- This technique is used for hyperparameter tuning and model selection. In nested cross-validation, an outer loop is used to evaluate the model, and an inner loop is used for hyperparameter optimization. It ensures that the performance estimates are unbiased and prevents overfitting during model selection.
-
Model Ensembling:
- Combining multiple models to create a stronger, more robust prediction. Techniques like bagging (e.g., Random Forest) and boosting (e.g., AdaBoost, XGBoost) are popular ensembling methods that improve accuracy by reducing variance and bias.
-
Hyperparameter Tuning:
- The process of optimizing model hyperparameters to improve performance. Methods like grid search and random search help in finding the best combination of parameters for classifiers like Random Forest, SVM, and XGBoost, ensuring optimal performance.
-
Early Stopping:
- A regularization technique used during training to stop the learning process when the model performance on a validation set starts to degrade. This helps prevent overfitting and ensures that the model generalizes well to unseen data.
-
Feature Importance and Selection:
- Advanced methods for selecting the most significant features using techniques like Recursive Feature Elimination (RFE), SHAP (SHapley Additive Explanations), and LIME (Local Interpretable Model-agnostic Explanations). These techniques help in reducing dimensionality and improving interpretability while maintaining model performance.
-
Class Imbalance Handling:
- Techniques for handling imbalanced datasets, such as oversampling the minority class (e.g., SMOTE) or undersampling the majority class. These methods ensure that models are not biased toward the majority class and perform well across all classes.
-
Model Interpretability:
- Techniques to interpret model predictions and understand how decisions are made. Tools like SHAP and LIME are used to explain individual predictions, providing insights into the contribution of each feature to the final decision.
- Improved Generalization: Techniques like cross-validation, ensembling, and early stopping help ensure that the model generalizes well to unseen data and avoids overfitting.
- Enhanced Model Performance: Advanced tuning and optimization strategies like hyperparameter tuning and feature selection help achieve higher accuracy and reduce model errors.
- Handling Complex Problems: Methods like nested cross-validation and bootstrapping allow for deeper exploration and evaluation of models, ensuring that they perform well even on complex datasets.
- Dealing with Real-World Challenges: Techniques for handling imbalanced datasets and improving model interpretability are critical when dealing with real-world datasets, where data may be unbalanced or the decision-making process needs to be transparent.
To set up the required Python libraries for this project, follow these steps:
-
Ensure you have Python 3.9 installed on your system.
-
Open your terminal or command prompt and navigate to the directory containing the
Requirements.txt
file. -
Run the following command to install all dependencies:
pip install -r Requirements.txt
To clone and install this project, follow these steps:
-
Clone the Repository:
git clone https://github.com/SeyedMuhammadHosseinMousavi/Comprehensive-Machine-Learning-Techniques-Metrics-Classifiers-and-Evaluation.git
After cloning the repository and installing the required dependencies, you can run the Python scripts to perform various machine learning tasks.
-
Prepare the Dataset:
- Ensure your dataset is placed in the
Small Dataset/
folder. The project assumes that the required dataset files are located in this directory. - The dataset should be in a format supported by the project (e.g., CSV, BVH files). You can modify the script to point to your specific dataset if needed.
- Ensure your dataset is placed in the
-
Run the Python Scripts:
You can execute different Python files for specific tasks.
This project utilizes two key datasets for emotion recognition tasks, combining physiological data and body motion data. Below is a description of each dataset and links to access them.
The VR Eyes & Emotions Dataset (VREED) is used to analyze physiological responses in various emotional states. It includes eye-tracking data collected while participants were exposed to different stimuli, allowing for emotion recognition based on physiological signals.
- Link to Dataset: Kaggle - VR Eyes & Emotions Dataset (VREED)
- Data Content:
- Eye-tracking signals recorded in virtual reality environments.
- Emotion labels corresponding to different emotional states.
The Body Motion Dataset for Emotion Recognition focuses on capturing human motion during emotional states. The dataset contains body motion sequences that can be used to train models to recognize emotions based on movement patterns.
- Link to Dataset: GitHub - Online Motion Style Transfer Repository
- Data Content:
- Motion capture sequences of body movements under different emotional states.
- Labels for various emotions based on motion data.
Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request.
This project is licensed under the Creative Commons Zero v1.0 Universal license (CC0 1.0) – see the LICENSE file for details. You can copy, modify, and distribute this work, even for commercial purposes, all without asking permission.
#MachineLearning #Python #Classification #EmotionRecognition #EvaluationMetrics #DataScience #XGBoost #SVM #RandomForest #DeepLearning #AI #PhysiologicalData #BodyMotionData #FeatureSelection #DataVisualization #USI #KaggleDataset #OpenSource #SeyedMuhammadHosseinMousavi #DataAnalysis #NeuralNetworks #CrossValidation #Boosting #ModelEvaluation