This repository contains notebooks, data, and slides for the survey training on generalized machine learning and distributed computing, held between September 14, 2018 and September 28, 2018. During this three-day course, we will cover the following topics:
Day One:
- ML Review: Generalized ML and Spatial Learning, Bias/Variance Tradeoff, Model Selection Triple
- Regularized Regression: LASSO vs Ridge; ElasticNet and more
- Clustering: Partitive vs Agglomerative Clustering; clustering evaluation methods, visualization
- Classification I: Instance and Inductive Models (kNN, Decision Trees, Ensembles of Trees)
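
As a rough illustration of the LASSO vs Ridge contrast listed for Day One, the sketch below compares how the two penalties shrink coefficients. It is not taken from the course notebooks. It assumes an orthonormal design (X^T X = I), where both estimators have a closed form in terms of the ordinary least squares coefficients; the coefficient values and the penalty strength `lam` are made up for the example.

```python
# Assumption: orthonormal design (X^T X = I), so each coefficient can be
# shrunk independently starting from its ordinary least squares estimate.
# LASSO objective: 0.5 * ||y - Xb||^2 + lam * ||b||_1
# Ridge objective: 0.5 * ||y - Xb||^2 + 0.5 * lam * ||b||^2


def soft_threshold(b, lam):
    """LASSO solution for one coefficient: move toward zero by lam, clip at zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Ridge solution for one coefficient: rescale by 1 / (1 + lam)."""
    return b / (1.0 + lam)


# Hypothetical OLS coefficients and penalty strength, chosen for illustration.
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

print("OLS:  ", ols_coefs)
print("LASSO:", lasso_coefs)
print("Ridge:", ridge_coefs)
```

Under these assumptions LASSO sets the two small coefficients exactly to zero, while Ridge shrinks every coefficient proportionally and leaves none at zero, which is the usual argument for LASSO as a feature selection tool.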
Day Two:
- Classification II: Parametric Models: SVMs, Bayesian Models, Logistic Regression
- Dimensionality Reduction and Manifolds: PCA, SVD, t-SNE, Isomap
- Neural Networks I: Multi-Layer Perceptrons
- Neural Networks II: Deep Learning and TensorFlow
Day Three:
- Introduction to Spark: RDDs and Architecture
- Programming Spark: Interactive Analysis and Distributed Jobs
- Using Spark for data analysis: Spark SQL and Spark DataFrames
- Spark for distributed ML: Spark MLlib
Notes:
- the class has experience with Logistic Regression and ANNs
- their background is mostly math and stats, not computational
- don't rely on Python or coding knowledge; do exercises as live demos
- focus on feature analysis and hyperparameter tuning
- visual analysis with Yellowbrick (YB) is a big help!
- for distributed computing, focus on high level computing issues, not mechanisms
- no need for a cluster or workshops on the distributed computing day
Other Notes:
- Classification Metrics II to follow (ROC/AUC, DecisionThreshold, PR Curves, Class Balance issues)
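
For the ROC/AUC item above, a minimal sketch of what the AUC measures may help: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical, and the pairwise loop is written for clarity rather than speed; it is not the course's implementation.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and classifier scores, made up for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

Because the value depends only on how the scores rank the examples, it does not change when the decision threshold moves, which is why threshold selection and class balance are treated as separate topics.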