cherifbenham/ml

ML Fundamentals

First things first, you will learn what we mean by Machine Learning, how it differs from regular programming, and the two types of tasks you'll tackle as a data scientist: regression and classification.

We'll then introduce you to Scikit-learn, probably the most popular Machine Learning library for Python, and your best friend during the ML module.

You will then dive into ML's fundamental concepts and be introduced to the key steps of implementing ML algorithms. We want our models to generalise well to unseen data, so we'll need specific techniques to avoid overfitting and to validate the performance of our models.

Following those fundamental guidelines and techniques, you will train your first Machine Learning models with Scikit-learn: linear regression and logistic regression.
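
As a first taste, here is a minimal sketch of what fitting both models looks like in Scikit-learn. The synthetic data and parameters below are purely illustrative, not the course's own exercises:

```python
# A minimal sketch: linear and logistic regression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                    # 200 samples, 3 features
y_reg = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
y_clf = (y_reg > 0).astype(int)                  # binarised target for classification

# Regression: predict a continuous value
X_tr, X_te, y_tr, y_te = train_test_split(X, y_reg, random_state=42)
print(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))    # R² on unseen data

# Classification: predict a discrete class
X_tr, X_te, y_tr, y_te = train_test_split(X, y_clf, random_state=42)
print(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))  # accuracy on unseen data
```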

Data Preparation

This module is dedicated to the preprocessing of your dataset - how to have a clean, balanced dataset that is representative of your problem. You will discover how to deal with missing values, scale your features and encode your data into vectorized forms that can be used later in your models.
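
To give a concrete picture, here is a minimal sketch of those three steps using Scikit-learn transformers; the toy DataFrame is made up for illustration:

```python
# A minimal sketch: imputing, scaling and encoding a toy dataset.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "size": [50.0, None, 80.0, 120.0],           # numerical feature with a missing value
    "city": ["Paris", "Lyon", "Paris", "Nice"],  # categorical feature
})

# 1. Deal with missing values: replace NaN with the column mean
sizes = SimpleImputer(strategy="mean").fit_transform(df[["size"]])

# 2. Scale features to zero mean and unit variance
sizes_scaled = StandardScaler().fit_transform(sizes)

# 3. Encode categories into a vectorized (one-hot) form
# (use sparse=False instead of sparse_output on Scikit-learn < 1.2)
cities_encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])
print(sizes_scaled.ravel(), cities_encoded, sep="\n")
```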

You will realise that the quality of your preprocessing will affect the performance of your models, and that its importance should not be underestimated.

We'll also show you how to enrich your dataset via feature engineering, and how to evaluate the contribution of individual features to the performance of a model as a way to get rid of the "noise".

Performance metrics

This module is split into two sections, Regression metrics and Classification metrics, which differ by nature. We will see how to evaluate the performance of our models precisely, and how to choose the error metric appropriate to the task.

In an applied setting, knowing what to optimise is paramount: it is how you correctly cover the business use case and evaluate the feasibility and utility of a machine learning solution. Metrics like precision, recall, and F1 score are essential for determining business impact.
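
For instance, on a handful of made-up predictions, these three classification metrics can be computed directly with Scikit-learn:

```python
# A minimal sketch: precision, recall and F1 on toy predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of both
```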

Finally, we'll introduce you to your first distance-based algorithm, K-nearest neighbors, a versatile model capable of solving both Classification and Regression tasks.
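
A minimal sketch of what using KNN looks like, on a convenient built-in dataset:

```python
# A minimal sketch: KNN classification evaluated with cross-validation.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)      # vote among the 5 nearest points
print(cross_val_score(knn, X, y, cv=5).mean())
# KNeighborsRegressor shares the same interface: it averages the neighbours' targets instead.
```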

Under the hood

Now that we have covered the fundamentals of Machine Learning, let's get a bit more theoretical. This module is dedicated to understanding the learning mechanism of algorithms: a Solver minimizing a Loss function.

We will cover the famous Gradient Descent - an iterative optimisation solver - in depth, and will introduce other types of solvers. We will then dive into the different Loss functions, their specificities, and how they influence the learning process of algorithms.

Understanding the optimization process will give you more control over the design of your models, so you can get the best possible performance on the specific problem you want to solve.
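
To make the mechanism concrete, here is a minimal NumPy sketch of gradient descent minimising an MSE loss for a one-parameter linear model; the data and learning rate are illustrative:

```python
# A minimal sketch: gradient descent on the MSE loss of y ≈ w * x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 3.0

w, learning_rate = 0.0, 0.1
for step in range(100):
    grad = -2 * np.mean(x * (y - w * x))   # dL/dw for L = mean((y - w*x)²)
    w -= learning_rate * grad              # take a step against the gradient
print(w)  # converges close to 3.0
```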

Model Tuning

To conclude the first week of Machine Learning, you will learn to fine-tune your models and push them to their limits.

First, you'll be introduced to the concept of Regularization, a technique used to combat overfitting. We'll then speak about Grid and Random Search, two model hyperparameter optimization techniques.
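
As an illustration of both ideas at once, here is a minimal sketch tuning the regularization strength of a Ridge regression with GridSearchCV; the parameter grid is just an example:

```python
# A minimal sketch: L2-regularised regression (Ridge) tuned by grid search.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # regularisation strength candidates
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
# RandomizedSearchCV has the same interface but samples the grid at random.
```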

Finally, we will see another powerful supervised learning method: the Support Vector Machine (SVM).
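
A minimal sketch of an SVM classifier in Scikit-learn, on a built-in dataset chosen purely for convenience:

```python
# A minimal sketch: SVM classification with an RBF kernel.
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
svm = SVC(kernel="rbf", C=1.0)   # C trades off margin width against training errors
print(cross_val_score(svm, X, y, cv=5).mean())
```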

Workflow

This module is dedicated to Pipelines, the transition from notebooks to production.

A pipeline is a chain of the operations of a Machine Learning project (preprocessing, training, predicting, etc.) combined into a single object.

Pipelines allow you to structure your code, make your workflow easy to read and understand, enforce the implementation and order of steps in your ML project, and make your work reproducible and deployable.
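
Here is a minimal sketch of such an object, chaining a scaler and a model; the steps chosen are illustrative:

```python
# A minimal sketch: a two-step Scikit-learn Pipeline trained as one object.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),                   # step 1: preprocessing
    ("model", LogisticRegression(max_iter=1000)),   # step 2: estimator
])
pipe.fit(X_tr, y_tr)          # the whole chain fits, in order, as a single object
print(pipe.score(X_te, y_te))
```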

Ensemble methods

Today, we cover the final family of supervised learning models: ensemble methods. Ensemble methods combine multiple weak models into a single powerful one via techniques called Bagging, Boosting, and Stacking.

We will start by introducing the Decision Tree, a simple ML algorithm, and will show you how it can be enhanced by Bagging and Boosting.

You will also get to know the famous Gradient Boosting algorithm, one of the most powerful models out there, often winning Kaggle-type competitions.
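
As an illustration, here is a minimal Scikit-learn sketch comparing a single Decision Tree with a Bagging-style ensemble (Random Forest) and a Boosting one (Gradient Boosting); the dataset is just a convenient built-in:

```python
# A minimal sketch: one tree versus its Bagging and Boosting ensembles.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for model in (
    DecisionTreeClassifier(random_state=0),        # one weak learner
    RandomForestClassifier(random_state=0),        # Bagging: many trees on bootstrapped samples
    GradientBoostingClassifier(random_state=0),    # Boosting: each tree fits the previous errors
):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```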

Unsupervised learning

Let's dive into the other dimension of ML, Unsupervised learning. We will cover algorithms that are used for exploratory analysis, dimensionality reduction and other compression-type tasks.

We'll see Principal Component Analysis (PCA), a powerful dimensionality reduction tool which will help you deal with very large datasets.
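
A minimal sketch of PCA compressing a 64-dimensional dataset down to 2 components:

```python
# A minimal sketch: PCA projecting 64 features onto 2 principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 1797 images of 8x8 = 64 pixels each
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)              # project onto the 2 main axes of variance
print(X_2d.shape)                        # (1797, 2)
print(pca.explained_variance_ratio_)     # share of variance each component keeps
```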

We will then dive into the K-means algorithm, a clustering method used to discover coherent sub-groups within a dataset without supervision.
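
A minimal sketch of K-means recovering 3 clusters from unlabelled points:

```python
# A minimal sketch: K-means clustering on synthetic unlabelled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels are discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per discovered sub-group
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
```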

This day is also an opportunity to discover major machine learning applications by building a recommender system and an image compression program.

Time Series

Time to dive into one of the specific domains of Machine Learning: Time Series.

First, you will be introduced to the fundamental concepts of Time series: Decomposition, Stationarity, and Autocorrelation.

We will then show you three classic time series models: ARMA, ARIMA, and SARIMAX.
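
For a taste of what fitting such a model looks like, here is a minimal sketch using statsmodels; whether the course itself uses this library is an assumption, and the synthetic series is illustrative only:

```python
# A minimal sketch: ARIMA(p, d, q) with p autoregressive lags,
# d differencing steps and q moving-average lags.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # a synthetic random-walk series

model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))             # forecast the next 5 points
# SARIMAX adds a seasonal (P, D, Q, s) order and exogenous regressors on top.
```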

NLP

This day is dedicated to one of the most common types of data out there: text.

The first part of the lecture is dedicated to text preprocessing and representation techniques. Text differs by nature from numerical data, and needs to be preprocessed and represented in a specific way for algorithms to be able to interpret it.

You will discover NLTK, a popular NLP library with powerful tools that make it easier to process text data.
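
Here is a minimal NLTK sketch of a classic preprocessing chain (tokenise, clean, remove stopwords, lemmatise); the exact steps a given course exercise uses may differ:

```python
# A minimal sketch: classic text preprocessing with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required resources
# (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "The cats were sitting on the mats."
tokens = nltk.word_tokenize(text.lower())                 # split into word tokens
tokens = [t for t in tokens if t.isalpha()]               # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]  # drop stopwords
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]          # "cats" -> "cat"
print(lemmas)
```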

We will then break down the Naive Bayes algorithm, a probabilistic model that has proved successful for text classification tasks such as authorship attribution.
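
A minimal sketch of Naive Bayes text classification with Scikit-learn; the tiny corpus and labels below are made up to mimic an authorship task:

```python
# A minimal sketch: bag-of-words + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "to be or not to be",
    "the laws of motion",
    "all the world's a stage",
    "force equals mass times acceleration",
]
authors = ["shakespeare", "newton", "shakespeare", "newton"]

# P(author | words) via Bayes' rule, assuming word counts are independent given the author
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, authors)
print(model.predict(["the stage of the world"]))
```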

Finally, we will also tackle word embeddings and document clustering with the LDA algorithm, which is used to explore large corpora of unstructured text documents.
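
A minimal sketch of LDA topic modelling with Scikit-learn's implementation; the four toy documents are invented for illustration:

```python
# A minimal sketch: Latent Dirichlet Allocation over a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the stock market fell as investors sold shares",
    "the team scored a goal in the final match",
    "shares rallied after strong market earnings",
    "the player was injured during the match",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))   # per-document topic mixture (each row sums to 1)
```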
