This repo contains some of the coursework I completed during my MSc in Statistics (Data Science) at Imperial College London in 2021-2022. Some modules are excluded, either because they did not include a coding component or because distribution of the course content was not permitted (Deep Learning with TensorFlow).
The course description for each module is included below:
The course covered the following topics:
- The normal linear model (estimation, residuals, residual sum of squares, goodness of fit, hypothesis testing, ANOVA, model comparison).
- Improving designs and explanatory variables (categorical variables and multi-level regression, random and mixed effects models).
- Diagnostics and model selection (outliers, leverage, misfit, exploratory and criterion-based model selection, Box-Cox transformations, weighted regression).
- Generalised linear models (exponential family of distributions, iteratively re-weighted least squares, model selection and diagnostics).
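As a small illustration of the first topic, the sketch below fits a simple (one-predictor) normal linear model by closed-form least squares and computes the residual sum of squares. The data values are made up for illustration; this is a toy sketch, not coursework code from the repo.

```python
# Toy sketch: least-squares estimation for simple linear regression
# y_i = beta0 + beta1 * x_i + eps_i, plus the residual sum of squares.
# Data below are invented for illustration.

def ols_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                      # sum of squares of x
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) # cross-products
    beta1 = sxy / sxx                 # slope estimate
    beta0 = ybar - beta1 * xbar       # intercept estimate
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    return beta0, beta1, rss

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, rss = ols_fit(x, y)   # slope near 2, small RSS
```

The same estimates generalise to multiple predictors via the matrix form of the normal equations, which is what `lm` in R computes.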
The objective of this module was to become comfortable with common Big Data tools, with an emphasis on the use of advanced statistical methods for analysis. The module focused on applying statistical methods on the processing platforms Hadoop and Spark.
The course covers a number of computational methods that are key in modern statistics. Topics include:
- Statistical computing: R programming, data structures, programming constructs, object system, graphics.
- Numerical methods: root finding, numerical integration, optimisation methods such as EM-type algorithms.
- Simulation: generating random variates, Monte Carlo integration.
- Simulation approaches in inference: randomisation and permutation procedures, the bootstrap, Markov chain Monte Carlo.
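Two of the simulation topics above can be sketched in a few lines: Monte Carlo integration (averaging a function of uniform draws) and the nonparametric bootstrap (resampling the data with replacement to approximate the sampling distribution of a statistic). The dataset and sample sizes are invented for illustration.

```python
import random

random.seed(42)

# Monte Carlo integration: estimate the integral of x^2 over [0, 1]
# (true value 1/3) by averaging f(U) over uniform draws U ~ Unif(0, 1).
n = 100_000
mc_estimate = sum(random.random() ** 2 for _ in range(n)) / n

# Nonparametric bootstrap: resample the data with replacement to
# approximate the standard error of the sample mean.
data = [2.3, 1.9, 3.1, 2.8, 2.2, 3.4, 2.6, 1.7, 2.9, 3.0]
boot_means = []
for _ in range(2000):
    resample = [random.choice(data) for _ in range(len(data))]
    boot_means.append(sum(resample) / len(resample))
grand_mean = sum(boot_means) / len(boot_means)
se = (sum((m - grand_mean) ** 2 for m in boot_means)
      / (len(boot_means) - 1)) ** 0.5   # bootstrap standard error
```

The coursework itself uses R for these methods; the Python version above is only meant to show the shape of each algorithm.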
This module covered computing with data, producing reproducible workflows, preparing messy real-world datasets, performing exploratory data analysis, and presenting data via visualisation techniques. It also covered the science in data science: exploring what data analysts really do and thinking critically about appropriate uses and misuses of data science.
The course focused on a variety of useful techniques, including methods for regression, classification, feature extraction, dimensionality reduction, and data clustering. State-of-the-art approaches such as random forests, neural networks, kernel methods, and Gaussian processes were introduced.
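To give a concrete flavour of the clustering topic, here is a toy one-dimensional k-means (not named in the description; it stands in as a simple representative clustering method). The points are made up so that two well-separated groups exist.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Toy 1-D k-means: alternate assigning each point to its nearest
    centre and recomputing each centre as the mean of its cluster."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # initialise from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        # keep the old centre if a cluster happens to be empty
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sorted(centres)

# Two well-separated groups; the centres should land near 1 and 10.
pts = [0.9, 1.1, 1.0, 0.8, 1.2, 9.8, 10.1, 10.0, 9.9, 10.2]
centres = kmeans_1d(pts, 2)
```

In practice the coursework would use library implementations (e.g. `kmeans` in R); writing the loop out makes the assign-then-update structure visible.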
In this module we developed models and tools to understand complex, high-dimensional genetic datasets. This included statistical and machine learning techniques for multiple testing, penalised regression, clustering, p-value combination, and dimension reduction. The module covered both frequentist and Bayesian statistical approaches. We were also introduced to data from genome-wide association and expression studies, next-generation sequencing, and other omics datasets.
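Multiple testing is central to genome-wide studies, where thousands of hypotheses are tested at once. The sketch below implements one standard correction, the Benjamini-Hochberg step-up procedure for controlling the false discovery rate (the description above does not name a specific procedure; this is one common choice, with invented p-values).

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: with p-values sorted as
    p_(1) <= ... <= p_(m), reject the hypotheses for p_(1), ..., p_(k)
    where k is the largest i with p_(i) <= (i / m) * alpha.
    Returns the indices of rejected hypotheses in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Toy p-values: the first three look like real signals.
pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74, 0.88]
rejected = benjamini_hochberg(pvals)   # indices of rejected hypotheses
```

In R the equivalent is `p.adjust(pvals, method = "BH")`; controlling the FDR rather than the family-wise error rate is what makes this procedure usable at genome-wide scale.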