Author: Carl McBride Ellis (LinkedIn)
The following is a selection of my kaggle notebooks:
- Anscombe's quartet and the importance of EDA (+ dataset)
- Absolute beginners Titanic 'EDA' using dabl
- Exploratory data analysis using pandas pivot table
- Pearson correlation coefficient, mutual information (MI) and Predictive Power Score (PPS) - a simple comparison
- Use case example: Jane Street: EDA of day 0 and feature importance
- Use case example: Riiid: EDA and feature importance
- Use case example: Ventilator Pressure: EDA and simple submission
- Filtering outliers using the Isolation Forest
- Data anonymization using Faker (Titanic example)
- AWS PyDeequ unit tests to measure data quality
- Naïve Dataset Distillation
This is a collection of my Python example scripts for either classification, using the Titanic: Machine Learning from Disaster competition data, or regression, using the House Prices: Advanced Regression Techniques competition data:
algorithm | classification | regression |
---|---|---|
Logistic regression | link | --- |
Generalized Additive Models (GAM) | link | --- |
Iterative Dichotomiser 3 (ID3) | link | --- |
Decision tree | link | --- |
Regularized Greedy Forest (RGF) | link | link |
XGBoost | --- | link |
TabNet | link | link |
Neural networks (using keras) | link | link |
Gaussian process | link | link |
Hyperparameter grid search | link | link |
- Classification using TensorFlow Decision Forests
- Titanic in pure H2O.ai
- Predict house prices using H2O.ai (regression)
- Automatic tuning of XGBoost with XGBTune
- MNIST with no neural network
- PyTorch Tabular: Gated Additive Tree Ensemble
- Regression prediction intervals with MAPIE
- Prediction intervals: Quantile Regression Forests
- Locally-weighted conformal regression
- Classifier probability calibration using Venn-ABERS predictors (winner of the Gabriel Preda "Best recent Kaggle Notebook" prize)
- Feature importance using the LASSO
- Feature selection using the Boruta-SHAP package
- Recursive Feature Elimination (RFE) example
- House Prices: Permutation Importance example
- SHAP Permutation explainer + random "probe"
- What is Adversarial Validation?
- Jane Street: t-SNE using RAPIDS cuML
- Synthanic feature engineering: Beware!
- Time series: A simple moving average (MA) model
- Time series decomposition: Naive example
- LSTM time series prediction: sine wave example
- LSTM time series + stock price prediction = FAIL
- Interrupted time series analysis: Causal Impact
- Temporal Convolutional Network using Keras-TCN
- Plotting OHLC and V ticker data using mplfinance
- Correlograms of 14 cryptocurrencies (1 day)
- Granger causality testing for 1 day along with Granger causality Part II: The Movie
- Store Sales: Day of the Week model
- TPS Jan 2022: A simple average model (no ML)
- TPS Jan 2022: Prophet + holidays + GDP regressor
- Multivariable time series forecasting: Linear tree
- Probabilistic forecasting using GluonTS: Bitcoin
- [PFI Starter] Skforecast example - starter notebook provided for the "Probabilistic forecasting I: Temperature" competition
- Ensemble methods: majority voting example
- ML-Ensemble example using House Prices data
- Stacking ensemble using the House Prices data
- Ensembling: Convex combination of model predictions (CCMP) and the `hillclimbers` package
- Explainability, collinearity and the variance inflation factor (VIF)
- KISS: Small and simple Titanic models
- House Prices: my score using only 'OverallQual' and also a simple two variable model
- Titanic explainability: Why me? asks Miss Doyle
- TabNet and interpretability: Jane Street example
- GPU accelerated SHAP values: Jane Street example
- Animated histogram of the central limit theorem
- Hypothesis testing: The two sample t-test, p-value and power
- Beautiful math in your notebook: a guide to using $\LaTeX$ math markup in kaggle notebooks.
- Titanic: In all the confusion... which looks at the confusion matrix, ROC curves, $F_1$ scores etc.
- Classification: How imbalanced is "imbalanced"? - (mentioned in "Notebooks of the week: Hidden Gems")
- Overfitting and underfitting the Titanic
- False positives, false negatives and the discrimination threshold
- Introduction to the Regularized Greedy Forest (using rgf_python)
- Extrapolation: Do not stray out of the forest!
- Titanic: some sex, a bit of class, and a tree...
- The Lehmer RNG algorithm for `seed=42`
- Titanic leaderboard: a score > 0.8 is great!
- House Prices: How to work offline (+ dataset)
- Pandas one-liners
- The latest trends in data science
- The Titanic using SQL
- Some pretty t-SNE plots
- Encuesta kaggle 2021: ¿España es diferente?
- How much do people on kaggle earn by country (2021)
- All in a pickle: Saving the Titanic - Saving our machine learning model to a file using pickle
- Machine learning review papers on arXiv [polars]
Geospatial analysis
- Choropleth map of kaggle Grandmaster locations
- Smartphone 2022: A look at the ground truth maps
- Animating the path of a smartphone GPS signal
Finance related
- S&P 500 daily returns: Normal and Cauchy fits
- Bitcoin candlestick chart (2021)
- Irrational Exuberance? S&P vs Bitcoin
The Meta Kaggle dataset consists of data about the kaggle site itself:
- Kaggle in numbers - updated almost daily
- Simple EDA of kaggle Grandmasters - updated almost daily
- Number of active Kaggle users
- Notebooks: Number of views, and days, per vote
- kaggle discussions: busiest time of the day? - (mentioned in "Notebooks of the week: Hidden Gems")
- The kaggle working week
- WordCloud of gold medal winning notebook titles
- Shakeup interactive scatterplot maker
- Shakeup scatterplots: Boxes, strings and things...
- When will my notebook get its medal?