diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 7c866fdb..34371413 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -33,17 +33,16 @@ repos:
types: [python]
- repo: local
hooks:
- - id: ruff
- name: 'ruff: Check for errors, styling issues and complexity'
+ - id: ruff-check
+ name: 'Ruff: Check for errors, styling issues and complexity, and fixes issues if possible (including import order)'
entry: ruff
language: system
- - repo: local
- hooks:
- - id: isort
- name: 'isort: Sort file imports'
- entry: isort
+ args: [ --fix, --no-cache ]
+ - id: ruff-format
+ name: 'Ruff: format code in line with PEP8'
+ entry: ruff format
language: system
- types: [python]
+ args: [ --no-cache ]
- repo: local
hooks:
- id: codespell
@@ -57,12 +56,4 @@ repos:
hooks:
- id: pyupgrade
name: 'pyupgrade: Updates code to Python 3.8+ code convention'
- args: [*py_version]
- - repo: local
- hooks:
- - id: black
- name: 'black: PEP8 compliant code formatter'
- entry: black
- language: python
- types: [python]
- language_version: python3
\ No newline at end of file
+ args: [*py_version]
\ No newline at end of file
diff --git a/CHANGELOG.md b/CHANGELOG.md
deleted file mode 100644
index 5a07bfa9..00000000
--- a/CHANGELOG.md
+++ /dev/null
@@ -1,216 +0,0 @@
-# Changelog
-
-All notable changes to this project will be documented in this file.
-
-The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
-and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
-
-## [2.1.1] - 2023-09
-Improvements in this release:
-- Update SHAP version to the latest #228
-
-## [2.1.0] - 2023-07
-Improvements in this release:
-- Make ShapRFECV return matplotfigure (instead of axis) #222
-- Add option for penalty on shap calculation to distinguish features with similar shap performance # 213
-- Implement automatic feature selection #220
-
-## [2.0.1] - 2023-06
-Improvements in this release:
-- Update pre-commit hooks & add validation for jupyter notebooks # 213
-- Fix the docs deployment #211
-
-## [2.0.0] - 2023-06
-Improvements in this release:
-- Drop explicit support for python 3.7, add support for 3.11 #206, #203, #185
-- Activate and add pre-commit hooks (isort, codespell) #205, #206
-- Add support for groups in SHAP RFECV #182
-- Bug fix: SHAP RFECV now produces reproducible results every time (this breaks backwards compatibility) #197
-- Bug fix: Updated GitHub actions, fixed deprecations #199
-- Bug fix: Remove most of the unreliable warning assertion checks #207
-
-## [1.8.9] - 2022-04-08
-Improvements in this release:
-- Drop explicit support for python 3.6, add 3.10 #177
-- Bug fix: define shap mask based on rows, instead of columns #178
-- Bug fixes in unit tests #180
-- Improve support for categorical features in shap calculations #184
-
-## [1.8.8] - 2021-12-08
-Improvements in this release:
-- Added support for XGBoost and Catboost models in ShapRFECV #175
-
-## [1.8.7] - 2021-10-28
-Improvements in this release:
-- Added support for early stopping in new lightgbm version #164
-
-## [1.8.6] - 2021-10-05
-Improvements in this release:
-- Added alpha parameter to DependencePlotter #162
-
-## [1.8.5] - 2021-08-24
-Improvements in this release:
-- Docs and docstrings improvements for stats tests #158
-
-## [1.8.4] - 2021-06-16
-Improvements in this release:
-- Fix the bug in the Shap Dependence Plot #153
-- Add HowTo guide for using grouped data #154
-
-## [1.8.3] - 2021-06-15
-Improvements in this release:
-- Fix p-value calculation in PSI #142
-
-## [1.8.2] - 2021-05-04
-Improvements in this release:
-- Fix catboost bug when calculating SHAP values #147
-- Supply eval_sample_weight for fit in EarlyStoppingShapRFECV #144
-- Remove codecov.io #145
-- Remove sample_row from probatus #140
-
-## [1.8.1] - 2021-04-18
-Improvements in this release:
-- Enable use of sample_weight in ShapRFECV and EarlyStoppingShapRFECV #139
-- Fix bug in EarlyStoppingShapRFECV #139
-- Fix issue with categorical features in SHAP #138
-- Missing values handled by AutoDist #126
-- Fix issue with missing histogram in DependencePlot #137
-
-## [1.8.0] - 2021-04-14
-Improvements in this release:
-- Implemented EarlyStoppingShapRFECV #108
-- Added support for Python 3.9 #132
-
-## [1.7.1] - 2021-04-13
-Improvements in this release:
-- Add error if model pipeline passed to SHAP #129
-- Fixed PSI bug with empty bins #116
-- Unit tests are run daily #113
-- TreeBucketer has been refactored #124
-- Fixes to failing test pipeline #120
-- Improving language in docs #109, #107
-
-## [1.7.0] - 2021-03-16
-Improvements in this release:
-- Create a comparison of imputation strategies #86
-- Added support for passing check_additivity argument #103
-- Range of code styling issues fixed, based on precommit config #100
-- Renamed TreeDependencePlotter to DependencePlotter and exposed the docs #94
-- Enable instalation of extra dependencies #97
-- Added how to notebook to ensure reproducibility #99
-- Description of vision of probatus #91
-
-## [1.6.2] - 2021-03-10
-Improvements in this release:
-- Bugfix, allow passing kwargs to dependence plot in ShapModelInterpreter #90
-
-## [1.6.1] - 2021-03-09
-Improvements in this release:
-- Added ShapRFECV support for all sklearn compatible search CVs. #76 #49
-
-## [1.6.0] - 2021-03-01
-Improvements in this release:
-- Added features list to README #53
-- Added docs for sample row functionality #54
-- Added 'open in colab' badges to tutorial notebooks #56
-- Deploy documentation on release #47
-- Added columns_to_keep for shap feature elimination #63
-- Updated docs for usage of columns to keep functionality in SHAPRFECV #66
-- Added shap support for linear models #69
-- Installed probatus in colab notebooks #80
-- Minor infrastructure tweaks #81
-
-## [1.5.1] - 2020-12-04
-
-Various improvements to the consistency and usability of the package
-- Unit test docstring and notebooks #41
-- Unified scoring metric within probatus #27
-- Improve docstrings consistency documentation #25
-- Implemented unified interface #24
-- Added images to API docs documentation #23
-- Added verbose parameter to ShapRFECV #21
-- Make API more consistent #19
- - Set model parameter name to clf across probatus
- - Set default random_state to None
- - Ensure that verbose is used consistently in probatus
- - Unify parameter class_names for classes in which it is relevant
- - Add return scores parameter to compute wherever applicable
-- Add sample row functionality to utils #17
-- Make an experiment comparing sklearn.RFECV with ShapRFECV #16
-- ShapModelInterpreter calculate train set feature importance #13
-
-## [1.5.0] - 2020-11-18
-- Improve SHAP RFECV API and documentation
-
-## [1.4.4] - 2020-11-11
-- Fix issue with the distribution uploaded to pypi
-
-## [1.4.0] - 2020-11-10 (Broken)
-- Add SHAP RFECV for features elimination
-
-## [1.3.0] - 2020-11-05 (Broken)
-- Add SHAP Model Inspector with docs and tests
-
-## [1.2.0] - 2020-09-30
-- Add resemblance model, with SHAP based importance
-- Improve the docs for resemblance model
-- Refactor stats tests, improve docs and expose functionality to users
-
-## [1.1.1] - 2020-09-08
-- Improve Tree Bucketer, enable user to pass own tree object
-
-## [1.1.0] - 2020-08-24
-- Improve docs for stats_tests
-- Refactor stats_tests
-
-## [1.0.1] - 2020-08-07
-- TreeBucketer, which bins the data based on the target distribution, using Decision Trees fitted on a single feature
-- PSI calculation includes the p-values calculation
-
-## [1.0.0] - 2020-02-24
-- metric_volatility and sample_similarity rebuilt
-- New documentation
-- Faster tests
-- Improved and simplified API
-- Scorer class added to the package
-- Removed data from repository
-- Hiding unfinished functionality from the user
-
-## [0.1.3] - 2020-02-24
-
-### Added
-
-- VolalityEstimation now has random_seed argument
-
-### Changed
-
-- Improved unit testing
-- Improved documentation README and CONTRIBUTING
-
-### Fixed
-
-- Added dependency on scipy 1.4+
-
-## [0.1.2] - 2019-10-29
-### Added
-
-- Readthedocs documentation website
-
-## [0.1.1] - 2019-10-09
-
-### Added
-
-- Added CHANGELOG.md
-
-### Changed
-
-- Renamed to probatus
-- Improved testing by adding pyflakes to CI
-- probatus.metric_uncertainty.VolatilityEstimation is now deterministic, added random_state parameter
-
-## [0.1.0] - 2019-09-21
-
-Initial release, commit ecbd0d08a6eea370afda4a4790edeb4ee382995c
-
-[Unreleased]: https://gitlab.com/ing_rpaa/probatus/compare/ecbd0d08a6eea370afda4a4790edeb4ee382995c...master
-[0.1.0]: https://gitlab.com/ing_rpaa/probatus/commit/ecbd0d08a6eea370afda4a4790edeb4ee382995c
diff --git a/README.md b/README.md
index bd4d37ff..49c70ef9 100644
--- a/README.md
+++ b/README.md
@@ -13,10 +13,8 @@
**Probatus** is a python package that helps validate binary classification models and the data used to develop them. Main features:
- [probatus.interpret](https://ing-bank.github.io/probatus/api/model_interpret.html) provides shap-based model interpretation tools
-- [probatus.metric_volatility](https://ing-bank.github.io/probatus/api/metric_volatility.html) provides tools using bootstrapping and/or different random seeds to assess metric volatility/stability.
- [probatus.sample_similarity](https://ing-bank.github.io/probatus/api/sample_similarity.html) to compare two datasets using resemblance modelling, f.e. `train` with out-of-time `test`.
- [probatus.feature_elimination.ShapRFECV](https://ing-bank.github.io/probatus/api/feature_elimination.html) provides cross-validated Recursive Feature Elimination using shap feature importance.
-- [probatus.missing_values](https://ing-bank.github.io/probatus/api/imputation_selector.html) compares performance gains of different missing values imputation strategies for a given model.
## Installation
diff --git a/VISION.md b/VISION.md
index bb30b6cf..253b6ac9 100644
--- a/VISION.md
+++ b/VISION.md
@@ -28,5 +28,4 @@ The main principles that drive development of `Probatus` are the following
## The Roadmap
-The following [issue](https://github.com/ing-bank/Probatus/issues/93) keeps track of the features coming to Probatus.
We are open to new ideas, so if you can think of a feature that fits the vision, make an [issue](https://github.com/ing-bank/Probatus/issues) and help us further develop this package.
\ No newline at end of file
diff --git a/docs/api/imputation_selector.md b/docs/api/imputation_selector.md
deleted file mode 100644
index d4fc675f..00000000
--- a/docs/api/imputation_selector.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# Imputation Selector
-
-This module allows us to select imputation strategies.
-
-
-::: probatus.missing_values.imputation
diff --git a/docs/api/metric_volatility.md b/docs/api/metric_volatility.md
deleted file mode 100644
index 2da631ec..00000000
--- a/docs/api/metric_volatility.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Metric Volatility
-
-The aim of this module is the analysis of how well a model performs on a given dataset, and how stable the performance is.
-
-The following features are implemented:
-
-- [TrainTestVolatility][probatus.metric_volatility.volatility.TrainTestVolatility]: Estimation of the volatility of metrics. The estimation is done by splitting the data into train and test multiple times and training and scoring a model based on these metrics.
-- [SplitSeedVolatility][probatus.metric_volatility.volatility.SplitSeedVolatility]: Estimates the volatility of metrics based on splitting the data into train and test sets multiple times randomly, each time with a different seed.
-- [BootstrappedVolatility][probatus.metric_volatility.volatility.BootstrappedVolatility]: Estimates the volatility of metrics based on splitting the data into train and test with static seed, and bootstrapping the train and test set.
-
-
-::: probatus.metric_volatility.volatility
\ No newline at end of file
diff --git a/docs/api/stat_tests.md b/docs/api/stat_tests.md
deleted file mode 100644
index 55f5ba26..00000000
--- a/docs/api/stat_tests.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Statistical Tests
-
-This module allows us to apply different statistical tests.
-
-::: probatus.stat_tests.distribution_statistics
-
-## Available tests
-- [Anderson-Darling (ad)][probatus.stat_tests.ad.ad]
-- [Epps-Singleton (es)][probatus.stat_tests.es.es]
-- [Kolmogorov-Smirnov (ks)][probatus.stat_tests.ks.ks]
-- [Population Stability Index (psi)][probatus.stat_tests.psi.psi]
-- [Shapiro-Wilk (sw)][probatus.stat_tests.sw.sw]
-
-::: probatus.stat_tests.ad
-::: probatus.stat_tests.es
-::: probatus.stat_tests.ks
-::: probatus.stat_tests.psi
-::: probatus.stat_tests.sw
diff --git a/docs/img/KS2_Example.png b/docs/img/KS2_Example.png
deleted file mode 100644
index a1d64c14..00000000
Binary files a/docs/img/KS2_Example.png and /dev/null differ
diff --git a/docs/img/autodist.png b/docs/img/autodist.png
deleted file mode 100644
index b3fc3896..00000000
Binary files a/docs/img/autodist.png and /dev/null differ
diff --git a/docs/img/imputation_comparison.png b/docs/img/imputation_comparison.png
deleted file mode 100644
index 6050aa6e..00000000
Binary files a/docs/img/imputation_comparison.png and /dev/null differ
diff --git a/docs/img/metric_volatility_bootstrapped.png b/docs/img/metric_volatility_bootstrapped.png
deleted file mode 100644
index 947f0c4a..00000000
Binary files a/docs/img/metric_volatility_bootstrapped.png and /dev/null differ
diff --git a/docs/img/metric_volatility_split_seed.png b/docs/img/metric_volatility_split_seed.png
deleted file mode 100644
index 75443b23..00000000
Binary files a/docs/img/metric_volatility_split_seed.png and /dev/null differ
diff --git a/docs/img/metric_volatility_train_test.png b/docs/img/metric_volatility_train_test.png
deleted file mode 100644
index d780cb9a..00000000
Binary files a/docs/img/metric_volatility_train_test.png and /dev/null differ
diff --git a/docs/index.md b/docs/index.md
index ffb58f2e..c3ed477b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,5 +1,3 @@
-# Welcome to probatus documentation!
-
**Probatus** is a Python library that allows to analyse binary classification models as well as the data used to develop them.
diff --git a/docs/tutorials/nb_binning.ipynb b/docs/tutorials/nb_binning.ipynb
deleted file mode 100644
index 41f4acb7..00000000
--- a/docs/tutorials/nb_binning.ipynb
+++ /dev/null
@@ -1,642 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Binning\n",
- "\n",
- "[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ing-bank/probatus/blob/master/docs/tutorials/nb_binning.ipynb)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip install probatus"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "%config Completer.use_jedi = False\n",
- "%load_ext autoreload\n",
- "%autoreload 2\n",
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "pd.set_option(\"display.max_columns\", 100)\n",
- "pd.set_option(\"display.max_row\", 500)\n",
- "pd.set_option(\"display.max_colwidth\", 200)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This notebook explains how the various implemented binning strategies of `probatus` work. \n",
- "First, we import all binning strategies:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "from probatus.binning import AgglomerativeBucketer, QuantileBucketer, SimpleBucketer, TreeBucketer"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's create some data on which we want to apply the binning strategies. We choose a logistic function because it clearly supports the explanation on how binning strategies work. Moreover, the typical reliability curve for a trained random forest model has this shape and binning strategies could be used for probability calibration (see also the website of Scikit-learn on [probability calibration](https://scikit-learn.org/stable/modules/calibration.html))."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "def log_function(x):\n",
- " return 1 / (1 + np.exp(-x))\n",
- "\n",
- "\n",
- "x = [log_function(x) for x in np.arange(-10, 10, 0.01)]\n",
- "\n",
- "plt.plot(x);"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Simple binning"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `SimpleBucketer` object creates binning of the values of `x` into equally sized bins. The attributes `counts`, the number of elements per bin, and `boundaries`, the actual boundaries that resulted from the binning strategy, are assigned to the object instance. In this example we choose to get 4 bins:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "counts [891 109 110 890]\n",
- "boundaries [4.53978687e-05 2.50022585e-01 4.99999772e-01 7.49976959e-01\n",
- " 9.99954146e-01]\n"
- ]
- }
- ],
- "source": [
- "mySimpleBucketer = SimpleBucketer(bin_count=4)\n",
- "mySimpleBucketer.fit(x)\n",
- "print(\"counts\", mySimpleBucketer.counts_)\n",
- "print(\"boundaries\", mySimpleBucketer.boundaries_)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "df = pd.DataFrame({\"x\": x})\n",
- "df[\"label\"] = pd.cut(x, bins=mySimpleBucketer.boundaries_, include_lowest=True)\n",
- "\n",
- "fig, ax = plt.subplots()\n",
- "for label in df.label.unique():\n",
- " df[df.label == label].plot(ax=ax, y=\"x\", legend=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As can be seen, the number of elements in the tails of the data is larger than in the middle:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "counts_agglomerative_quantile, boundaries_agglomerative_quantile = QuantileBucketer.quantile_bins(x_agglomerative, 2)\n",
- "\n",
- "df = pd.DataFrame({\"x\": x_agglomerative})\n",
- "df[\"label\"] = pd.cut(x_agglomerative, bins=boundaries_agglomerative_quantile, include_lowest=True)\n",
- "\n",
- "fig, ax = plt.subplots()\n",
- "for label in df.label.unique():\n",
- " df[df.label == label].hist(ax=ax)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Binning with Decision Trees\n",
- "\n",
- "Binning with decision trees leverages the information of a binary feature or the binary target in order to create buckets that have a significantly different proportion of the binary feature/target. \n",
- "\n",
- "It works by fitting a tree on 1 feature only. \n",
- "It leverages the properties of the split finder algorithm in the decision tree. The splits are done to maximize the gini/entropy. \n",
- "The leaves approximate the optimal bins.\n",
- "\n",
- "The example below shows a distribution defined by a step function"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "def make_step_function(x):\n",
- " if x < 4:\n",
- " return 0.001\n",
- " elif x < 6:\n",
- " return 0.3\n",
- " elif x < 8:\n",
- " return 0.5\n",
- " elif x < 9:\n",
- " return 0.95\n",
- " else:\n",
- " return 0.9999\n",
- "\n",
- "\n",
- "x = np.arange(0, 10, 0.001)\n",
- "probs = [make_step_function(x_) for x_ in x]\n",
- "\n",
- "y = np.array([1 if np.random.rand() < prob else 0 for prob in probs])\n",
- "\n",
- "fig, ax = plt.subplots()\n",
- "ax2 = ax.twinx()\n",
- "\n",
- "ax.hist(x[y == 0], alpha=0.15)\n",
- "ax.hist(x[y == 1], alpha=0.15)\n",
- "ax2.plot(x, probs, color=\"black\")\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The light blue histogram indicates the distribution of class 0 (`y=0`), while the light orange histogram indicates the distribution of class 1 (`y=1`). \n",
- "The black line indicates the probability function that is used to assign class 0 or 1. In this toy example, it's a step function."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 5/5 [00:00<00:00, 17985.87it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "counts by TreeBucketer: [4000 1998 2001 936 1065]\n",
- "counts by QuantileBucketer: [625 625 625 625 625 625 625 625 625 625 625 625 625 625 625 625]\n"
- ]
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Try a tree bucketer\n",
- "tb = TreeBucketer(\n",
- " inf_edges=True,\n",
- " max_depth=4,\n",
- " criterion=\"entropy\",\n",
- " min_samples_leaf=400, # Minimum number of entries in the bins\n",
- " min_impurity_decrease=0.001,\n",
- ").fit(x, y)\n",
- "\n",
- "counts_tree, boundaries_tree = tb.counts_, tb.boundaries_\n",
- "\n",
- "df_tree = pd.DataFrame({\"x\": x, \"y\": y, \"probs\": probs})\n",
- "\n",
- "df_tree[\"label\"] = pd.cut(x, bins=boundaries_tree, include_lowest=True)\n",
- "\n",
- "# Try a quantile bucketer\n",
- "myQuantileBucketer = QuantileBucketer(bin_count=16)\n",
- "myQuantileBucketer.fit(x)\n",
- "q_boundaries = myQuantileBucketer.boundaries_\n",
- "q_counts = myQuantileBucketer.counts_\n",
- "\n",
- "df_q = pd.DataFrame({\"x\": x, \"y\": y, \"probs\": probs})\n",
- "df_q[\"label\"] = pd.cut(x, bins=q_boundaries, include_lowest=True)\n",
- "\n",
- "\n",
- "fig, ax = plt.subplots(1, 2, figsize=(12, 5))\n",
- "\n",
- "for label in df_tree.label.unique():\n",
- " df_tree[df_tree.label == label].plot(ax=ax[0], x=\"x\", y=\"probs\", legend=False)\n",
- " ax[0].scatter(df_tree[df_tree.label == label][\"x\"].mean(), df_tree[df_tree.label == label][\"y\"].mean())\n",
- " ax[0].set_title(\"Tree bucketer\")\n",
- "\n",
- "for label in df_q.label.unique():\n",
- " df_q[df_q.label == label].plot(ax=ax[1], x=\"x\", y=\"probs\", legend=False)\n",
- " ax[1].scatter(df_q[df_q.label == label][\"x\"].mean(), df_q[df_q.label == label][\"y\"].mean())\n",
- " ax[1].set_title(\"Quantile bucketer\")\n",
- "\n",
- "print(f\"counts by TreeBucketer: {counts_tree}\")\n",
- "print(f\"counts by QuantileBucketer: {q_counts}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Comparing the `TreeBucketer` and the `QuantileBucketer` (the dots show the average fraction of class 1 in each bin): \n",
- "Each bucket obtained by the `TreeBucketer` follows the probability distribution (i.e. the entries in the bucket have the same probability of being class 1). \n",
- "In contrast, the `QuantileBucketer` splits the values below 4 into 6 buckets, which all have the same probability of being class 1. \n",
- "Note also that the tree is grown with the maximum depth of 4, which potentially lets it grow up to 16 buckets ($2^4$). \n",
- "\n",
- "The learned tree is visualized below, where the splitting according to the step function is clearly visible.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "from sklearn.tree import plot_tree\n",
- "\n",
- "fig, ax = plt.subplots(figsize=(12, 5))\n",
- "tre_out = plot_tree(tb.tree, ax=ax)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.3-final"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/tutorials/nb_distribution_statistics.ipynb b/docs/tutorials/nb_distribution_statistics.ipynb
deleted file mode 100644
index 9c1e87a2..00000000
--- a/docs/tutorials/nb_distribution_statistics.ipynb
+++ /dev/null
@@ -1,513 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Univariate Distribution Similarity\n",
- "\n",
- "[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ing-bank/probatus/blob/master/docs/tutorials/nb_distribution_statistics.ipynb)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "There are many situations when you want to perform univariate distribution comparison of a given feature, e.g. stability of the feature over different months.\n",
- "\n",
- "In order to do that, you can use statistical tests. In this tutorial we present how to easily do this using the `DistributionStatistics` class, and with the statistical tests directly.\n",
- "\n",
- "Available tests:\n",
- "- `'ES'`: Epps-Singleton\n",
- "- `'KS'`: Kolmogorov-Smirnov\n",
- "- `'PSI'`: Population Stability Index\n",
- "- `'SW'`: Shapiro-Wilk\n",
- "- `'AD'`: Anderson-Darling\n",
- "\n",
- "Details on the available tests can be found [here](https://ing-bank.github.io/probatus/api/stat_tests.html#available-tests).\n",
- "\n",
- "You can perform all these tests using a convenient wrapper class called `DistributionStatistics`.\n",
- "\n",
- "In this tutorial we will focus on how to perform two useful tests: Population Stability Index (widely applied in banking industry) and Kolmogorov-Smirnov."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip install probatus"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "%load_ext autoreload\n",
- "%autoreload 2\n",
- "\n",
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "from probatus.binning import QuantileBucketer\n",
- "from probatus.stat_tests import DistributionStatistics, ks, psi"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's define some test distributions and visualize them. For these examples, we will use a normal distribution and a shifted version of the same distribution."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "counts = 1000\n",
- "np.random.seed(0)\n",
- "d1 = pd.Series(np.random.normal(size=counts), name=\"feature_1\")\n",
- "d2 = pd.Series(np.random.normal(loc=0.5, size=counts), name=\"feature_1\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "from probatus.utils.plots import plot_distributions_of_feature\n",
- "\n",
- "feature_distributions = [d1, d2]\n",
- "sample_names = [\"expected\", \"actual\"]\n",
- "plot_distributions_of_feature(feature_distributions, sample_names=sample_names, plot_perc_outliers_removed=0.01)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Binning - QuantileBucketer\n",
- "\n",
- "To visualize the data, we will bin the data using a quantile bucketer, available in the `probatus.binning` module.\n",
- "\n",
- "Binning is used by the tests in `probatus.stat_tests` that operate on grouped observations, such as the PSI."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Bincounts for d1 and d2:\n",
- "[100 100 100 100 100 100 100 100 100 100]\n",
- "[ 25 62 50 68 76 90 84 169 149 227]\n"
- ]
- }
- ],
- "source": [
- "bins = 10\n",
- "myBucketer = QuantileBucketer(bins)\n",
- "d1_bincounts = myBucketer.fit_compute(d1)\n",
- "d2_bincounts = myBucketer.compute(d2)\n",
- "\n",
- "print(\"Bincounts for d1 and d2:\")\n",
- "print(d1_bincounts)\n",
- "print(d2_bincounts)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's plot the distribution for which we will calculate the statistics."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure(figsize=(20, 5))\n",
- "plt.bar(range(0, len(d1_bincounts)), d1_bincounts, label=\"d1: expected\")\n",
- "plt.bar(range(0, len(d2_bincounts)), d2_bincounts, label=\"d2: actual\", alpha=0.5)\n",
- "plt.title(\"PSI (bucketed)\", fontsize=16, fontweight=\"bold\")\n",
- "plt.legend(fontsize=15)\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "By visualizing the bins, we can already notice that the distributions are different.\n",
- "\n",
- "Let's use a statistical test to check that."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## PSI - Population Stability Index\n",
- "The population stability index ([Karakoulas, 2004](https://cms.rmau.org/uploadedFiles/Credit_Risk/Library/RMA_Journal/Other_Topics_(1998_to_present)/Empirical%20Validation%20of%20Retail%20Credit-Scoring%20Models.pdf)) has long been used to evaluate distribution similarity in the banking industry, while developing credit decision models.\n",
- "\n",
- "In `probatus` we have implemented the PSI according to [Yurdakul 2018](https://scholarworks.wmich.edu/cgi/viewcontent.cgi?article=4249&context=dissertations), which derives a p-value from the hard-to-interpret PSI statistic. Using the p-value is a more reliable choice, because the banking-industry-standard PSI critical values of 0.1 and 0.25 are unreliable heuristics that depend strongly on the sample sizes and the number of bins. Aside from these heuristics, the PSI value is not easily interpretable in the context of common statistical frameworks (like a p-value or confidence levels)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "PSI = 0.33942407655561885\n",
- "\n",
- "PSI: Critical values defined according to de facto industry standard:\n",
- "PSI > 0.25: Significant distribution change; investigate.\n",
- "\n",
- "PSI: Critical values defined according to Yurdakul (2018):\n",
- "99.9% confident distributions have changed.\n"
- ]
- }
- ],
- "source": [
- "psi_value, p_value = psi(d1_bincounts, d2_bincounts, verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Based on the above test, the distributions of the two samples differ significantly.\n",
- "Not only is the PSI statistic above the commonly used critical value, but also the p-value shows a very high confidence."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## PSI with DistributionStatistics "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Using the `DistributionStatistics` class, one can apply the above test without the need to perform the binning manually. We initialize a `DistributionStatistics` instance with the desired test, a `binning_strategy` (or `\"default\"` to use the test's most appropriate binning strategy) and the number of bins. Then we run the test with the unbinned values as input."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "PSI = 0.33942407655561885\n",
- "\n",
- "PSI: Critical values defined according to de facto industry standard:\n",
- "PSI > 0.25: Significant distribution change; investigate.\n",
- "\n",
- "PSI: Critical values defined according to Yurdakul (2018):\n",
- "99.9% confident distributions have changed.\n"
- ]
- }
- ],
- "source": [
- "distribution_test = DistributionStatistics(\"psi\", binning_strategy=\"default\", bin_count=10)\n",
- "psi_value, p_value = distribution_test.compute(d1, d2, verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## KS: Kolmogorov-Smirnov with DistributionStatistics\n",
- "The Kolmogorov-Smirnov test compares two distributions by calculating the maximum difference of the two samples' distribution functions, as illustrated by the black arrow in the following figure. The KS test is available in `probatus.stat_tests.ks`.\n",
- "\n",
- "\n",
- "\n",
- "The main advantage of this method is its sensitivity to differences in both location and shape of the empirical cumulative distribution functions of the two samples.\n",
- "\n",
- "The main disadvantages are that it only works for continuous distributions (unless modified, see e.g. [Jeng 2006](https://bmcmedresmethodol.biomedcentral.com/track/pdf/10.1186/1471-2288-6-45)); in large samples, small and unimportant differences can be statistically significant ([Taplin & Hunt 2019](https://www.mdpi.com/2227-9091/7/2/53/pdf)); and in small samples, large and important differences can be statistically insignificant ([Taplin & Hunt 2019](https://www.mdpi.com/2227-9091/7/2/53/pdf))."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Unlike the PSI, the KS test can be applied directly to the unbinned samples:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "KS: pvalue = 2.104700973377179e-27\n",
- "\n",
- "KS: Null hypothesis rejected with 99% confidence. Distributions very different.\n"
- ]
- }
- ],
- "source": [
- "k_value, p_value = ks(d1, d2, verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Again, we can also run the same test through the `DistributionStatistics` class, here with `binning_strategy=None` so that no binning is applied."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "KS: pvalue = 2.104700973377179e-27\n",
- "\n",
- "KS: Null hypothesis rejected with 99% confidence. Distributions very different.\n"
- ]
- }
- ],
- "source": [
- "distribution_test = DistributionStatistics(\"ks\", binning_strategy=None)\n",
- "ks_value, p_value = distribution_test.compute(d1, d2, verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## AutoDist"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [],
- "source": [
- "from probatus.stat_tests import AutoDist"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Multiple statistics can automatically be calculated using `AutoDist`. To show this, let's create two new dataframes with two features each."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [],
- "source": [
- "size, n_features = 100, 2\n",
- "\n",
- "df1 = pd.DataFrame(np.random.normal(size=(size, n_features)), columns=[f\"feat_{x}\" for x in range(n_features)])\n",
- "df2 = pd.DataFrame(np.random.normal(size=(size, n_features)), columns=[f\"feat_{x}\" for x in range(n_features)])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now specify the statistical tests we want to perform and the binning strategies to apply. We can also set both of these variables to `'all'`, or set the binning strategies to `'default'` to use the default binning strategy for every chosen statistical test."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [],
- "source": [
- "statistical_tests = [\"KS\", \"PSI\"]\n",
- "binning_strategies = \"default\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's compute the statistics and their p-values:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 2/2 [00:00<00:00, 141.92it/s]\n",
- "100%|██████████| 2/2 [00:00<00:00, 139.13it/s]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- " column p_value_KS_no_bucketing_0 p_value_PSI_quantilebucketer_10 \\\n",
- "0 feat_0 0.815415 0.443244 \n",
- "1 feat_1 0.281942 0.010922 \n",
- "\n",
- " statistic_KS_no_bucketing_0 statistic_PSI_quantilebucketer_10 \n",
- "0 0.09 0.192113 \n",
- "1 0.14 0.374575 "
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "myAutoDist = AutoDist(statistical_tests=statistical_tests, binning_strategies=binning_strategies, bin_count=10)\n",
- "myAutoDist.compute(df1, df2)"
- ]
- }
- ],
- "metadata": {
- "environment": {
- "name": "common-cpu.m48",
- "type": "gcloud",
- "uri": "gcr.io/deeplearning-platform-release/base-cpu:m48"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.8"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/tutorials/nb_imputation_comparison.ipynb b/docs/tutorials/nb_imputation_comparison.ipynb
deleted file mode 100644
index c6cf2cd2..00000000
--- a/docs/tutorials/nb_imputation_comparison.ipynb
+++ /dev/null
@@ -1,324 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Imputation Comparison\n",
- "\n",
- "[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ing-bank/probatus/blob/master/docs/tutorials/nb_imputation_comparison.ipynb)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This notebook explains how the `ImputationSelector` class works in `probatus`. With `ImputationSelector` you can compare multiple imputation strategies\n",
- "and choose the strategy that works best for a given model and dataset.\n",
- "Currently `ImputationSelector` supports any [scikit-learn](https://scikit-learn.org/stable/) compatible imputation strategy. For categorical variables, the missing values are replaced by a `missing` token and `OneHotEncoder` is applied. The user-supplied imputation strategies are applied to numerical columns only. \n",
- "Support for user-supplied imputation strategies for categorical columns may be added in future releases.\n",
- "\n",
- "Let us look at an example and start by importing all the required classes and methods.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "###Install the packages\n",
- "# %%capture\n",
- "#!pip install probatus\n",
- "#!pip install lightgbm"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n"
- ]
- }
- ],
- "source": [
- "%matplotlib inline\n",
- "%load_ext autoreload\n",
- "%autoreload 2\n",
- "import pandas as pd\n",
- "\n",
- "pd.set_option(\"display.max_columns\", 100)\n",
- "pd.set_option(\"display.max_row\", 500)\n",
- "pd.set_option(\"display.max_colwidth\", 200)\n",
- "import lightgbm as lgb\n",
- "from sklearn.datasets import make_classification\n",
- "from sklearn.experimental import enable_iterative_imputer\n",
- "\n",
- "from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer\n",
- "from sklearn.linear_model import LogisticRegression\n",
- "\n",
- "from probatus.missing_values.imputation import ImputationSelector\n",
- "from probatus.utils.missing_helpers import generate_MCAR"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's create a classification dataset to apply the various imputation strategies.\n",
- "\n",
- "We'll use the `probatus.utils.missing_helpers.generate_MCAR` function to randomly add missing values to the dataset."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Shape of X,y : (2000, 20),(2000,)\n"
- ]
- }
- ],
- "source": [
- "n_features = 20\n",
- "X, y = make_classification(n_samples=2000, n_features=n_features, random_state=123, class_sep=0.3)\n",
- "X = pd.DataFrame(X, columns=[\"f_\" + str(i) for i in range(0, n_features)])\n",
- "print(f\"Shape of X,y : {X.shape},{y.shape}\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " 0\n",
- "f_0 0.2080\n",
- "f_1 0.1960\n",
- "f_2 0.1990\n",
- "f_3 0.2095\n",
- "f_4 0.2150"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "X_missing = generate_MCAR(X, missing=0.2)\n",
- "missing_stats = pd.DataFrame(X_missing.isnull().mean())\n",
- "\n",
- "missing_stats.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The data has approximately 20% missing values in each feature."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Imputation Strategies"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Create a dictionary with all the strategies to compare. Also, create a classifier to use for evaluating various strategies.\n",
- "If the model supports handling of missing features by default then the model performance on an unimputed dataset is calculated. You can indicate that the model supports handling missing values by setting the parameter `model_na_support=True`.\n",
- "The model performance against the unimputed dataset can be found in `No Imputation` results."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "from sklearn.pipeline import Pipeline\n",
- "from sklearn.preprocessing import StandardScaler\n",
- "\n",
- "steps = [(\"scaler\", StandardScaler()), (\"LR\", LogisticRegression())]\n",
- "clf = Pipeline(steps)\n",
- "cmp = ImputationSelector(clf=clf, strategies=strategies, cv=5, model_na_support=False)\n",
- "cmp.fit_compute(X_missing, y)\n",
- "result_plot = cmp.plot()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "jp-MarkdownHeadingCollapsed": true
- },
- "source": [
- "## Scikit-learn Compatible Imputers\n",
- "\n",
- "You can also use any other scikit-learn compatible imputer as an imputation strategy.\n",
- "For example, the [feature engine](https://feature-engine.readthedocs.io/en/latest/index.html) library provides a host of other imputation strategies, which you can pass for comparison as well."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/tutorials/nb_metric_volatility.ipynb b/docs/tutorials/nb_metric_volatility.ipynb
deleted file mode 100644
index 24b036b8..00000000
--- a/docs/tutorials/nb_metric_volatility.ipynb
+++ /dev/null
@@ -1,342 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Metric Volatility Estimation\n",
- "\n",
- "[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ing-bank/probatus/blob/master/docs/tutorials/nb_metric_volatility.ipynb)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The estimated AUC of your model can be influenced by, for instance, how you split your data. With a different random seed, your AUC could be, say, 3% lower. To understand how stable your model evaluation is, and what performance you can expect from your model on average, you can use the `metric_volatility` module.\n",
- "\n",
- "### Setup"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip install probatus"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.datasets import make_classification\n",
- "from sklearn.ensemble import RandomForestClassifier\n",
- "\n",
- "from probatus.metric_volatility import BootstrappedVolatility, SplitSeedVolatility, TrainTestVolatility\n",
- "\n",
- "X, y = make_classification(n_samples=1000, n_features=10, random_state=1)\n",
- "clf = RandomForestClassifier(n_estimators=2, max_depth=2, random_state=0)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### TrainTestVolatility\n",
- "`TrainTestVolatility` is the class that offers the widest range of options for experimenting with metric volatility. Please refer to the API reference for a full description of the available parameters.\n",
- "\n",
- "By default, the class performs a simple experiment in which it computes the metrics on data split into train and test sets, using a different random seed at each iteration. Once the mean and standard deviation of the metrics are computed, you can analyse the impact of the random seed on your results and get a better estimate of performance on this dataset.\n",
- "\n",
- "When you run `fit()` followed by `compute()`, or `fit_compute()`, the experiment described above is performed and the report is returned. The `train_mean` and `test_mean` show the averaged performance of the model, and `delta_mean` indicates how much the model overfits the data on average.\n",
- "\n",
- "By looking at `train_std`, `test_std`, `delta_std`, you can assess the stability of these scores overall. High volatility on some of the splits may indicate the need to change the sizes of these splits or make changes to the model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr><th></th><th>train_mean</th><th>train_std</th><th>test_mean</th><th>test_std</th><th>delta_mean</th><th>delta_std</th></tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr><th>roc_auc</th><td>0.831818</td><td>0.036407</td><td>0.816538</td><td>0.043732</td><td>0.01528</td><td>0.027516</td></tr>\n",
- "  </tbody>\n",
- "</table>"
- ],
- "text/plain": [
- " train_mean train_std test_mean test_std delta_mean delta_std\n",
- "roc_auc 0.831818 0.036407 0.816538 0.043732 0.01528 0.027516"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Basic functionality\n",
- "volatility = TrainTestVolatility(clf, iterations=50)\n",
- "volatility.fit_compute(X, y)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The results above are quite unstable, as shown by the high `train_std` and `test_std`. However, the `delta_mean` is relatively low, which indicates that the model might underfit and that increasing the complexity of the model could improve the results.\n",
- "\n",
- "One can also present the distributions of train, test and delta values for each metric. The plots allow for a sensitivity analysis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "axs = volatility.plot()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To simplify the use of this class, two convenience classes have been created that perform the main types of analyses with fewer parameters for the user to set."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### SplitSeedVolatility \n",
- "\n",
- "The volatility is estimated in the same way as in the default analysis described for `TrainTestVolatility`. The main advantage of using this class is that there are fewer parameters to set."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr><th></th><th>train_mean</th><th>train_std</th><th>test_mean</th><th>test_std</th><th>delta_mean</th><th>delta_std</th></tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr><th>roc_auc</th><td>0.827796</td><td>0.039356</td><td>0.804926</td><td>0.040501</td><td>0.02287</td><td>0.019264</td></tr>\n",
- "  </tbody>\n",
- "</table>"
- ],
- "text/plain": [
- " train_mean train_std test_mean test_std delta_mean delta_std\n",
- "roc_auc 0.827796 0.039356 0.804926 0.040501 0.02287 0.019264"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "volatility = SplitSeedVolatility(clf, iterations=50, test_prc=0.5)\n",
- "volatility.fit_compute(X, y)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### BootstrappedVolatility\n",
- "\n",
- "This class performs a different experiment. At each iteration, the train-test split is the same; however, the samples in both splits are bootstrapped (sampled with replacement). Thus, some of the samples might be omitted, and some will be used multiple times in a given run.\n",
- "\n",
- "With this experiment, you can estimate the average performance for a specific train-test split, and see how sensitive the scores are to the particular samples within your splits. Moreover, you can experiment with the amount of data sampled in each split to tune the test split size."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "