[WIP] ENH: Hellinger distance tree split criterion for imbalanced data classification #437

Open
wants to merge 61 commits into
base: master
61 commits
e58f628
added cython implementation of hellinger distance criterion compatibl…
EvgeniDubov Jul 11, 2018
649e204
added usage example
EvgeniDubov Jul 11, 2018
689a41b
added README
EvgeniDubov Jul 11, 2018
e4e13a7
update license
EvgeniDubov Jul 11, 2018
e33d8a3
Fixed pep8 issues in the example
EvgeniDubov Jul 12, 2018
f08534b
Fixed pep8 issues in the setup
EvgeniDubov Jul 12, 2018
d655b60
added support for cython build based on https://github.com/jakevdp/cy…
EvgeniDubov Jul 16, 2018
8d779e0
Merge branch 'hellinger_distance_criterion' of https://github.com/Evg…
EvgeniDubov Jul 16, 2018
b94ed53
updated 'whats new'
EvgeniDubov Jul 25, 2018
3e265ce
updated the example
EvgeniDubov Jul 25, 2018
097a582
updates user guide and api
EvgeniDubov Jul 25, 2018
0517276
fixed LGTM issues
EvgeniDubov Jul 25, 2018
65a4c62
Merge branch 'master' into hellinger_distance_criterion
EvgeniDubov Jul 30, 2018
a700cbd
Merged with https://github.com/glemaitre/imbalanced-learn/commit/27ff…
EvgeniDubov Oct 2, 2018
13bc07d
Merge branch 'hellinger_distance_criterion' of https://github.com/Evg…
EvgeniDubov Oct 2, 2018
f6f402f
Merge branch 'master' into hellinger_distance_criterion
EvgeniDubov Oct 2, 2018
d8eb231
fixed setup
EvgeniDubov Oct 2, 2018
118bf23
adding __init__.py to tree
EvgeniDubov Oct 6, 2018
01506e6
added Cython as dependency in appveyor
EvgeniDubov Oct 8, 2018
817182f
Merge branch 'master' into hellinger_distance_criterion
EvgeniDubov Oct 13, 2018
5953be8
renamed tree_split to tree in documentation and example
EvgeniDubov Oct 13, 2018
1896002
restored criterion pxd
EvgeniDubov Oct 13, 2018
443a91b
doc update
EvgeniDubov Dec 30, 2018
eec018d
Merge remote-tracking branch 'remotes/upstream/master' into hellinger…
EvgeniDubov Dec 30, 2018
12df098
add Cython to travis install list
EvgeniDubov Dec 30, 2018
2bdc7dd
fixed travis config
EvgeniDubov Dec 30, 2018
e06caea
added cython special install in travis conda
EvgeniDubov Dec 30, 2018
55f67e7
added cython to travis ubuntu
EvgeniDubov Dec 30, 2018
6049db4
- fixed Hellinger tree example to pass travis
EvgeniDubov Dec 30, 2018
aa7348f
added pandas to travis install list
EvgeniDubov Dec 30, 2018
699ce53
added cython to appveyor
EvgeniDubov Dec 30, 2018
f4a9bfa
- changed tree example to pass travus
EvgeniDubov Dec 30, 2018
f21c208
turned appveyor build on
EvgeniDubov Dec 30, 2018
c022715
turned appveyor build off
EvgeniDubov Dec 30, 2018
005a7b7
Merge pull request #1 from scikit-learn-contrib/master
EvgeniDubov May 27, 2019
9a5b32b
appveyor.yml - trying to fix appveyor errors by imblearn build install
EvgeniDubov May 27, 2019
0cd8d1f
Revert appveyor change
EvgeniDubov May 27, 2019
3dfc304
Merge remote-tracking branch 'remotes/origin/master' into hellinger_d…
EvgeniDubov Jun 30, 2019
d6914c0
Merge remote-tracking branch 'upstream/master' into hellinger_distanc…
EvgeniDubov Aug 15, 2019
810498e
Synced __check_build\setup.py with sklearn script
EvgeniDubov Aug 15, 2019
64db93f
aligned to master
EvgeniDubov Aug 15, 2019
e72f2d6
Fixed travis issue according to [MRG] 👽 Maintenance for `imblearn.sho…
EvgeniDubov Aug 15, 2019
5a4e53e
fixed versions
EvgeniDubov Aug 15, 2019
e1316b2
fixed travis
EvgeniDubov Aug 15, 2019
4af0af8
commented out hellinger usage example to narrow down travis failure r…
EvgeniDubov Aug 15, 2019
97ef77f
added hellinger usage example to tree.rst
EvgeniDubov Aug 15, 2019
a7855a7
- added Cython temp files to git ignore
EvgeniDubov Oct 10, 2019
ccbbaaf
Merge remote-tracking branch 'upstream/master' into hellinger_distanc…
EvgeniDubov Oct 10, 2019
21e6909
documentation update
EvgeniDubov Oct 10, 2019
9268ee9
added cython installation to travis
EvgeniDubov Oct 10, 2019
e4c5360
fix few LGTM issues
EvgeniDubov Oct 10, 2019
fc9e483
fix LGTM issue
EvgeniDubov Oct 10, 2019
1eee5af
Merge remote-tracking branch 'upstream/master' into hellinger_distanc…
EvgeniDubov Dec 23, 2019
a3cfa7d
travis fix
EvgeniDubov Dec 23, 2019
0bb474c
fix appveyor
EvgeniDubov Dec 23, 2019
2fc250e
updated MANIFEST.in
EvgeniDubov Dec 23, 2019
008b808
aligned setup file to master
EvgeniDubov Dec 23, 2019
6e67b96
fixed lint issues
EvgeniDubov Dec 23, 2019
3afdbb4
fix lint issues
EvgeniDubov Dec 23, 2019
806cc7b
fixed lint issues
EvgeniDubov Dec 23, 2019
7acac1f
fix lint issues
EvgeniDubov Dec 23, 2019
3 changes: 2 additions & 1 deletion .gitignore
@@ -119,4 +119,5 @@ cythonize.dat
doc/_build/
doc/auto_examples/
doc/generated/
doc/bibtex/auto
doc/bibtex/auto
/imblearn/tree/*.c
6 changes: 3 additions & 3 deletions .travis.yml
@@ -33,13 +33,13 @@ matrix:
- env: DISTRIB="ubuntu" TEST_DOC="true" TEST_NUMPYDOC="false"
# Latest release
- env: DISTRIB="conda" PYTHON_VERSION="3.7"
NUMPY_VERSION="*" SCIPY_VERSION="*" SKLEARN_VERSION="master"
NUMPY_VERSION="*" SCIPY_VERSION="*" SKLEARN_VERSION="master" CYTHON_VERSION="*"
OPTIONAL_DEPS="keras" TEST_DOC="true" TEST_NUMPYDOC="false"
- env: DISTRIB="conda" PYTHON_VERSION="3.7"
NUMPY_VERSION="*" SCIPY_VERSION="*" SKLEARN_VERSION="master"
NUMPY_VERSION="*" SCIPY_VERSION="*" SKLEARN_VERSION="master" CYTHON_VERSION="*"
OPTIONAL_DEPS="tensorflow" TEST_DOC="true" TEST_NUMPYDOC="false"
- env: DISTRIB="conda" PYTHON_VERSION="3.7"
NUMPY_VERSION="*" SCIPY_VERSION="*" SKLEARN_VERSION="master"
NUMPY_VERSION="*" SCIPY_VERSION="*" SKLEARN_VERSION="master" CYTHON_VERSION="*"
OPTIONAL_DEPS="false" TEST_DOC="false" TEST_NUMPYDOC="true"

install: source build_tools/travis/install.sh
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1,6 +1,7 @@

recursive-include doc *
recursive-include examples *
recursive-include imblearn/tree *.pyx
recursive-include imblearn/tree *.pxd
include AUTHORS.rst
include CONTRIBUTING.ms
include LICENSE
1 change: 1 addition & 0 deletions appveyor.yml
@@ -37,6 +37,7 @@ install:
- activate testenv
- conda install scipy numpy joblib -y -q
- pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn
- conda install -c anaconda cython -y -q
- conda install %OPTIONAL_DEP% -y -q
- conda install pytest pytest-cov -y -q
- pip install codecov
4 changes: 3 additions & 1 deletion build_tools/travis/install.sh
@@ -33,7 +33,7 @@ if [[ "$DISTRIB" == "conda" ]]; then
# provided versions
conda create -n testenv --yes python=$PYTHON_VERSION pip
source activate testenv
conda install --yes numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION
conda install --yes numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION cython=$CYTHON_VERSION

if [[ "$OPTIONAL_DEPS" == "keras" ]]; then
conda install --yes pandas keras tensorflow=1
@@ -66,12 +66,14 @@ elif [[ "$DISTRIB" == "ubuntu" ]]; then
pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn
pip3 install pandas
pip3 install pytest pytest-cov codecov sphinx numpydoc
pip3 install cython

fi

python --version
python -c "import numpy; print('numpy %s' % numpy.__version__)"
python -c "import scipy; print('scipy %s' % scipy.__version__)"
python -c "import Cython; print('Cython %s' % Cython.__version__)"

pip install -e .
ccache --show-stats
1 change: 1 addition & 0 deletions build_tools/travis/test_script.sh
@@ -19,6 +19,7 @@ run_tests(){
python --version
python -c "import numpy; print('numpy %s' % numpy.__version__)"
python -c "import scipy; print('scipy %s' % scipy.__version__)"
python -c "import Cython; print('Cython %s' % Cython.__version__)"
python -c "import multiprocessing as mp; print('%d CPUs' % mp.cpu_count())"

pytest --cov=$MODULE -r sx --pyargs $MODULE
23 changes: 23 additions & 0 deletions doc/api.rst
@@ -196,6 +196,29 @@ Imbalance-learn provides some fast-prototyping tools.

.. _tree_ref:

:mod:`imblearn.tree`: Tree split criterion
==========================================

.. automodule:: imblearn.tree
    :no-members:
    :no-inherited-members:

.. currentmodule:: imblearn

.. autosummary::
   :toctree: generated/
   :template: class.rst

   tree.criterion.HellingerDistanceCriterion


.. _metrics_ref:

:mod:`imblearn.metrics`: Metrics
================================

2 changes: 1 addition & 1 deletion doc/miscellaneous.rst
@@ -169,4 +169,4 @@ will be passed to ``fit_generator``::

.. topic:: References

* :ref:`sphx_glr_auto_examples_applications_porto_seguro_keras_under_sampling.py`
* :ref:`sphx_glr_auto_examples_applications_porto_seguro_keras_under_sampling.py`
26 changes: 26 additions & 0 deletions doc/tree.rst
@@ -0,0 +1,26 @@
.. _tree-split:

==============
Tree-split
==============

.. currentmodule:: imblearn.tree

.. _hellinger_distance:


Hellinger Distance split
========================

Hellinger distance quantifies the similarity between two probability distributions.
When used as the split criterion of a decision tree classifier, it makes the tree insensitive to class skew and so helps with the class imbalance problem.

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from imblearn.tree.criterion import HellingerDistanceCriterion

>>> # arguments: n_outputs=1 and n_classes=[2], i.e. one binary output
>>> hdc = HellingerDistanceCriterion(1, np.array([2], dtype='int64'))
>>> clf = RandomForestClassifier(criterion=hdc)

:class:`HellingerDistanceCriterion` offers a Cython implementation of Hellinger Distance
as a criterion for decision tree split compatible with sklearn tree based classification models.
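
For reference, the Hellinger distance between two discrete distributions P and Q is commonly defined as H(P, Q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1]. The following is a minimal pure-Python sketch of that definition. The function name `hellinger_distance` is illustrative only and is not part of `imblearn` or of this pull request.

```python
from math import sqrt


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    `p` and `q` are sequences of probabilities over the same outcomes.
    Illustrative helper only; it is not part of imblearn.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Sum of squared differences between the square roots of the probabilities.
    squared = sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return sqrt(squared) / sqrt(2)


# Identical distributions are at distance 0, disjoint ones at distance 1.
print(hellinger_distance([0.5, 0.5], [0.5, 0.5]))
print(hellinger_distance([1.0, 0.0], [0.0, 1.0]))
```

A split criterion built on this distance compares how each class is distributed over the children of a split, rather than how many samples of each class there are in total.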
1 change: 1 addition & 0 deletions doc/user_guide.rst
@@ -12,6 +12,7 @@ User Guide
introduction.rst
over_sampling.rst
under_sampling.rst
tree.rst
combine.rst
ensemble.rst
miscellaneous.rst
3 changes: 3 additions & 0 deletions doc/whats_new/v0.4.rst
@@ -129,6 +129,9 @@ Enhancement
:class:`BorderlineSMOTE` and :class:`SVMSMOTE`.
:issue:`440` by :user:`Guillaume Lemaitre <glemaitre>`.

- Add support for Hellinger Distance as sklearn classification tree split criterion.
By :user:`Evgeni Dubov <EvgeniDubov>`.

- Allow :class:`imblearn.over_sampling.RandomOverSampler` can return indices
using the attributes ``return_indices``.
:issue:`439` by :user:`Hugo Gascon<hgascon>` and
9 changes: 9 additions & 0 deletions examples/tree/README.txt
@@ -0,0 +1,9 @@
.. _tree_examples:

Example using Hellinger Distance as tree split criterion
========================================================

Hellinger distance quantifies the similarity between two probability distributions.
When used as the split criterion of a decision tree classifier, it makes the tree insensitive to class skew and so helps with the class imbalance problem.
This example uses a Cython implementation of Hellinger distance as a split criterion that is compatible with sklearn tree-based classification models.
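
As a rough sketch of what the criterion in this pull request appears to compute for a single binary output: read `sum_left` and `sum_right` in `criterion.pyx` as the per-class weighted sample counts on each side of a candidate split. The symbols below (L_0, L_1, R_0, R_1 for the class-0 and class-1 counts in the left and right child, and v for the two returned values) are notation introduced here, not names from the code.

```latex
\[
  v_{\text{left}} =
    \left( \sqrt{\frac{L_0}{L_0 + R_0}} - \sqrt{\frac{L_1}{L_1 + R_1}} \right)^{2},
  \qquad
  v_{\text{right}} =
    \left( \sqrt{\frac{R_0}{L_0 + R_0}} - \sqrt{\frac{R_1}{L_1 + R_1}} \right)^{2}
\]
```

The value reported for a split is the sum of these two terms. Each fraction is taken within one class, so multiplying all counts of one class by a constant leaves the value unchanged, which is where the skew insensitivity comes from.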

17 changes: 17 additions & 0 deletions examples/tree/train_model_with_hellinger_distance_criterion.py
@@ -0,0 +1,17 @@
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from imblearn.tree.criterion import HellingerDistanceCriterion

X, y = make_classification(
    n_samples=10000, n_features=40, n_informative=5,
    n_classes=2, weights=[0.05, 0.95], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# arguments: n_outputs=1 and n_classes=[2], i.e. one binary output
hdc = HellingerDistanceCriterion(1, np.array([2], dtype='int64'))
clf = RandomForestClassifier(criterion=hdc, max_depth=4, n_estimators=100)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
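
The example reports `clf.score`, which is plain accuracy. With a 5%/95% class split, a model that always predicts the majority class already scores about 0.95, so a per-class metric may be more informative when judging the criterion. The sketch below uses toy labels chosen for illustration (they are not output from the example above), and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: 5 minority (0) and 95 majority (1) samples, scored against
# a classifier that always predicts the majority class.
y_true = [0] * 5 + [1] * 95
y_pred = [1] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In practice `sklearn.metrics.balanced_accuracy_score` computes the same quantity for real predictions.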
Empty file added imblearn/tree/__init__.py
Empty file.
92 changes: 92 additions & 0 deletions imblearn/tree/criterion.pyx
@@ -0,0 +1,92 @@
# Author: Evgeni Dubov <[email protected]>
#
# License: BSD 3 clause

#cython: language_level=3, boundscheck=False

from libc.math cimport sqrt, pow
from libc.math cimport abs

import numpy as np

from sklearn.tree._criterion cimport ClassificationCriterion
from sklearn.tree._criterion cimport SIZE_t

cdef double INFINITY = np.inf


cdef class HellingerDistanceCriterion(ClassificationCriterion):
    """Hellinger distance criterion.

    Scores a candidate split by the Hellinger distance (squared, without
    the 1/2 normalization) between the per-class distributions of samples
    over the left and right children. Only the counts of the first two
    classes of the first output are read.
    """

    cdef double proxy_impurity_improvement(self) nogil:
        cdef:
            double impurity_left
            double impurity_right

        self.children_impurity(&impurity_left, &impurity_right)

        # The splitter keeps the split that maximizes this value.
        return impurity_right + impurity_left

    cdef double impurity_improvement(self, double impurity) nogil:
        cdef:
            double impurity_left
            double impurity_right

        self.children_impurity(&impurity_left, &impurity_right)

        # The parent value `impurity` is not used.
        return impurity_right + impurity_left

    cdef double node_impurity(self) nogil:
        cdef:
            SIZE_t* n_classes = self.n_classes
            double* sum_total = self.sum_total
            double hellinger = 0.0
            double sq_count
            double count_k
            SIZE_t k, c

        # Adds 1.0 per class, so the result does not depend on the samples
        # in the node.
        for k in range(self.n_outputs):
            for c in range(n_classes[k]):
                hellinger += 1.0

        return hellinger / self.n_outputs

    cdef void children_impurity(self, double* impurity_left,
                                double* impurity_right) nogil:
        cdef:
            SIZE_t* n_classes = self.n_classes
            double* sum_left = self.sum_left
            double* sum_right = self.sum_right
            double hellinger_left = 0.0
            double hellinger_right = 0.0
            double count_k1 = 0.0
            double count_k2 = 0.0
            SIZE_t k, c

        # stop splitting in case reached pure node with 0 samples of second
        # class
        if sum_left[1] + sum_right[1] == 0:
            impurity_left[0] = -INFINITY
            impurity_right[0] = -INFINITY
            return

        for k in range(self.n_outputs):
            # Square root of the fraction of each class that falls left.
            if(sum_left[0] + sum_right[0] > 0):
                count_k1 = sqrt(sum_left[0] / (sum_left[0] + sum_right[0]))
            if(sum_left[1] + sum_right[1] > 0):
                count_k2 = sqrt(sum_left[1] / (sum_left[1] + sum_right[1]))

            hellinger_left += pow((count_k1 - count_k2), 2)

            # Square root of the fraction of each class that falls right.
            if(sum_left[0] + sum_right[0] > 0):
                count_k1 = sqrt(sum_right[0] / (sum_left[0] + sum_right[0]))
            if(sum_left[1] + sum_right[1] > 0):
                count_k2 = sqrt(sum_right[1] / (sum_left[1] + sum_right[1]))

            hellinger_right += pow((count_k1 - count_k2), 2)

        impurity_left[0] = hellinger_left / self.n_outputs
        impurity_right[0] = hellinger_right / self.n_outputs
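
To make the arithmetic in `children_impurity` easier to check by hand, here is a pure-Python sketch that mirrors it for one binary output. It assumes `left` and `right` hold the class-0 and class-1 counts of the two children; the function name `hellinger_split_value` is hypothetical and the code is not part of this pull request.

```python
from math import sqrt


def hellinger_split_value(left, right):
    """Pure-Python mirror of children_impurity for one binary output.

    `left` and `right` are (class_0_count, class_1_count) pairs for the two
    children of a candidate split. Returns the left and right values summed,
    as proxy_impurity_improvement does. Illustrative only.
    """
    total_0 = left[0] + right[0]
    total_1 = left[1] + right[1]

    # Same early exit as the Cython code: no class-1 samples in the node.
    if total_1 == 0:
        return float("-inf")

    def root_rate(count, total):
        # Square root of the within-class fraction; 0.0 if the class is absent.
        return sqrt(count / total) if total > 0 else 0.0

    value_left = (root_rate(left[0], total_0) - root_rate(left[1], total_1)) ** 2
    value_right = (root_rate(right[0], total_0) - root_rate(right[1], total_1)) ** 2
    return value_left + value_right


# A split that separates the classes perfectly scores 2.0, a useless one 0.0.
print(hellinger_split_value((50, 0), (0, 50)))    # 2.0
print(hellinger_split_value((25, 25), (25, 25)))  # 0.0

# Skew insensitivity: multiplying the class-0 counts by 10 leaves the
# within-class fractions, and therefore the value, unchanged.
balanced = hellinger_split_value((40, 10), (10, 40))
skewed = hellinger_split_value((400, 10), (100, 40))
print(abs(balanced - skewed) < 1e-12)             # True
```

Because only within-class fractions enter the value, the ranking of candidate splits does not change when one class is over- or under-represented.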
18 changes: 18 additions & 0 deletions imblearn/tree/setup.py
@@ -0,0 +1,18 @@
import numpy


def configuration(parent_package='', top_path=None):
    from numpy.distutils.misc_util import Configuration
    config = Configuration('tree', parent_package, top_path)
    libraries = []
    config.add_extension('criterion',
                         sources=['criterion.c'],
                         include_dirs=[numpy.get_include()],
                         libraries=libraries)

    return config


if __name__ == "__main__":
    from numpy.distutils.core import setup
    setup(**configuration().todict())