Merge branch 'master' of github.com:cgnorthcutt/cleanlab
cgnorthcutt committed Nov 26, 2019
2 parents 67e6a87 + c8dd81e commit f8f131f
Showing 2 changed files with 57 additions and 106 deletions.
8 changes: 8 additions & 0 deletions CONTRIBUTING.md
Contributions to cleanlab are welcome. When contributing, please keep the following in mind:

1. Avoid adding blank lines or spaces to the ends of lines.
2. Follow PEP-8 coding style.
3. Always add tests to check the code you added. Tests go in `cleanlab/tests/`.
4. Submit a pull request (it will automatically run your code and make sure nothing breaks).

Thanks!
155 changes: 49 additions & 106 deletions README.rst

|
``cleanlab`` is a machine learning Python package for **learning with noisy labels** and **finding label errors in datasets**. ``cleanlab`` CLEANs LABels. It is powered by the theory of **confident learning**, published in `this paper <https://arxiv.org/abs/1911.00068>`__ and explained in `this blog <https://l7.curtisnorthcutt.com/confident-learning>`__. Using the `confidentlearning-reproduce <https://github.com/cgnorthcutt/confidentlearning-reproduce>`__ repo, ``cleanlab`` v0.1.0 reproduces results in `the CL paper <https://arxiv.org/abs/1911.00068>`__.


|pypi| |py_versions| |build_status| |coverage|
.. |coverage| image:: https://codecov.io/gh/cgnorthcutt/cleanlab/branch/master/graph/badge.svg
:target: https://codecov.io/gh/cgnorthcutt/cleanlab

``cleanlab`` **documentation is available in** `this blog post <https://l7.curtisnorthcutt.com/cleanlab-python-package>`__.

So fresh, so ``cleanlab``
=========================

``cleanlab`` finds and cleans label errors in any dataset using `state-of-the-art algorithms <https://arxiv.org/abs/1911.00068>`__ to find label errors, characterize noise, and learn in spite of it. ``cleanlab`` is fast: it's built on optimized algorithms and parallelized across CPU threads automatically. ``cleanlab`` is powered by `provable guarantees <https://arxiv.org/abs/1911.00068>`__ of exact noise estimation and label error finding in realistic cases when model output probabilities are erroneous. ``cleanlab`` supports multi-label, multiclass, sparse matrices, etc. By default, ``cleanlab`` requires no hyper-parameters.


**How does confident learning work?** See: `TUTORIAL: confident learning with just numpy and for-loops <https://github.com/cgnorthcutt/cleanlab/blob/master/examples/simplifying_confident_learning_tutorial.ipynb>`__.

``cleanlab`` supports multi-label, multiclass, sparse matrices, and more.

``cleanlab`` is:

4. unique - The only package for multiclass learning with noisy labels or finding label errors for any dataset / classifier.


Find label errors with PyTorch, Tensorflow, MXNet, etc. in 1 line of code.
==========================================================================

.. code:: python

    # psx is the n x m matrix of out-of-sample predicted probabilities, computed with any model.
    from cleanlab.pruning import get_noise_indices

    ordered_label_errors = get_noise_indices(
        s=train_labels_with_errors,               # numpy array of noisy labels
        psx=psx,                                  # out-of-sample predicted probabilities
        sorted_index_method='normalized_margin', # Orders label errors
    )

Pre-computed out-of-sample predicted probabilities for CIFAR-10 train set are available here: [`LINK <https://github.com/cgnorthcutt/confidentlearning-reproduce/blob/master/README.md#need-out-of-sample-predicted-probabilities-for-cifar-10-train-set>`__].
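
If you download those probabilities, a minimal loading sketch, assuming they ship as a numpy ``.npy`` array (the filename below is hypothetical; check the linked README for the actual file names and format):

.. code:: python

    import numpy as np

    # Hypothetical filename; the linked repo documents the actual artifact names.
    psx = np.load('cifar10_train_psx.npy')  # shape: (n_examples, n_classes)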

Learning with noisy labels in 3 lines of code!
==============================================
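
The full example is shown later in this README; a minimal sketch of the three-line pattern, assuming numpy arrays ``X_train``, ``train_labels_with_errors``, and ``X_test``, with ``LogisticRegression`` standing in for any scikit-learn compatible classifier:

.. code:: python

    from cleanlab.classification import LearningWithNoisyLabels
    from sklearn.linear_model import LogisticRegression

    lnl = LearningWithNoisyLabels(clf=LogisticRegression())  # 1. Wrap any classifier.
    lnl.fit(X_train, train_labels_with_errors)               # 2. Fit directly on the noisy labels.
    predicted_test_labels = lnl.predict(X_test)              # 3. Predict as if trained on clean labels.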

See `cleanlab/examples <https://github.com/cgnorthcutt/cleanlab/tree/master/examples>`__ for more examples.

``cleanlab``: Find Label Errors in ImageNet
-------------------------------------------

Use ``cleanlab`` to identify ~100,000 label errors in the 2012 ImageNet training dataset.

.. figure:: https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/imagenet_train_label_errors_32.jpg
:align: center

Top label issues in the 2012 ILSVRC ImageNet train set identified using ``cleanlab``.

``cleanlab``: Find Label Errors in MNIST
----------------------------------------

Use ``cleanlab`` to identify ~50 label errors in the MNIST dataset.

.. figure:: https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/mnist_training_label_errors24_prune_by_noise_rate.png
:align: center
:alt: Image depicting label errors in MNIST train set

Label errors of the original MNIST **train** dataset identified algorithmically using cleanlab. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence (probability of belonging to the given label), denoted conf in teal. The label with the largest predicted probability is in green. Overt errors are in red.
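
For intuition, the self-confidence used for this ordering is simply the model's predicted probability of each example's given label. A minimal numpy sketch (the array names here are assumptions):

.. code:: python

    import numpy as np

    # psx: (n, m) out-of-sample predicted probabilities; s: the given (noisy) integer labels.
    self_confidence = psx[np.arange(len(s)), s]
    ranked = np.argsort(self_confidence)  # least self-confident examples first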


``cleanlab`` Generality: View performance across 4 distributions and 9 classifiers.
-----------------------------------------------------------------------------------

Use ``cleanlab`` to learn with noisy labels regardless of dataset distribution or classifier.

.. figure:: https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/demo_cleanlab_across_datasets_and_classifiers.png
:align: center
:alt: Image depicting generality of cleanlab across datasets and classifiers

Each sub-figure in the figure above depicts the decision boundary learned using ``cleanlab.classification.LearningWithNoisyLabels`` in the presence of extreme (\~35%) label errors. Label errors are circled in green. Label noise is class-conditional (not simply uniformly random). Columns are organized by the classifier used, except the left-most column which depicts the ground-truth dataset distribution. Rows are organized by dataset used.

The code to reproduce this figure is available `here <https://github.com/cgnorthcutt/cleanlab/blob/master/examples/classifier_comparison.ipynb>`__.

Each figure depicts accuracy scores on a test set as decimal values:

s=2    0.12 0.04 0.64 0.38
s=3    0.11 0.08 0.05 0.54
====== ==== ==== ==== ====



Get started with easy, quick examples.
======================================

Documentation by Example
========================

Many of these methods have default parameters that won’t be covered
here. Check out the method docstrings for full documentation.
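
For example, to read a full docstring interactively:

.. code:: python

    from cleanlab.classification import LearningWithNoisyLabels
    help(LearningWithNoisyLabels.fit)  # Prints the full documentation, including default parameters.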

Multiclass learning with noisy labels (in **3** lines of code):
---------------------------------------------------------------

**rankpruning** is a fast, general, robust algorithm for multiclass
learning with noisy labels. It adds minimal overhead, needing only
*O(nm²)* time for *n* training examples and *m* classes, works with any
classifier, and is easy to use. Here is the example from above, with
added comments for clarity.

.. code:: python

    # LearningWithNoisyLabels implements a faster,
    # cross-platform and more-compatible version of the RankPruning
    # algorithm for learning with noisy labels. Unlike the original
    # algorithm which only worked for binary classification,
    # LearningWithNoisyLabels generalizes the theory and algorithms
    # of RankPruning for any number of classes.
    from cleanlab.classification import LearningWithNoisyLabels
    # LearningWithNoisyLabels uses logreg by default, so this is unnecessary.
    # We include it here for clarity, but this step is omitted below.
    from sklearn.linear_model import LogisticRegression as logreg

    # 1.
    # Wrap around any classifier. Yup, neural networks work, too.
    lnl = LearningWithNoisyLabels(clf=logreg())

    # 2.
    # X_train is a numpy matrix of training examples (integers for large data).
    # train_labels_with_errors is a numpy array of labels of length n (# of examples), usually denoted 's'.
    lnl.fit(X_train, train_labels_with_errors)

    # 3.
    # Estimate the predictions you would have gotten by training with *no* label errors.
    predicted_test_labels = lnl.predict(X_test)

Estimate the confident joint, the latent noisy channel matrix, *P(s \| y)* and inverse, *P(y \| s)*, the latent prior of the unobserved, actual true labels, *p(y)*, and the predicted probabilities.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Option 3: Skip computing the predicted probabilities if you already have them.
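
The body of this example is collapsed above; a minimal sketch, assuming ``cleanlab.latent_estimation`` exposes ``estimate_py_and_noise_matrices_from_probabilities`` (the function name and return order here are assumptions; check the docstrings):

.. code:: python

    from cleanlab.latent_estimation import estimate_py_and_noise_matrices_from_probabilities

    # s: noisy labels; psx: pre-computed out-of-sample predicted probabilities.
    est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities(
        s=train_labels_with_errors,
        psx=psx,
    )
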
Estimate label errors in a dataset:
-----------------------------------

With the ``cleanlab`` package, we can instantly fetch the indices of all
estimated label errors, with nothing provided by the user except a
classifier, examples, and their noisy labels. Like the previous example,
there are various levels of granularity.

.. code:: python

    from cleanlab.pruning import get_noise_indices

    # We computed psx, est_inv, confident_joint in the previous example.
    label_errors = get_noise_indices(
        s=train_labels_with_errors, # required
        psx=psx, # required
        inverse_noise_matrix=est_inv, # not required, include to avoid recomputing
        confident_joint=confident_joint, # not required, include to avoid recomputing
    )
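
If you only have a classifier and raw data (no pre-computed ``psx``), a hedged sketch, assuming ``cleanlab.latent_estimation`` provides ``estimate_cv_predicted_probabilities`` for cross-validated out-of-sample probabilities (an assumption; check the docstrings):

.. code:: python

    from cleanlab.latent_estimation import estimate_cv_predicted_probabilities
    from cleanlab.pruning import get_noise_indices
    from sklearn.linear_model import LogisticRegression

    # Out-of-sample predicted probabilities via cross-validation (assumed helper).
    psx = estimate_cv_predicted_probabilities(X_train, train_labels_with_errors, clf=LogisticRegression())
    label_errors = get_noise_indices(s=train_labels_with_errors, psx=psx)
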
Completely characterize label noise in a dataset:
-------------------------------------------------

The joint probability distribution of noisy and true labels, *P(s,y)*, completely characterizes label noise with a class-conditional *m x m* matrix. Multiplied by the total number of examples, *n*, it captures the number of pairwise label flip errors. Using ``cleanlab.latent_estimation.calibrate_confident_joint``, this estimate guarantees that the rows of *P(s,y)* correctly sum to *p(s)* and that ``np.sum(confident_joint) == n`` (the number of labels). This calibration is used when the hyperparameter ``prune_count_method='inverse_nm_dot_s'`` is set in ``LearningWithNoisyLabels.fit()`` and ``get_noise_indices()``.

.. code:: python

    from cleanlab.latent_estimation import compute_confident_joint
    joint = compute_confident_joint(s=noisy_labels, psx=probabilities)

You can also estimate the complete joint distribution of label noise directly from the noisy labels and predicted probabilities:

.. code:: python

    from cleanlab.latent_estimation import estimate_joint
    joint = estimate_joint(
        s=noisy_labels,
        psx=probabilities,
        confident_joint=None, # Provide if you have it already
    )

If you've already computed the confident joint ``cj``, pass it in to avoid recomputation:

.. code:: python

    from cleanlab.latent_estimation import estimate_joint
    joint = estimate_joint(confident_joint=cj, s=noisy_labels)

Methods to Standardize Research with Noisy Labels
-------------------------------------------------

Generate valid, class-conditional, uniformly random noisy channel matrices:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``cleanlab`` supports a number of functions to generate noise for benchmarking and standardization in research. This next example shows how to generate valid, class-conditional, uniformly random noisy channel matrices:

.. code:: python

    # Generate a valid (necessary conditions for learnability are met) noise matrix for any trace > 1
    from cleanlab.noise_generation import generate_noise_matrix_from_trace
    noise_matrix = generate_noise_matrix_from_trace(
        K=number_of_classes,
        trace=float_value_greater_than_1_and_leq_K,
        py=prior_of_y_actual_labels_which_is_just_an_array_of_length_K,
        frac_zero_noise_rates=float_from_0_to_1_controlling_sparsity,
    )

    # Check if a noise matrix is valid (necessary conditions for learnability are met)
    from cleanlab.noise_generation import noise_matrix_is_valid
    is_valid = noise_matrix_is_valid(noise_matrix, prior_of_y_which_is_just_an_array_of_length_K)

Support for numerous *weak supervision* and *learning with noisy labels* functionalities:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a given noise matrix, this example shows how to generate noisy labels. Methods can be seeded for reproducibility.

.. code:: python

    # Generate noisy labels using the noise_matrix. Guarantees exact amount of noise in labels.
    from cleanlab.noise_generation import generate_noisy_labels
    s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)

    # This package is full of other useful methods for learning with noisy labels.
    # The tutorial stops here, but you don't have to. Inspect method docstrings for full docs.
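
To make a run reproducible, one simple sketch is to seed numpy's global random number generator before generating noisy labels (this assumes the generators draw from numpy's global RNG; check the docstrings for a dedicated ``seed`` argument):

.. code:: python

    import numpy as np
    from cleanlab.noise_generation import generate_noisy_labels

    # Assumption: generate_noisy_labels draws from numpy's global RNG.
    np.random.seed(0)
    s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)
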
The Polyplex
------------
