Merge branch 'master' of github.com:cgnorthcutt/cleanlab
cgnorthcutt committed Nov 26, 2019
2 parents 67e6a87 + c8dd81e commit f8f131f
Showing 2 changed files with 57 additions and 106 deletions.
8 changes: 8 additions & 0 deletions CONTRIBUTING.md
Contributions to cleanlab are welcome. When contributing, please keep the following in mind:

1. Avoid adding blank lines or spaces to the ends of lines.
2. Follow PEP-8 coding style.
3. Always add tests to check the code you added. Tests go in `cleanlab/tests/`.
4. Submit a pull request (it will automatically run your code and make sure nothing breaks).

Thanks!
155 changes: 49 additions & 106 deletions README.rst

|
``cleanlab`` is a machine learning Python package for **learning with noisy labels** and **finding label errors in datasets**. ``cleanlab`` CLEANs LABels. It is powered by the theory of **confident learning**, published in `this paper <https://arxiv.org/abs/1911.00068>`__ and explained in `this blog <https://l7.curtisnorthcutt.com/confident-learning>`__. Using the `confidentlearning-reproduce <https://github.com/cgnorthcutt/confidentlearning-reproduce>`__ repo, ``cleanlab`` v0.1.0 reproduces results in `the CL paper <https://arxiv.org/abs/1911.00068>`__.


|pypi| |py_versions| |build_status| |coverage|
.. |coverage| image:: https://codecov.io/gh/cgnorthcutt/cleanlab/branch/master/graph/badge.svg
:target: https://codecov.io/gh/cgnorthcutt/cleanlab

``cleanlab`` **documentation is available in** `this blog post <https://l7.curtisnorthcutt.com/cleanlab-python-package>`__.

So fresh, so ``cleanlab``
=========================

``cleanlab`` finds and cleans label errors in any dataset using `state-of-the-art algorithms <https://arxiv.org/abs/1911.00068>`__ to find label errors, characterize noise, and learn in spite of it. ``cleanlab`` is fast: it's built on optimized algorithms and parallelized across CPU threads automatically. ``cleanlab`` is powered by `provable guarantees <https://arxiv.org/abs/1911.00068>`__ of exact noise estimation and label error finding in realistic cases when model output probabilities are erroneous. ``cleanlab`` supports multi-label, multiclass, sparse matrices, etc. By default, ``cleanlab`` requires no hyper-parameters.


**How does confident learning work?** See: `TUTORIAL: confident learning with just numpy and for-loops <https://github.com/cgnorthcutt/cleanlab/blob/master/examples/simplifying_confident_learning_tutorial.ipynb>`__.

``cleanlab`` supports multi-label, multiclass, sparse matrices, and more.

``cleanlab`` is:

4. unique - The only package for multiclass learning with noisy labels or finding label errors for any dataset / classifier.


Find label errors with PyTorch, Tensorflow, MXNet, etc. in 1 line of code.
==========================================================================

.. code:: python

    # psx is the n x m matrix of out-of-sample predicted probabilities, computed with any model.
    from cleanlab.pruning import get_noise_indices

    ordered_label_errors = get_noise_indices(
        s=train_labels_with_errors,               # numpy array of noisy labels
        psx=psx,                                  # out-of-sample predicted probabilities
        sorted_index_method='normalized_margin', # Orders label errors
    )

Pre-computed out-of-sample predicted probabilities for CIFAR-10 train set are available here: [`LINK <https://github.com/cgnorthcutt/confidentlearning-reproduce/blob/master/README.md#need-out-of-sample-predicted-probabilities-for-cifar-10-train-set>`__].
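
If you download those probabilities, a minimal loading sketch, assuming they ship as a numpy ``.npy`` array (the filename below is hypothetical; check the linked README for the actual file names and format):

.. code:: python

    import numpy as np

    # Hypothetical filename; the linked repo documents the actual artifact names.
    psx = np.load('cifar10_train_psx.npy')  # shape: (n_examples, n_classes)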

Learning with noisy labels in 3 lines of code!
==============================================
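
The full example is shown later in this README; a minimal sketch of the three-line pattern, assuming numpy arrays ``X_train``, ``train_labels_with_errors``, and ``X_test``, with ``LogisticRegression`` standing in for any scikit-learn compatible classifier:

.. code:: python

    from cleanlab.classification import LearningWithNoisyLabels
    from sklearn.linear_model import LogisticRegression

    lnl = LearningWithNoisyLabels(clf=LogisticRegression())  # 1. Wrap any classifier.
    lnl.fit(X_train, train_labels_with_errors)               # 2. Fit directly on the noisy labels.
    predicted_test_labels = lnl.predict(X_test)              # 3. Predict as if trained on clean labels.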

See `cleanlab/examples <https://github.com/cgnorthcutt/cleanlab/tree/master/examples>`__ for more examples.

``cleanlab``: Find Label Errors in ImageNet
-------------------------------------------

Use ``cleanlab`` to identify ~100,000 label errors in the 2012 ImageNet training dataset.

.. figure:: https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/imagenet_train_label_errors_32.jpg
:align: center

Top label issues in the 2012 ILSVRC ImageNet train set identified using ``cleanlab``.

``cleanlab``: Find Label Errors in MNIST
----------------------------------------

Use ``cleanlab`` to identify ~50 label errors in the MNIST dataset.

.. figure:: https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/mnist_training_label_errors24_prune_by_noise_rate.png
:align: center
:alt: Image depicting label errors in MNIST train set

Label errors of the original MNIST **train** dataset identified algorithmically using cleanlab. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence (probability of belonging to the given label), denoted conf in teal. The label with the largest predicted probability is in green. Overt errors are in red.
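
For intuition, the self-confidence used for this ordering is simply the model's predicted probability of each example's given label. A minimal numpy sketch (the array names here are assumptions):

.. code:: python

    import numpy as np

    # psx: (n, m) out-of-sample predicted probabilities; s: the given (noisy) integer labels.
    self_confidence = psx[np.arange(len(s)), s]
    ranked = np.argsort(self_confidence)  # least self-confident examples first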


``cleanlab`` Generality: View performance across 4 distributions and 9 classifiers.
-----------------------------------------------------------------------------------

Use ``cleanlab`` to learn with noisy labels regardless of dataset distribution or classifier.

.. figure:: https://raw.githubusercontent.com/cgnorthcutt/cleanlab/master/img/demo_cleanlab_across_datasets_and_classifiers.png
:align: center
:alt: Image depicting generality of cleanlab across datasets and classifiers

Each sub-figure in the figure above depicts the decision boundary learned using ``cleanlab.classification.LearningWithNoisyLabels`` in the presence of extreme (\~35%) label errors. Label errors are circled in green. Label noise is class-conditional (not simply uniformly random). Columns are organized by the classifier used, except the left-most column which depicts the ground-truth dataset distribution. Rows are organized by dataset used.

The code to reproduce this figure is available `here <https://github.com/cgnorthcutt/cleanlab/blob/master/examples/classifier_comparison.ipynb>`__.

Each figure depicts accuracy scores on a test set as decimal values:

s=2    0.12 0.04 0.64 0.38
s=3    0.11 0.08 0.05 0.54
====== ==== ==== ==== ====



Get started with easy, quick examples.
======================================

Documentation by Example
========================

Many of these methods have default parameters that won’t be covered
here. Check out the method docstrings for full documentation.
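
For example, to read a full docstring interactively:

.. code:: python

    from cleanlab.classification import LearningWithNoisyLabels
    help(LearningWithNoisyLabels.fit)  # Prints the full documentation, including default parameters.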

Multiclass learning with noisy labels (in **3** lines of code):
---------------------------------------------------------------

**rankpruning** is a fast, general, robust algorithm for multiclass
learning with noisy labels. It adds minimal overhead, needing only
*O(nm²)* time for *n* training examples and *m* classes, works with any
classifier, and is easy to use. Here is the example from above, with
added comments for clarity.

.. code:: python

    # LearningWithNoisyLabels implements a faster,
    # cross-platform and more-compatible version of the RankPruning
    # algorithm for learning with noisy labels. Unlike the original
    # algorithm which only worked for binary classification,
    # LearningWithNoisyLabels generalizes the theory and algorithms
    # of RankPruning for any number of classes.
    from cleanlab.classification import LearningWithNoisyLabels
    # LearningWithNoisyLabels uses logreg by default, so this is unnecessary.
    # We include it here for clarity, but this step is omitted below.
    from sklearn.linear_model import LogisticRegression as logreg

    # 1.
    # Wrap around any classifier. Yup, neural networks work, too.
    lnl = LearningWithNoisyLabels(clf=logreg())

    # 2.
    # X_train is a numpy matrix of training examples (integers for large data).
    # train_labels_with_errors is a numpy array of labels of length n (# of examples), usually denoted 's'.
    lnl.fit(X_train, train_labels_with_errors)

    # 3.
    # Estimate the predictions you would have gotten by training with *no* label errors.
    predicted_test_labels = lnl.predict(X_test)

Estimate the confident joint, the latent noisy channel matrix, *P(s \| y)* and inverse, *P(y \| s)*, the latent prior of the unobserved, actual true labels, *p(y)*, and the predicted probabilities.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Option 3: Skip computing the predicted probabilities if you already have them.
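
The body of this example is collapsed above; a minimal sketch, assuming ``cleanlab.latent_estimation`` exposes ``estimate_py_and_noise_matrices_from_probabilities`` (the function name and return order here are assumptions; check the docstrings):

.. code:: python

    from cleanlab.latent_estimation import estimate_py_and_noise_matrices_from_probabilities

    # s: noisy labels; psx: pre-computed out-of-sample predicted probabilities.
    est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities(
        s=train_labels_with_errors,
        psx=psx,
    )
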
Estimate label errors in a dataset:
-----------------------------------

With the ``cleanlab`` package, we can instantly fetch the indices of all
estimated label errors, with nothing provided by the user except a
classifier, examples, and their noisy labels. Like the previous example,
there are various levels of granularity.

.. code:: python

    from cleanlab.pruning import get_noise_indices

    # We computed psx, est_inv, confident_joint in the previous example.
    label_errors = get_noise_indices(
        s=train_labels_with_errors, # required
        psx=psx, # required
        inverse_noise_matrix=est_inv, # not required, include to avoid recomputing
        confident_joint=confident_joint, # not required, include to avoid recomputing
    )
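
If you only have a classifier and raw data (no pre-computed ``psx``), a hedged sketch, assuming ``cleanlab.latent_estimation`` provides ``estimate_cv_predicted_probabilities`` for cross-validated out-of-sample probabilities (an assumption; check the docstrings):

.. code:: python

    from cleanlab.latent_estimation import estimate_cv_predicted_probabilities
    from cleanlab.pruning import get_noise_indices
    from sklearn.linear_model import LogisticRegression

    # Out-of-sample predicted probabilities via cross-validation (assumed helper).
    psx = estimate_cv_predicted_probabilities(X_train, train_labels_with_errors, clf=LogisticRegression())
    label_errors = get_noise_indices(s=train_labels_with_errors, psx=psx)
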
Completely characterize label noise in a dataset:
-------------------------------------------------

The joint probability distribution of noisy and true labels, *P(s,y)*, completely characterizes label noise with a class-conditional *m x m* matrix. Multiplied by the total number of examples, *n*, it captures the number of pairwise label flip errors. Using ``cleanlab.latent_estimation.calibrate_confident_joint``, this estimate guarantees that the rows of *P(s,y)* correctly sum to *p(s)* and that ``np.sum(confident_joint) == n`` (the number of labels). This calibration is used when the hyperparameter ``prune_count_method='inverse_nm_dot_s'`` is set in ``LearningWithNoisyLabels.fit()`` and ``get_noise_indices()``.

.. code:: python

    from cleanlab.latent_estimation import compute_confident_joint
    joint = compute_confident_joint(s=noisy_labels, psx=probabilities)

You can also estimate the complete joint distribution of label noise directly from the noisy labels and predicted probabilities:

.. code:: python

    from cleanlab.latent_estimation import estimate_joint
    joint = estimate_joint(
        s=noisy_labels,
        psx=probabilities,
        confident_joint=None, # Provide if you have it already
    )

If you've already computed the confident joint ``cj``, pass it in to avoid recomputation:

.. code:: python

    from cleanlab.latent_estimation import estimate_joint
    joint = estimate_joint(confident_joint=cj, s=noisy_labels)

Methods to Standardize Research with Noisy Labels
-------------------------------------------------

Generate valid, class-conditional, uniformly random noisy channel matrices:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``cleanlab`` supports a number of functions to generate noise for benchmarking and standardization in research. This next example shows how to generate valid, class-conditional, uniformly random noisy channel matrices:

.. code:: python

    # Generate a valid (necessary conditions for learnability are met) noise matrix for any trace > 1
    from cleanlab.noise_generation import generate_noise_matrix_from_trace
    noise_matrix = generate_noise_matrix_from_trace(
        K=number_of_classes,
        trace=float_value_greater_than_1_and_leq_K,
        py=prior_of_y_actual_labels_which_is_just_an_array_of_length_K,
        frac_zero_noise_rates=float_from_0_to_1_controlling_sparsity,
    )

    # Check if a noise matrix is valid (necessary conditions for learnability are met)
    from cleanlab.noise_generation import noise_matrix_is_valid
    is_valid = noise_matrix_is_valid(noise_matrix, prior_of_y_which_is_just_an_array_of_length_K)

Support for numerous *weak supervision* and *learning with noisy labels* functionalities:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a given noise matrix, this example shows how to generate noisy labels. Methods can be seeded for reproducibility.

.. code:: python

    # Generate noisy labels using the noise_matrix. Guarantees exact amount of noise in labels.
    from cleanlab.noise_generation import generate_noisy_labels
    s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)

    # This package is full of other useful methods for learning with noisy labels.
    # The tutorial stops here, but you don't have to. Inspect method docstrings for full docs.
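
To make a run reproducible, one simple sketch is to seed numpy's global random number generator before generating noisy labels (this assumes the generators draw from numpy's global RNG; check the docstrings for a dedicated ``seed`` argument):

.. code:: python

    import numpy as np
    from cleanlab.noise_generation import generate_noisy_labels

    # Assumption: generate_noisy_labels draws from numpy's global RNG.
    np.random.seed(0)
    s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)
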
The Polyplex
------------
