Cleanlab — Repo Review

Soner Can KALKAN
Jan 14, 2023


Working with real-life data is always challenging. Checking data quality, cleaning the data, and understanding unexpected patterns are key concepts in data science. With the right tools, it becomes much easier to gain insight from data. So, the story of Cleanlab starts here . . .

Cleanlab

Cleanlab is a Python library for machine learning with noisy labels: it provides algorithms and utilities for training models when labels are noisy or unreliable. It was developed by researchers at MIT and is available on GitHub[1]. The library can be used to train models under a variety of noise types, and it provides methods for estimating the amount of label noise in a dataset, as well as for cleaning the labels to reduce the impact of that noise on model performance. It is designed for image and text classification tasks and is compatible with popular deep learning libraries such as PyTorch and TensorFlow.

Behind the Scenes

The Cleanlab library uses a method called “confident learning” to train models with noisy labels. Confident learning is a general framework for learning with noisy labels that is based on the idea of identifying and correcting label errors in a dataset.

Top 32 (ordered automatically by normalized margin) identified label issues in the 2012 ILSVRC ImageNet train set using CL: PBNR. Errors are boxed in red. Ontological issues are boxed in green. Multi-label images are boxed in blue[2].

The basic idea behind confident learning is to estimate the noise rate in the dataset and to use that estimate to reweight the training examples, reducing the impact of the noise on the model’s performance. The noise rate is estimated by comparing the out-of-sample predictions of a model trained on the noisy data against the given labels.
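To make this concrete, here is a minimal numpy sketch of the core counting step (the “confident joint” from the paper[2]). This is a simplified illustration of the idea, not Cleanlab’s actual implementation, which adds calibration and handles various edge cases.

import numpy as np

def confident_joint_sketch(labels, pred_probs):
    # labels: (n,) noisy integer labels; pred_probs: (n, k) out-of-sample
    # predicted probabilities. Returns a k x k count matrix whose entry
    # (i, j) counts examples labeled i that the model confidently
    # believes belong to class j.
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average self-confidence of examples labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    joint = np.zeros((n_classes, n_classes), dtype=int)
    for i, given in enumerate(labels):
        above = np.where(pred_probs[i] >= thresholds)[0]
        if len(above) > 0:
            # Count the example under its most confident qualifying class.
            j = above[np.argmax(pred_probs[i, above])]
            joint[given, j] += 1
    return joint  # off-diagonal mass points to likely label errors

Off-diagonal entries of this matrix estimate how many examples were given one label but confidently predicted as another, which is exactly the signal used to flag label issues.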

The Cleanlab library also includes a number of other algorithms, such as pruning and joint optimization, that can be used to further improve the model’s performance. Pruning identifies and removes the most likely label errors from a dataset (see the sketch just below), while joint optimization is a more advanced approach that combines confident learning with pruning to achieve even better results.
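In cleanlab 2.x, pruning is exposed through find_label_issues, whose filter_by parameter selects the pruning strategy. The sketch below uses random synthetic inputs for brevity; in practice pred_probs should be out-of-sample predicted probabilities from your own model (e.g. via cross-validation).

import numpy as np
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)             # noisy labels (synthetic here)
pred_probs = rng.dirichlet([1.0, 1.0], size=n)  # stand-in for real model probabilities

# filter_by chooses the pruning strategy, e.g. "prune_by_noise_rate"
# or "prune_by_class" (see the cleanlab documentation for the full list).
issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    filter_by="prune_by_noise_rate",
)
print(f"{issues.sum()} examples flagged as likely label errors out of {n}")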

The library also makes use of other techniques, such as variational inference and expectation maximization, to estimate the noise rate in the dataset. All of these concepts are topics for another day.

Code Example

import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning  # named LearningWithNoisyLabels in cleanlab 1.x

# Generate some synthetic data
np.random.seed(0)
X = np.random.randn(100, 10)
y = np.random.randint(0, 2, size=(100,))

# Add label noise by flipping roughly 10% of the labels
y_noisy = y.copy()
flip = np.random.rand(len(y)) < 0.1
y_noisy[flip] = 1 - y_noisy[flip]

# Create a noisy labels classifier
clf = CleanLearning(clf=LogisticRegression(), seed=0)

# Fit the classifier to the noisy data
clf.fit(X, y_noisy)

# Predict the labels
y_pred = clf.predict(X)

# Print the accuracy of the model
print("Accuracy:", (y_pred == y).mean())

This example uses the CleanLearning class (formerly LearningWithNoisyLabels) from the Cleanlab library to train a logistic regression classifier on synthetic data that has had some noise added to the labels, then computes the accuracy of the model against the clean labels.

This is a simple example, but keep in mind that the library provides many more advanced capabilities, such as estimating the amount of label noise in the dataset and making use of that estimate to improve model accuracy.
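For instance, assuming cleanlab 2.x, the count module can estimate the joint distribution of noisy and true labels directly from the labels and out-of-sample predicted probabilities. A self-contained sketch with synthetic data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.count import estimate_joint

np.random.seed(0)
X = np.random.randn(100, 10)
y_noisy = np.random.randint(0, 2, size=100)

# Out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(), X, y_noisy, cv=5, method="predict_proba"
)

# Estimated joint distribution P(noisy label, true label);
# the off-diagonal entries estimate the noise rates.
joint = estimate_joint(labels=y_noisy, pred_probs=pred_probs)
print(joint)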

Pros

  • Confident learning: The library uses a method called “confident learning” to train models with noisy labels, which is a general framework for learning with noisy labels that is based on the idea of identifying and correcting label errors in a dataset.
  • Pruning and joint optimization: The library includes additional algorithms, such as pruning and joint optimization, that can be used to further improve the model’s performance.
  • Variational inference and expectation maximization: The library also makes use of other techniques, such as variational inference and expectation maximization, to estimate the noise rate in the dataset, which can be useful for datasets with high noise rates.
  • Compatibility: The library is compatible with popular deep learning libraries such as PyTorch and TensorFlow, and it can be used for both image and text classification tasks.
  • Open-Source: The library is open-source and available on Github, which makes it easy to access and use.

Cons

  • Computational expense: The library can be computationally expensive, especially when working with large datasets or when using the more advanced algorithms such as joint optimization.
  • Noise rate estimation: The library relies on noise rate estimation, which might not be accurate for every dataset and task, and this can affect the performance of the model.
  • Limited to classification tasks: The library is mainly designed for classification tasks, so it may not be suitable for other types of machine learning problems.

Alternatives

There are several alternative libraries and methods for training models with noisy labels:

  • MentorNet: This is a library that uses a curriculum learning approach to train models with noisy labels. It uses a teacher-student framework where a model trained on the noisy data is used to guide the training of a student model on the cleaned data.
  • Decoupling: This is a method that uses a two-stage training process to train models with noisy labels. The first stage trains a model on the noisy data, and the second stage fine-tunes the model on a cleaned version of the data.
  • Co-teaching: This is a method that trains two models simultaneously on the noisy data; each model selects its small-loss (likely clean) examples to teach the other, which helps both avoid fitting label errors.
  • Bootstrapping: This is a method that uses a form of self-training to handle noisy labels: the model’s own predictions are blended with the given labels to form the training targets, so confident predictions can gradually override suspect labels (a minimal sketch of this loss follows the list below).
  • Loss correction: This is a method that uses a combination of techniques such as sample re-weighting, label smoothing and knowledge distillation to train models with noisy labels.

These are just a few examples of alternative libraries and methods for training models with noisy labels. Each method has its own strengths and weaknesses, and the best approach will depend on the dataset you are working with.
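To give a flavor of the bootstrapping approach mentioned above, here is a minimal numpy sketch of the soft bootstrapping loss from Reed et al. (2014); the beta parameter controls how much the given labels are trusted over the model’s own predictions. This is an illustrative sketch, not code from any of the libraries above.

import numpy as np

def soft_bootstrap_loss(pred_probs, noisy_onehot, beta=0.95):
    # Blend the (possibly wrong) given label with the model's own
    # prediction to form the target, then take the cross-entropy.
    target = beta * noisy_onehot + (1 - beta) * pred_probs
    return -(target * np.log(pred_probs + 1e-12)).sum(axis=1).mean()

# Tiny usage example with two classes; the second label may be wrong.
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
noisy_onehot = np.array([[1.0, 0.0], [1.0, 0.0]])
print(soft_bootstrap_loss(pred_probs, noisy_onehot))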

Conclusion

Besides the open-source library, Cleanlab is backed by a commercial product called Cleanlab Studio. This makes the repo more reliable and sustainable. Additionally, it is easy to use and integrates with TensorFlow and PyTorch. Seen from this perspective, it can clearly be said that this repo deserves a place in the data science toolbox.

References

1 — https://github.com/cleanlab/cleanlab

2 — https://arxiv.org/abs/1911.00068
