GSoC Blog | Activeloop | Week 3

Published: 07/17/2022


This week, I've been mainly working on making Hub datasets compatible with cleanlab. I implemented three tools to benchmark how cleanlab would work on the same dataset fetched from different sources.

Hub Dataset + Dataloader + Skorch

The first tool runs cleanlab on a dataset in Hub format and accepts a custom Hub dataloader directly. Because cleanlab's features rely on scikit-learn compatibility, I wrap the PyTorch neural net with skorch, which makes it scikit-learn-compatible. To make Hub datasets work with skorch, however, I had to override several methods of the generic NeuralNet class: get_dataset, get_iterator, train_step_single, evaluation_step and validation_step.

PyTorch Dataset + PyTorch DataLoader + Skorch

The second tool fetches the same data from torchvision.datasets. This time I didn't need to override any skorch NeuralNet methods, since skorch supports standard PyTorch Datasets and DataLoaders by default. This step was mainly a sanity check: it confirms that I'm handling the Hub dataset format properly and that the training metrics match those obtained with the Hub dataset format.

Computing Out-of-sample Probabilities with Cross-Validation

The third tool also works with Hub datasets, but it doesn't use skorch. Instead, it computes out-of-sample predicted probabilities for data in Hub format using cross-validation. Since skorch doesn't provide the cross-validation functionality that cleanlab requires, this week I focused on implementing cross-validation from scratch.
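The idea behind out-of-sample probabilities can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the Hub implementation: a nearest-centroid model replaces the neural net, the data is synthetic, the folds are not stratified, and all function names are hypothetical. Each sample's probabilities come from a model that never saw that sample during training.

```python
import numpy as np


def kfold_indices(n_samples, n_splits, seed=0):
    """Shuffle the sample indices and cut them into n_splits folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_splits)


def fit_centroids(X, y, n_classes):
    """Toy stand-in for training: one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def out_of_sample_probs(X, y, n_classes, n_splits=5, seed=0):
    """Predict every sample with a model trained on the other folds."""
    pred_probs = np.zeros((len(X), n_classes))
    for held_out in kfold_indices(len(X), n_splits, seed):
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[held_out] = False  # exclude the held-out fold
        model = fit_centroids(X[train_mask], y[train_mask], n_classes)
        pred_probs[held_out] = predict_proba(model, X[held_out])
    return pred_probs


# Synthetic data: three well-separated 2-D blobs with 30 samples each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
labels = np.repeat(np.arange(3), 30)
features = centers[labels] + rng.normal(size=(90, 2))

pred_probs = out_of_sample_probs(features, labels, n_classes=3)
```

An array of this shape (one row per sample, one column per class) is what cleanlab expects alongside the given labels, for example in cleanlab.filter.find_label_issues(labels, pred_probs).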