This week, I've been mainly working on making Hub datasets compatible with cleanlab. I implemented three tools to benchmark how cleanlab would work on the same dataset fetched from different sources.
Hub Dataset + Dataloader + Skorch
The first tool runs cleanlab on the Hub dataset format and accepts a custom Hub Dataloader directly. Since cleanlab's features leverage scikit-learn compatibility, I wrap the PyTorch neural net with skorch, which makes it scikit-learn-compatible. However, I had to override a few methods, such as get_dataset and validation_step, in the generic NeuralNet class to make Hub datasets work with skorch.
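The override pattern can be illustrated without pulling in skorch itself. Below is a minimal, library-free sketch: `GenericNet` stands in for skorch's `NeuralNet`, and the subclass overrides `get_dataset` so a dict-style, Hub-like dataset is accepted. All class and attribute names here are hypothetical stand-ins, not the actual implementation.

```python
# Minimal sketch of the method-override pattern, assuming a dict-style
# "Hub-like" dataset layout. In the real code the base class is skorch's
# NeuralNet and the overridden methods are get_dataset and validation_step.

class GenericNet:
    """Stand-in for a generic trainer that expects plain (X, y) arrays."""

    def get_dataset(self, X, y=None):
        # Default behaviour: pair up plain arrays of inputs and labels.
        return list(zip(X, y))

    def fit(self, X, y=None):
        dataset = self.get_dataset(X, y)
        self.n_seen_ = len(dataset)  # placeholder for an actual training loop
        return self


class HubStyleNet(GenericNet):
    """Override get_dataset so a dict of named columns is accepted."""

    def get_dataset(self, X, y=None):
        # Hypothetical Hub-style layout: labels live inside the dataset dict,
        # so y is ignored and the columns are zipped instead.
        return list(zip(X["images"], X["labels"]))


hub_style_data = {"images": [[0.0], [1.0], [2.0]], "labels": [0, 1, 0]}
net = HubStyleNet().fit(hub_style_data)
```

The point of the sketch is that only the dataset-adapting methods change; the rest of the trainer's interface stays untouched, which is what keeps the wrapped net scikit-learn-compatible for cleanlab.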
PyTorch Dataset + PyTorch Dataloader + Skorch
The second tool fetches the same data from torch.datasets. This time, however, I didn't need to override any skorch NeuralNet methods, since skorch supports standard PyTorch Datasets and Dataloaders by default. This step was mainly to ensure that I'm handling the Hub dataset format properly and to verify that the training metrics match those obtained with the Hub dataset format.
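The sanity check above boils down to comparing a metric computed from the two pipelines' predictions on the same examples. Here is a toy sketch of that comparison, assuming both pipelines emit class-probability arrays; the arrays below are placeholders, not real model outputs.

```python
import numpy as np

# Sketch of the metric comparison: accuracies derived from the Hub-pipeline
# and torch-pipeline probability outputs should agree on identical data.

def accuracy_from_probs(probs, labels):
    """Accuracy of the argmax prediction over class probabilities."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

labels = np.array([0, 1, 1, 0])

# Placeholder probabilities standing in for the two pipelines' outputs;
# the torch-pipeline copy gets tiny numerical noise to mimic a separate run.
probs_hub = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
probs_torch = probs_hub + np.random.default_rng(0).normal(0.0, 1e-6, probs_hub.shape)

acc_hub = accuracy_from_probs(probs_hub, labels)
acc_torch = accuracy_from_probs(probs_torch, labels)
```

If the Hub dataset format is handled correctly, the two accuracies should match up to training noise.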
Computing Out-of-sample Probabilities with Cross Validation
The third tool also works with Hub datasets, but it doesn't use skorch. Instead, it computes out-of-sample probabilities for the Hub dataset format using cross-validation. Since skorch doesn't include the cross-validation functionality that cleanlab requires, I focused this week on implementing cross-validation from scratch.
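The core of from-scratch cross-validation is the folding logic: every example's probability must come from a model that never saw that example during training. The sketch below shows that logic with a toy nearest-centroid classifier standing in for the PyTorch net; all function names are illustrative, not the actual implementation.

```python
import numpy as np

# Sketch of K-fold cross-validation from scratch for out-of-sample
# probabilities. A toy nearest-centroid model with softmax scores stands
# in for the real neural net; the fold bookkeeping is the part that matters.

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_proba_centroid(train_X, train_y, test_X, n_classes):
    """Toy stand-in model: softmax over negative distances to class centroids."""
    centroids = np.stack(
        [train_X[train_y == c].mean(axis=0) for c in range(n_classes)]
    )
    dists = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=2)
    return softmax(-dists)

def cross_val_probs(X, y, n_classes, k=3, seed=0):
    """Out-of-sample probabilities: row i is predicted by the model
    trained on the folds that exclude row i."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    probs = np.empty((len(X), n_classes))
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        probs[test_idx] = predict_proba_centroid(
            X[train_idx], y[train_idx], X[test_idx], n_classes
        )
    return probs

# Two well-separated toy clusters, one per class.
X = np.vstack([
    np.random.default_rng(1).normal(0.0, 0.1, (9, 2)),
    np.random.default_rng(2).normal(3.0, 0.1, (9, 2)),
])
y = np.array([0] * 9 + [1] * 9)
probs = cross_val_probs(X, y, n_classes=2)
```

The resulting array has one probability row per example, each produced out-of-sample, which is exactly the input cleanlab expects for label-issue detection.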