This line of work in collaboration with Mark Collier pursued models and training algorithms which could better handle label noise, by capturing aleatoric uncertainty via heteroscedastic modeling assumptions.
Massively Scaling Heteroscedastic Classifiers
Mark Collier, Rodolphe Jenatton, Basil Mustafa, Neil Houlsby, Jesse Berent, Effrosyni Kokiopoulou
ICLR 2023 · arxiv
Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition heteroscedastic classifiers introduce a critical temperature hyperparameter which must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count when compared to a standard classifier scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter, by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.
Large scale image classification datasets often contain noisy labels. We take a principled probabilistic approach to modelling input-dependent, also known as heteroscedastic, label noise in these datasets. We place a multivariate Normal distributed latent variable on the final hidden layer of a neural network classifier. The covariance matrix of this latent variable, models the aleatoric uncertainty due to label noise. We demonstrate that the learned covariance structure captures known sources of label noise between semantically similar and co-occurring classes. Compared to standard neural network training and other baselines, we show significantly improved accuracy on Imagenet ILSVRC 2012 79.3% (+2.6%), Imagenet-21k 47.0% (+1.1%) and JFT 64.7% (+1.6%). We set a new state-of-the-art result on WebVision 1.0 with 76.6% top-1 accuracy. These datasets range from over 1M to over 300M training examples and from 1k classes to more than 21k classes. Our method is simple to use, and we provide an implementation that is a drop-in replacement for the final fully-connected layer in a deep classifier.
A simple probabilistic method for deep classification under input-dependent label noise
Mark Collier, Basil Mustafa, Efi Kokiopoulou, Rodolphe Jenatton, Jesse Berent
Datasets with noisy labels are a common occurrence in practical applications of classification methods. We propose a simple probabilistic method for training deep classifiers under input-dependent (heteroscedastic) label noise. We assume an underlying heteroscedastic generative process for noisy labels. To make gradient based training feasible we use a temperature parameterized softmax as a smooth approximation to the assumed generative process. We illustrate that the softmax temperature controls a bias-variance trade-off for the approximation. By tuning the softmax temperature, we improve accuracy, log-likelihood and calibration on both image classification benchmarks with controlled label noise as well as Imagenet-21k which has naturally occurring label noise. For image segmentation, our method increases the mean IoU on the PASCAL VOC and Cityscapes datasets by more than 1% over the state-of-the-art model.