Workshop: Shared Visual Representations in Human and Machine Intelligence (SVRHM)

Teacher-generated pseudo human spatial-attention labels boost contrastive learning models

Yushi Yao · Chang Ye · Junfeng He · Gamaleldin Elsayed


Human spatial attention conveys information about the regions of scenes that are important for performing visual tasks. Prior work has shown that the spatial distribution of human attention can be leveraged to benefit various supervised vision tasks. Might providing this weak form of supervision be useful for self-supervised representation learning? One reason why this question has not been previously addressed is that self-supervised models require large datasets, and no large dataset exists with ground-truth human attentional labels. We therefore construct an auxiliary teacher model to predict human attention, trained on a relatively small labeled dataset. This human-attention model allows us to generate pseudo attention labels for ImageNet images. We then train a model with a primary contrastive objective; to this standard configuration, we add a simple output head trained to predict the attentional map for each image. We measure the quality of the learned representations by evaluating classification performance from the frozen learned embeddings. We find that our approach improves the accuracy of contrastive models on ImageNet, and that its attentional map readout aligns better with human attention than that of vanilla contrastive learning models.
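The training objective described above, a contrastive loss plus an auxiliary head that predicts the teacher's attention map, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the contrastive loss, the attention loss, or how the two are weighted. The sketch assumes an InfoNCE-style loss, a cross-entropy between the spatially normalized teacher map and the predicted map, and a scalar weight. The names `info_nce_loss`, `attention_map_loss`, `total_loss`, and `attn_weight` are all hypothetical.

```python
import numpy as np


def _log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def info_nce_loss(z1, z2, temperature=0.1):
    """Assumed contrastive term: InfoNCE between two augmented views.

    z1, z2: (batch, dim) embeddings; row i of z1 and row i of z2 are
    the positive pair, all other rows of z2 act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (batch, batch) cosine similarities
    return -np.mean(np.diag(_log_softmax(logits)))


def attention_map_loss(pred_logits, teacher_map):
    """Assumed auxiliary term: cross-entropy against the pseudo label.

    pred_logits: (batch, H, W) raw outputs of the attention head.
    teacher_map: (batch, H, W) non-negative teacher predictions with a
    positive sum per image; normalized here to a spatial distribution.
    """
    batch = pred_logits.shape[0]
    log_pred = _log_softmax(pred_logits.reshape(batch, -1))
    target = teacher_map.reshape(batch, -1)
    target = target / target.sum(axis=1, keepdims=True)
    return -np.mean((target * log_pred).sum(axis=1))


def total_loss(z1, z2, pred_logits, teacher_map, attn_weight=1.0):
    """Contrastive objective plus weighted attention-prediction loss."""
    contrastive = info_nce_loss(z1, z2)
    attention = attention_map_loss(pred_logits, teacher_map)
    return contrastive + attn_weight * attention
```

In this sketch the teacher model is run once over the unlabeled images to produce `teacher_map`, and only the encoder and the attention head receive gradients. At evaluation time the attention head can be discarded and a linear classifier trained on the frozen embeddings.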
