Workshop: Shared Visual Representations in Human and Machine Intelligence (SVRHM)
Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs
Colin Conwell · Christopher Hamblin
When we experience a visual stimulus as beautiful, how much of that response is the product of ineffable perceptual computations we cannot readily describe versus semantic or conceptual knowledge we can easily translate into natural language? Disentangling perception from language in any experience (especially aesthetics) through behavior or neuroimaging is empirically laborious and prone to debate over the precise definitions of terms. In this work, we attempt to bypass these difficulties by using the learned representations of deep neural network models trained exclusively on vision, exclusively on language, or on a combination of the two, to predict human ratings of beauty for a diverse set of naturalistic images by way of linear decoding. We first show that while the vast majority (~75%) of explainable variance in human beauty ratings can be accounted for by unimodal vision models (e.g. SEER), multimodal models that learn via language alignment (e.g. CLIP) do show meaningful gains (~10%) over their unimodal counterparts (even when controlling for dataset and architecture). We then show, however, that unimodal language models (e.g. GPT2) whose outputs are conditioned directly on visual representations provide no discernible improvement in prediction, and that machine-generated linguistic descriptions of the stimuli explain a far smaller fraction (~40%) of the explainable variance in ratings than vision alone. Taken together, these results showcase a general methodology for disambiguating perceptual and linguistic abstractions in aesthetic judgments using models that computationally separate one from the other.
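The linear decoding step described above can be illustrated with a minimal sketch. The abstract does not specify the regression procedure, so everything here is an assumption made for illustration: closed-form ridge regression, 5-fold cross-validation, synthetic stand-ins for the model features and beauty ratings, and a hypothetical `noise_ceiling_r` used to normalize explained variance (taken here as squared correlation divided by the squared ceiling). None of the function names come from the authors' code.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cross_validated_predictions(X, y, alpha=1.0, n_folds=5, seed=0):
    """Predict each image's rating from a model fit on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        w, b = ridge_fit(X[train_mask], y[train_mask], alpha)
        preds[test_idx] = X[test_idx] @ w + b
    return preds

def fraction_of_explainable_variance(y, preds, noise_ceiling_r):
    """Assumed definition: squared correlation relative to a squared noise ceiling."""
    r = np.corrcoef(y, preds)[0, 1]
    return (r ** 2) / (noise_ceiling_r ** 2)

# Synthetic stand-ins (assumptions): 300 "images", 20-dimensional "model features".
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
w_true = np.linspace(-1.0, 1.0, 20)
y = X @ w_true + rng.normal(size=300)  # simulated mean beauty ratings

preds = cross_validated_predictions(X, y, alpha=1.0)
r = np.corrcoef(y, preds)[0, 1]
```

Under these assumptions, comparing models amounts to swapping in each model's features for `X` (vision-only, language-aligned, or text embeddings of generated captions) and comparing the resulting fractions against the same ceiling.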