Emergence of Text Semantics in CLIP Image Encoders
Sreeram Vennam · Shashwat Singh · Anirudh Govil · Ponnurangam Kumaraguru
Abstract
Certain self-supervised approaches to training image encoders, such as CLIP, align images with their text captions. However, these approaches have no a priori incentive to associate text rendered inside an image with the semantics of that text. Humans, by contrast, process text visually; our work studies the semantics of text rendered in images. We show that the semantic information captured by image representations suffices to decisively classify the sentiment of rendered sentences, is robust to visual attributes such as font, and does not reduce to simple character-frequency associations.
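The probing setup the abstract describes can be sketched as follows: sentences are rendered into images, the images are embedded, and a linear probe is trained to classify sentiment from the embeddings. This is a minimal illustrative sketch, not the paper's code; the helper names are hypothetical, and raw pixels stand in for CLIP image-encoder embeddings so the snippet runs without downloading model weights (the paper's setting would replace `X` with the encoder's output).

```python
# Sketch of the rendered-text probing pipeline (hypothetical helpers;
# raw pixels stand in for CLIP image embeddings in this toy version).
import numpy as np
from PIL import Image, ImageDraw
from sklearn.linear_model import LogisticRegression

def render_text(sentence, size=(224, 224)):
    """Render a sentence as black text on a white canvas, like the
    rendered-text stimuli fed to an image encoder."""
    img = Image.new("L", size, color=255)                   # white canvas
    ImageDraw.Draw(img).text((10, 100), sentence, fill=0)   # black text
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

# Tiny toy dataset of (sentence, sentiment-label) pairs.
sentences = [("I love this movie", 1), ("A wonderful, joyful day", 1),
             ("I hate this movie", 0), ("A terrible, awful day", 0)]
X = np.stack([render_text(s) for s, _ in sentences])
y = np.array([label for _, label in sentences])

# Linear probe on the representations; in the paper's setting X would be
# CLIP image embeddings of the rendered sentences, not raw pixels.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```

Swapping the pixel features for actual CLIP image embeddings is the step that tests whether the encoder, rather than trivial pixel statistics, carries the sentence's sentiment.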