Skip to yearly menu bar Skip to main content

Workshop: Symmetry and Geometry in Neural Representations

On the Information Geometry of Vision Transformers

Sonia Joseph · Kumar Krishna Agrawal · Arna Ghosh · Blake Richards

Abstract: Understanding the structure of high-dimensional representations learned by Vision Transformers (ViTs) provides a pathway toward developing a mechanistic understanding and further improving architecture design. In this work, we leverage tools from informationgeometry to characterize representation quality at a per-token (intra-token) level as well as across pairs of tokens (inter-token) in ViTs pretrained for object classification. In particular, we observe that these high-dimensional tokens exhibit a characteristic spectral decay inthe feature covariance matrix. By measuring the rate of this decay (denoted by $\alpha$) for each token across transformer blocks, we discover an $\alpha$ signature, indicative of a transition from lower to higher effective dimensionality. We also demonstrate that tokens can be clustered based on their $\alpha$ signature, revealing that tokens corresponding to nearby spatial patches of the original image exhibit similar $\alpha$ trajectories. Furthermore, for measuring the complexity at the sequence level, we aggregate the correlation between pairs of tokens independently at each transformer block. A higher average correlation indicates a significant overlap between token representations and lower effective complexity. Notably, we observe a U-shaped trend across the model hierarchy, suggesting that token representations are more expressive in the intermediate blocks. Our findings provide a framework for understanding information processing in ViTs while providing tools to prune/merge tokens across blocks, thereby making the architectures more efficient.

Chat is not available.