Workshop: Gaze meets ML
Learning to count visual objects by combining "what" and "where" in recurrent memory
Jessica Thompson · Hannah Sheahan · Christopher Summerfield
Counting the number of objects in a visual scene is easy for humans but challenging for modern deep neural networks. Here we explore what makes this problem hard and study the neural computations that allow transfer of counting ability to new objects and contexts. Previous work has implicated posterior parietal cortex (PPC) in numerosity perception and in visual scene understanding more broadly. It has been proposed that action-related saccadic signals computed in PPC provide object-invariant information about the number and arrangement of scene elements, and may contribute to relational reasoning in visual displays. Here, we built a glimpsing recurrent neural network that combines gaze contents ("what") and gaze location ("where") to count the number of items in a visual array. The network successfully learns to count and generalizes to several out-of-distribution test sets, including images with novel items. Through ablations and comparison to control models, we establish the contribution of brain-inspired computational principles to this generalization ability. This work provides a proof-of-principle demonstration that a neural network that combines "what" and "where" can learn a generalizable concept of numerosity and points to a promising approach for other visual reasoning tasks.