One of the key challenges for AI is to understand, predict, and model data over time. Pretrained networks should be able to temporally generalize, or adapt to shifts in data distributions that occur over time. Our current state-of-the-art (SOTA) still struggles to model and understand data over long temporal durations – for example, SOTA models are limited to processing several seconds of video, and powerful transformer models are still fundamentally limited by their attention spans. On the other hand, humans and other biological systems are able to flexibly store and update information in memory to comprehend and manipulate multimodal streams of input. Cognitive neuroscientists propose that they do so via the interaction of multiple memory systems with different neural mechanisms.

What types of memory systems and mechanisms already exist in our current AI models? First, there are extensions of the classic proposal that memories are formed via synaptic plasticity mechanisms – information can be stored in the static weights of a pre-trained network, or in fast weights that more closely resemble short-term plasticity mechanisms. Then there are persistent memory states, such as those in LSTMs or in external differentiable memory banks, which store information as neural activations that can change over time. Finally, there are models augmented with static databases of knowledge, akin to a high-precision long-term memory or semantic memory in humans.

When is it useful to store information in each one of these mechanisms, and how should models retrieve from them or modify the information therein? How should we design models that may combine multiple memory mechanisms to address a problem? Furthermore, do the shortcomings of current models require some novel memory systems that retain information over different timescales, or with different capacity or precision? Finally, what can we learn from memory processes in biological systems that may advance our models in AI? We aim to explore how a deeper understanding of memory mechanisms can improve task performance in many different application domains, such as lifelong / continual learning, reinforcement learning, computer vision, and natural language processing.
Fri 6:30 a.m. - 6:45 a.m. | Opening remarks (Talk) | Vy Vo
Fri 6:45 a.m. - 7:30 a.m. | Sepp Hochreiter: "Modern Hopfield Networks" (Keynote) | Sepp Hochreiter
Fri 7:30 a.m. - 8:15 a.m. | Sainbayar Sukhbaatar: "Brain-inspired memory models" (Keynote) | Sainbayar Sukhbaatar
Fri 8:15 a.m. - 8:20 a.m. | The Emergence of Abstract and Episodic Neurons in Episodic Meta-RL (Spotlight) | Badr AlKhamissi · Muhammad ElNokrashy · Michael Spranger
In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.
Fri 8:20 a.m. - 8:25 a.m. | Learning to Reason and Memorize with Self-Questioning (Spotlight) | Jack Lanchantin · Shubham Toshniwal · Jason E Weston · arthur szlam · Sainbayar Sukhbaatar
Large language models have been shown to struggle with limited context memory and multi-step reasoning [1]. We propose a simple method for solving both of these problems by allowing the model to ask questions and answer them. Unlike recent scratchpad approaches, the model can deviate from the input context at any time for self-questioning. This allows the model to recall information and perform reasoning on the fly as it reads the context, thus extending its memory and enabling multi-step reasoning. Our experiments on two synthetic tasks demonstrate that our method successfully generalizes to instances more complicated than those seen during training by performing self-questioning at inference time.
Fri 8:25 a.m. - 8:30 a.m. | Recall-gated plasticity as a principle of systems memory consolidation (Spotlight) | Jack Lindsey · Ashok Litwin-Kumar
In many species, behaviors, and neural circuits, learning and memory formation involves plasticity in two distinct neural pathways, and a process of consolidation between them. Here, we propose a model that captures common computational principles underlying these phenomena. The key component of our model is recall-gated consolidation, in which a long-term pathway prioritizes the storage of memory traces that are familiar to the short-term pathway. This mechanism shields long-term memory from spurious synaptic changes, enabling it to focus on reliable signal in the environment. We show that this model has significant advantages, substantially amplifying the signal-to-noise ratio with which intermittently reinforced memories are stored. In fact, we demonstrate mathematically that these advantages surpass what is achievable by synapse-local mechanisms alone, providing a normative motivation for systems (as opposed to synaptic) consolidation. We describe neural circuit implementations of our abstract model for different types of learning problems. These implementations involve learning rate modulation by factors such as prediction accuracy, confidence, or familiarity. Our model gives rise to a number of phenomena that are present in biological learning, such as spacing effects, task-dependent rates of consolidation, and different representations in the short and long-term pathways.
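To make the gating idea concrete, here is a toy sketch of recall-gated consolidation written for this summary (it is not the authors' model, and all constants and variable names are illustrative): a fast-decaying short-term pathway learns every incoming pattern, and the long-term pathway only updates in proportion to how familiar the current pattern already is to the short-term pathway, so intermittently reinforced patterns consolidate while one-off spurious patterns do not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
w_short = np.zeros(d)          # fast, decaying short-term pathway
w_long = np.zeros(d)           # slow long-term pathway
lr_short, lr_long, decay = 0.5, 0.1, 0.9

def familiarity(w, x):
    """Recall strength of pattern x in a pathway (normalized overlap, clipped at 0)."""
    return max(0.0, float(w @ x) / (float(x @ x) + 1e-8))

reinforced = rng.choice([-1.0, 1.0], size=d)     # pattern that recurs intermittently
for t in range(200):
    x = reinforced if rng.random() < 0.3 else rng.choice([-1.0, 1.0], size=d)
    gate = familiarity(w_short, x)               # recall-gated consolidation signal
    w_long += lr_long * gate * x                 # long-term storage only for familiar traces
    w_short = decay * w_short + lr_short * x     # short-term pathway learns everything, then decays

print(familiarity(w_long, reinforced))           # the recurring pattern dominates long-term memory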
Fri 8:30 a.m. - 8:35 a.m. | The Opportunistic PFC: Downstream Modulation of a Hippocampus-inspired Network is Optimal for Contextual Memory Recall (Spotlight) | Hugo Chateau-Laurent · Frederic Alexandre
Episodic memory serves as a store of individual experiences and allows for flexible adaptation to environment volatility and goal changes. The selection of episodic memories to recall is often considered to be driven by external sensory cues. Experimental studies suggest that this process is also influenced by internal cues, and that projections from the medial prefrontal cortex to the hippocampus play a role in this contextual modulation. In order to make sense of the biological configuration of prefrontal-to-hippocampus connectivity, we investigate the effectiveness of modulating various layers of a hippocampus-inspired neural network in a contextual memory task. Our results reveal that providing context information to the most downstream regions (i.e. last layers) of the model leads to better performance. In addition, the best average performance is obtained when contextual connections target the regions corresponding to the biological subfields that receive information from the prefrontal cortex, which provides a normative account of the biological connectivity. We relate this work to the need for augmenting reinforcement learning with flexible episodic memory.
Fri 8:35 a.m. - 10:55 a.m. | Poster session + Lunch (In-person poster session)
Fri 11:00 a.m. - 11:45 a.m. | Hava Siegelmann: "Lifelong learning and supporting memory" (Keynote) | Hava Siegelmann
Fri 11:45 a.m. - 12:30 p.m. | Ida Momennejad: "Neuro-inspired Memory in Reinforcement Learning: State of the art, Challenges, and Opportunities" (Keynote) | Ida Momennejad
Fri 12:30 p.m. - 12:45 p.m. | Afternoon break
Fri 12:45 p.m. - 1:30 p.m. | Janice Chen: "Memory for narratives" (Keynote) | Janice Chen
Fri 1:30 p.m. - 2:55 p.m. | Panel Discussion: Opportunities and Challenges (Discussion panel, in-person) | Kenneth Norman · Janice Chen · Samuel J Gershman · Albert Gu · Sepp Hochreiter · Ida Momennejad · Hava Siegelmann · Sainbayar Sukhbaatar
A discussion panel moderated by Prof. Ken Norman (Princeton). Panelists: Janice Chen (Johns Hopkins), Sam Gershman (Harvard), Albert Gu (Stanford / Carnegie Mellon Univ), Sepp Hochreiter (Johannes Kepler Univ), Ida Momennejad (Microsoft Research), Hava Siegelmann (UMass Amherst), Sainbayar Sukhbaatar (Meta AI)
Fri 2:55 p.m. - 3:00 p.m. | Closing remarks (Talk) | Mariya Toneva
Poster | Learning to Control Rapidly Changing Synaptic Connections: An Alternative Type of Memory in Sequence Processing Artificial Neural Networks | Kazuki Irie · Jürgen Schmidhuber
Short-term memory in standard, general-purpose, sequence-processing recurrent neural networks (RNNs) is stored as activations of nodes or "neurons." Generalizing feedforward NNs (FNNs) to such RNNs is mathematically straightforward and natural, and even historical: already in 1943, McCulloch and Pitts proposed this as a surrogate to "synaptic modifications," generalizing the Lenz-Ising model, the first RNN architecture of 1925. A lesser-known alternative approach to storing short-term memory in "synaptic connections" (by parameterising and controlling the dynamics of a context-sensitive time-varying weight matrix through another NN) yields another "natural" type of short-term memory in sequence-processing NNs: the Fast Weight Programmers (FWPs) of the early 1990s. FWPs have seen a recent revival as generic sequence processors, achieving competitive performance across various tasks. They are formally closely related to the now popular Transformers. Here we present them in the context of artificial NNs as an abstraction of biological NNs, a perspective that has not been stressed enough in previous FWP work. We first review aspects of FWPs for pedagogical purposes, then discuss connections to related works motivated by insights from neuroscience.
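As a concrete reference point, here is a minimal numpy sketch of the outer-product fast-weight update commonly used in FWPs and their linear-Transformer relatives; the dimensions, initialization, and the use of a single linear "slow" projection are illustrative assumptions, not this paper's specific construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # feature dimension (illustrative)
W_slow = rng.normal(size=(3 * d, d)) * 0.1     # "slow" weights of the programmer network
W_fast = np.zeros((d, d))                      # short-term memory: the fast weight matrix

def step(x, W_fast):
    """One sequence step: the slow net programs the fast weights, then queries them."""
    k, v, q = np.split(W_slow @ x, 3)          # key, value, query from the slow network
    W_fast = W_fast + np.outer(v, k)           # store the association k -> v in the fast weights
    y = W_fast @ q                             # retrieve with the query
    return y, W_fast

for t in range(5):
    x = rng.normal(size=d)
    y, W_fast = step(x, W_fast)
```

Here the fast weight matrix plays the role of the time-varying "synaptic connections" described in the abstract, while the slow weights correspond to conventional long-term parameters.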
Poster | Constructing compressed number lines of latent variables using a cognitive model of memory and deep neural networks | Sahaj Singh Maini · James Mochizuki-Freeman · Chirag Shankar Indi · Brandon Jacques · Per B Sederberg · Marc Howard · Zoran Tiganj
Humans use log-compressed number lines to represent different quantities, including elapsed time, traveled distance, numerosity, sound frequency, etc. Inspired by recent cognitive science and computational neuroscience work, we developed a neural network that learns to construct log-compressed number lines from a cognitive model of working memory. The network computes a discrete approximation of a real-domain Laplace transform using an RNN with analytically derived weights, giving rise to a log-compressed timeline of the past. The network learns to extract latent variables from the input and uses them for global modulation of the recurrent weights, turning a timeline into a number line over relevant dimensions. The number line representation greatly simplifies learning on a set of problems that require learning associations in different spaces, problems that humans can typically solve easily. This approach illustrates how combining deep learning with cognitive models can result in systems that learn to represent latent variables in a brain-like manner and exhibit human-like behavior manifested through the Weber-Fechner law.
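A hedged sketch of the memory component described above: a bank of leaky integrators with log-spaced decay rates computes a discrete approximation of a real-domain Laplace transform of the input history, which yields a log-compressed timeline of the past. The rate grid and step size are arbitrary, and the inverse transform and learned modulation from the paper are omitted.

```python
import numpy as np

# log-spaced decay rates s_i give logarithmic compression of past time
s = np.logspace(-2, 1, 50)
F = np.zeros_like(s)            # Laplace-domain memory state
dt = 0.1

def update(F, f_t):
    """Euler step of dF/dt = -s * F + f(t): a bank of leaky integrators."""
    return F + dt * (-s * F + f_t)

# feed an impulse, then let the memory evolve without input
F = update(F, 1.0)
for _ in range(100):
    F = update(F, 0.0)
# F now encodes how long ago the impulse occurred, compressed logarithmically across units
```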
Poster | Experiences from the MediaEval Predicting Media Memorability Task | Alba Garcia Seco de Herrera · Mihai Gabriel Constantin · Claire-Helene Demarty · Camilo Fosco · Sebastian Halder · Graham Healy · Bogdan Ionescu · Ana Matran-Fernandez · Alan F Smeaton · Mushfika Sultana
The Predicting Media Memorability task in the MediaEval evaluation campaign has been running annually since 2018 and several different tasks and data sets have been used in this time. This has allowed us to compare the performance of many memorability prediction techniques on the same data and in a reproducible way and to refine and improve on those techniques. The resources created to compute media memorability are now being used by researchers well beyond the actual evaluation campaign. In this paper we present a summary of the evaluation campaign including the collective lessons we have learned for the research community.
Poster | Meta-Learning General-Purpose Learning Algorithms with Transformers | Louis Kirsch · Luke Metz · James Harrison · Jascha Sohl-Dickstein
Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general purpose learning algorithms from scratch, using only black box models with minimal inductive bias. A general purpose learning algorithm is one which takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general purpose learning algorithms, and can generalize to learn on different datasets than used during meta-training. We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks used during meta-training, and meta-optimization hyper-parameters. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count.
Poster | Evidence accumulation in deep RL agents powered by a cognitive model | James Mochizuki-Freeman · Sahaj Singh Maini · Zoran Tiganj
Evidence accumulation is thought to be fundamental for decision-making in humans and other mammals. Neuroscience studies suggest that the hippocampus encodes a low-dimensional ordered representation of evidence through sequential neural activity. Cognitive modelers have proposed a mechanism by which such sequential activity could emerge through the modulation of recurrent weights with a change in the amount of evidence. Here we integrated a cognitive science model inside a deep Reinforcement Learning (RL) agent and trained the agent to perform a simple evidence accumulation task inspired by the behavioral experiments on animals. We compared the agent's performance with the performance of agents equipped with GRUs and RNNs. We found that the agent based on a cognitive model was able to learn much faster and generalize better while having significantly fewer parameters. This study illustrates how integrating cognitive models and deep learning systems can lead to brain-like neural representations that can improve learning.
Poster | Using Hippocampal Replay to Consolidate Experiences in Memory-Augmented Reinforcement Learning | Chong Min John Tan · Mehul Motani
Reinforcement Learning (RL) agents traditionally face difficulties learning in sparse reward settings. Go-Explore is a state-of-the-art algorithm that learns well in spite of sparse reward, largely due to storing experiences in external memory and updating this memory with better trajectories. We improve upon this method and introduce a more efficient count-based approach for both state selection (
Poster | Training language models for deeper understanding improves brain alignment | Khai Loong Aw · Mariya Toneva
Building systems that understand information across long contexts is one important goal in natural language processing (NLP). One approach in recent works is to scale up model architectures to accept very long inputs and then train them on datasets to learn to extract critical information from long input texts. However, it is still an open question whether these models are simply learning a heuristic to solve the tasks, or really learning to understand information across long contexts. This work investigates this further by turning to the one system with truly long-range and deep language understanding: the human brain. We show that training language models for long-range narrative understanding results in richer representations that have improved alignment to human brain activity. This suggests they have indeed improved understanding across long contexts. However, although these models can take in thousands of input words, their brain alignment peaks after only 500 words. This suggests possible limitations with either model training or architecture. Overall, our findings have consequences both for cognitive neuroscience by revealing some of the significant factors behind brain-NLP alignment, and for NLP by highlighting limitations with existing approaches for longer-range understanding.
Poster | Transformer needs NMDA receptor nonlinearity for long-term memory | Dong-Kyum Kim · Jea Kwon · Meeyoung Cha · C. Lee
The NMDA receptor (NMDAR) in the hippocampus is essential for learning and memory. We find an interesting resemblance between deep models' nonlinear activation function and the NMDAR's nonlinear dynamics. In light of a recent study that compared the transformer architecture to the formation of hippocampal memory, this paper presents new findings that NMDAR-like nonlinearity may be essential for consolidating short-term working memory into long-term reference memory. We design a navigation task assessing these two memory functions and show that manipulating the activation function (i.e., mimicking the Mg$^{2+}$-gating of NMDAR) disrupts long-term memory formation. Our experimental data suggest that the concept of place cells and reference memory may reside in the feed-forward network and that nonlinearity plays a key role in these processes. Our findings suggest that the transformer architecture and hippocampal spatial representations resemble each other in their shared reliance on NMDAR-like nonlinearity.
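For illustration only, here is a sketch of the kind of activation manipulation the abstract describes: the input is passed through a sigmoidal gate that mimics a voltage-dependent Mg2+ block, with a hypothetical parameter alpha controlling how strongly small inputs are suppressed. This is not the authors' exact formula; it only conveys the shape of the manipulation.

```python
import numpy as np

def nmda_like(x, alpha=0.0):
    """Illustrative NMDAR-style activation: the input is gated by a sigmoidal
    'Mg2+ block' term; larger alpha means the gate opens only for larger inputs.
    (Not the paper's exact formulation.)"""
    return x / (1.0 + np.exp(-(x - alpha)))

x = np.linspace(-4, 4, 9)
print(nmda_like(x, alpha=0.0))   # close to a SiLU/GELU-like curve
print(nmda_like(x, alpha=2.0))   # stronger block: small inputs are suppressed more
```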
Poster | Neural networks learn an environment's geometry in latent space by performing predictive coding on visual scenes | James Gornet · Matt Thomson
Humans navigate complex environments using only visual cues and self-motion. Mapping an environment is an essential task for navigation within a physical space; neuroscientists and cognitive scientists also postulate that mapping algorithms underlie cognition by mapping concepts, memories, and other nonspatial variables. Despite the broad importance of mapping algorithms in neuroscience, it is not clear how neural networks can build spatial maps exclusively from sensor observations without access to the environment's coordinates through reinforcement learning or supervised learning. Path integration, for example, implicitly needs the environment's coordinates to predict how past velocities translate into the current position. Here we show that predicting sensory observations (predictive coding) extends path integration by removing this implicit dependence on the environment's coordinates. Specifically, a neural network constructs an environmental map in its latent space by predicting visual input. As the network traverses complex environments in Minecraft, spatial proximity between object positions affects distances in the network's latent space. The relationship depends on the uniqueness of the environment's visual scene as measured by the mutual information between the images and spatial position. Predictive coding extends to any sequential dataset; observations from paths traversing a manifold can generate such sequential data. We anticipate that neural networks performing predictive coding will identify the underlying manifold without requiring the manifold's coordinates.
Poster | Low Resource Retrieval Augmented Adaptive Neural Machine Translation | Vivek Harsha Lakkamaneni · Swair Shah · Anurag Beniwal · Narayanan Sadagopan
We propose KNN-Kmeans MT, a sample-efficient algorithm that improves retrieval-based augmentation performance in low resource settings by adding an additional K-means filtering layer after the KNN step. KNN-Kmeans MT, like its predecessor retrieval augmented machine translation approaches, doesn't require any additional training and outperforms the existing methods in low resource settings. The additional K-means step makes the model more robust to noise. We benchmark our proposed approach on the EMEA and JRC-Acquis datasets and see a 0.2-point improvement in BLEU score on average in low resource settings. More importantly, the trend of improvement from high- to low-resource settings is consistent across both datasets. We conjecture that the observed improvement is a consequence of eliminating bad neighbors: in low resource settings the retrieval databases are small, so retrieving a fixed number of neighbors adds noise to the model. The simplicity of the approach makes it a promising direction for opening up the use of retrieval augmentation in low resource settings.
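A hedged sketch of the two-stage retrieval described above: standard KNN retrieval from the datastore followed by a K-means filtering step that keeps only the retrieved neighbors belonging to the cluster closest to the query. The neighbor count, cluster count, and exact filtering rule here are assumptions for illustration, not the authors' specification.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
datastore_keys = rng.normal(size=(1000, 16))   # toy datastore of decoder states
query = rng.normal(size=(1, 16))

# Stage 1: plain KNN retrieval
knn = NearestNeighbors(n_neighbors=32).fit(datastore_keys)
_, idx = knn.kneighbors(query)
neighbors = datastore_keys[idx[0]]

# Stage 2: cluster the retrieved neighbors and keep the cluster nearest to the query
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(neighbors)
best = np.argmin(np.linalg.norm(km.cluster_centers_ - query, axis=1))
filtered = neighbors[km.labels_ == best]       # neighbors kept for the final kNN-MT distribution
```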
Poster | Exploring The Precision of Real Intelligence Systems at Synapse Resolution | Mohammad Samavat · Tom Bartol · Kristen Harris · Terrence Sejnowski
Synapses are the fundamental units of storage of information in neural circuits, and their structure and strength are adjusted through synaptic plasticity. Hence, exploring different aspects of synaptic plasticity processes in the hippocampus is crucial for understanding mechanisms of learning and memory, for improving artificial intelligence algorithms, and for neuromorphic computers. The scope of this manuscript is to explore the precision of this synaptic plasticity. Here we measured the precision of multiple synaptic features (spine head volume, postsynaptic density, spine neck diameter, spine neck length, and number of docked vesicles). We conclude by suggesting a surrogate for synaptic weight/strength and formulating a new hypothesis on synaptic plasticity precision. Results show that synaptic plasticity is highly precise and that subcellular resources such as mitochondria have an impact on it.
Poster | Cache-memory gated graph neural networks | Guixiang Ma · Vy Vo · Nesreen K. Ahmed · Theodore Willke
While graph neural networks (GNNs) provide a powerful way to learn structured representations, it remains challenging to learn long-range dependencies between graph nodes. Recurrent gated GNNs only partly address this problem. We introduce a memory augmentation to a gated GNN which simply stores the previous hidden states in a cache. We show that the cache-memory gated GNN outperforms other models on a synthetic task that requires long-range information, as well as tasks on real-world datasets.
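A minimal sketch of the cache idea, with assumed details: a gated graph layer stores each previous hidden state in a cache and lets the node update read from that cache (here via simple dot-product attention), so information can skip across many propagation steps. The gating form, attention, and sizes are illustrative rather than the paper's exact architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, T = 5, 8, 6                                # nodes, hidden size, propagation steps
A = (rng.random((n, n)) < 0.4).astype(float)     # toy adjacency matrix
W_msg = rng.normal(size=(d, d)) * 0.1
W_gate = rng.normal(size=(d, d)) * 0.1
h = rng.normal(size=(n, d))
cache = []                                       # stores previous hidden states

for t in range(T):
    msg = A @ h @ W_msg                          # neighbor aggregation
    if cache:                                    # read long-range information from the cache
        mem = np.stack(cache, axis=1)                        # (n, t, d)
        att = softmax(np.einsum('nd,ntd->nt', h, mem))       # attention over cached states
        read = np.einsum('nt,ntd->nd', att, mem)
    else:
        read = np.zeros_like(h)
    gate = 1.0 / (1.0 + np.exp(-(h @ W_gate)))   # simple update gate
    h = gate * h + (1.0 - gate) * np.tanh(msg + read)
    cache.append(h.copy())                       # append the current state to the cache
```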
Poster | CL-LSG: Continual Learning via Learnable Sparse Growth | Li Yang · Sen Lin · Junshan Zhang · Deliang Fan
Continual learning (CL) has been developed to learn new tasks sequentially and perform knowledge transfer from the old tasks to the new ones without suffering from what is well known as catastrophic forgetting. While recent structure-based learning methods show the capability of alleviating the forgetting problem, these methods require a complex learning process that gradually grows and prunes a full-size network for each task, which is inefficient. To address this problem and enable efficient network expansion for new tasks, to the best of our knowledge, we are the first to develop a learnable sparse growth (LSG) method, which explicitly optimizes the model growth to only select important and necessary channels for growing. Building on the LSG, we then propose CL-LSG, a novel end-to-end CL framework to grow the model for each new task dynamically and sparsely. Different from all previous structure-based CL methods that start from and then prune (i.e., two-step) a full-size network, our framework starts from a compact seed network with a much smaller size and grows to the necessary model size (i.e., one-step) for each task, which eliminates the need for additional pruning in previous structure-based growing methods.
Poster | Toward Semantic History Compression for Reinforcement Learning | Fabian Paischer · Thomas Adler · Andreas Radler · Markus Hofmarcher · Sepp Hochreiter
Agents interacting under partial observability require access to past observations via a memory mechanism in order to approximate the true state of the environment. Recent work suggests that leveraging language as abstraction provides benefits for creating a representation of past events. History Compression via Language Models (HELM) leverages a pretrained Language Model (LM) for representing the past. It relies on a randomized attention mechanism to translate environment observations to token embeddings. In this work, we show that the representations resulting from this attention mechanism can collapse under certain conditions. This causes blindness of the agent to certain subtleties in the environment. We propose a solution to this problem consisting of two parts. First, we improve upon HELM by substituting the attention mechanism with a feature-wise centering-and-scaling operation. Second, we take a step toward semantic history compression by encoding the observations with a pretrained multimodal model such as CLIP, which further improves performance. With these improvements our model is able to solve the challenging MiniGrid-Memory environment. Surprisingly, however, our experiments suggest that this is not due to the semantic enrichment of the representation presented to the LM but only due to the discriminative power provided by CLIP.
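A hedged sketch of the proposed substitution: instead of a randomized attention mapping, the observation embedding (e.g., from CLIP) is standardized and rescaled to match the statistics of the frozen LM's token-embedding space before being fed to the LM. The global mean and standard deviation used here are placeholders and may differ from the authors' exact feature-wise operation.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_embedding = rng.normal(loc=3.0, scale=5.0, size=512)                # stand-in for a CLIP embedding
token_embeddings = rng.normal(loc=0.0, scale=0.02, size=(5000, 512))    # stand-in for LM vocabulary embeddings

def center_and_scale(e, target):
    """Standardize the observation embedding, then match the LM embedding statistics."""
    e = (e - e.mean()) / (e.std() + 1e-8)
    return e * target.std() + target.mean()

lm_input = center_and_scale(obs_embedding, token_embeddings)
# lm_input can now be passed to the frozen language model in place of a token embedding
```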
Poster | Learning at Multiple Timescales | Matt Jones
Natural environments have temporal structure at multiple timescales, a property that is reflected in biological learning and memory but typically not in machine learning systems. This paper advances a multiscale learning model in which each weight in a neural network is a sum of subweights learning independently at different timescales. A special case of this model is a fast-weights scheme, in which each original weight is augmented with a fast weight that rapidly learns and decays, enabling adaptation to distribution shifts during online learning. We then prove that more complicated models that assume coupling between timescales are equivalent to the multiscale learner, via a reparameterization that eliminates the coupling. Finally, we prove that momentum learning is equivalent to fast weights with a negative learning rate, offering a new perspective on how and when momentum is beneficial.
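A minimal sketch of the fast-weights special case described above, with illustrative hyperparameters: every effective weight is the sum of a slow subweight and a fast subweight; both receive the same gradient, but the fast one takes large steps and decays toward zero, adapting to transient distribution shifts while the slow one accumulates stable structure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_slow = np.zeros(d)
w_fast = np.zeros(d)
lr_slow, lr_fast, decay = 0.01, 0.5, 0.9            # illustrative values

def effective_weights():
    return w_slow + w_fast                           # the model always uses the sum

for step in range(100):
    x = rng.normal(size=d)
    target = x @ np.array([1.0, -2.0, 0.5, 0.0])     # toy regression target
    pred = x @ effective_weights()
    grad = (pred - target) * x                       # gradient of squared error w.r.t. the weights
    w_slow -= lr_slow * grad                         # slow subweight: small, persistent updates
    w_fast = decay * w_fast - lr_fast * grad         # fast subweight: large steps, exponential decay
```

Per the abstract, flipping the sign of the fast learning rate in this kind of scheme recovers momentum learning; the derivation is in the paper.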
Poster | Neural Network Online Training with Sensitivity to Multiscale Temporal Structure | Matt Jones · Tyler Scott · Gamaleldin Elsayed · Mengye Ren · Katherine Hermann · David Mayo · Michael Mozer
Many online-learning domains in artificial intelligence involve data with nonstationarities spanning a wide range of timescales. Heuristic approaches to nonstationarity include retraining models frequently with only the freshest data and using iterative gradient-based updating methods that implicitly discount older data. We propose an alternative approach based on Bayesian inference over $1/f$ noise. The method is cast as a Kalman filter that posits latent variables with various characteristic timescales and maintains a joint posterior over them. We also derive a variational approximation that tracks these variables independently. The variational method can be implemented as a drop-in optimizer for any neural network architecture, which works by decomposing each weight as a sum of subweights with different decay rates. We test these methods on two synthetic, online-learning tasks with environmental parameters varying across time according to $1/f$ noise. Baseline methods based on finite memory show a nonmonotonic relationship between memory horizon and performance, a signature of data going "stale." The Bayesian and variational methods perform significantly better by leveraging all past data and performing appropriate inference at all timescales.
Poster | Mixed-Memory RNNs for Learning Long-term Dependencies in Irregularly-sampled Time Series | Mathias Lechner · Ramin Hasani
Recurrent neural networks (RNNs) with continuous-time hidden states are a natural fit for modeling irregularly-sampled time series. These models, however, face difficulties when the input data possess long-term dependencies. We show that similar to standard RNNs, the underlying reason for this issue is the vanishing or exploding of the gradient during training. This phenomenon is expressed by the ordinary differential equation (ODE) representation of the hidden state, regardless of the ODE solver's choice. We provide a solution by equipping arbitrary continuous-time networks with a memory compartment separated from their time-continuous state. This way, we encode a continuous-time dynamic flow within the RNN, allowing it to respond to inputs arriving at arbitrary time lags while ensuring a constant error propagation through the memory path. We call these models Mixed-Memory-RNNs (mmRNNs). We experimentally show that Mixed-Memory-RNNs outperform recently proposed RNN-based counterparts on non-uniformly sampled data with long-term dependencies.
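An illustrative sketch (not the authors' equations) of the separation described above: a time-continuous hidden state advanced by an ODE solver step for each irregular time gap, plus a gated memory compartment updated additively, LSTM-style, so that error can propagate through it at a constant rate. Weight shapes and the gating form are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 6
Wh = rng.normal(size=(d, 2 * d)) * 0.1       # weights of the continuous-time state update
Wc = rng.normal(size=(3 * d, 2 * d)) * 0.1   # weights of the memory-compartment gates
h = np.zeros(d)                              # time-continuous hidden state
c = np.zeros(d)                              # separate memory compartment

def step(x, h, c, dt):
    # continuous-time part: one Euler step of dh/dt = -h + tanh(W [x, h])
    h = h + dt * (-h + np.tanh(Wh @ np.concatenate([x, h])))
    # memory part: gated additive update, giving a constant error path through c
    gates = Wc @ np.concatenate([h, x])
    i, f, g = sigmoid(gates[:d]), sigmoid(gates[d:2 * d]), np.tanh(gates[2 * d:])
    c = f * c + i * g
    return h, c

for dt in [0.1, 0.5, 0.05]:                  # irregular sampling intervals
    x = rng.normal(size=d)
    h, c = step(x, h, c, dt)
```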
Poster | Transformers generalize differently from information stored in context vs in weights | Stephanie Chan · Ishita Dasgupta · Junkyung Kim · Dharshan Kumaran · Andrew Lampinen · Felix Hill
Transformer models can use two fundamentally different kinds of information: information stored in weights during training, and information provided "in-context" at inference time. In this work, we show that transformers exhibit different inductive biases in how they represent and generalize from the information in these two sources. In particular, we characterize whether they generalize via parsimonious rules (rule-based generalization) or via direct comparison with observed examples (exemplar-based generalization). This is of important practical consequence, as it informs whether to encode information in weights or in context, depending on how we want models to use that information. In transformers trained on controlled stimuli, we find that generalization from weights is more rule-based whereas generalization from context is largely exemplar-based. In contrast, we find that in transformers pre-trained on natural language, in-context learning is significantly rule-based, with larger models showing more rule-basedness. We hypothesise that rule-based generalization from in-context information might be an emergent consequence of large-scale training on language, which has sparse rule-like structure. Using controlled stimuli, we verify that transformers pretrained on data containing sparse rule-like structure exhibit more rule-based generalization.
Poster | Memory in humans and deep language models: Linking hypotheses for model augmentation | Omri Raccah · Phoebe Chen · Theodore Willke · David Poeppel · Vy Vo
The computational complexity of the self-attention mechanism in Transformer models significantly limits their ability to generalize over long temporal durations. Memory-augmentation, or the explicit storing of past information in external memory for subsequent predictions, has become a constructive avenue for mitigating this limitation. We argue that memory-augmented Transformers can benefit substantially from considering insights from the memory literature in humans. We detail an approach to integrating evidence from the human memory system through the specification of cross-domain linking hypotheses. We then provide an empirical demonstration to evaluate the use of surprisal as a linking hypothesis, and further identify the limitations of this approach to inform future research.
Poster | Characterizing Verbatim Short-Term Memory in Neural Language Models | Kristijan Armeni · Christopher J Honey · Tal Linzen
When a language model is trained to predict natural language sequences, its prediction at each moment depends on a representation of prior context. What kind of information about the prior context can language models retrieve? We tested whether language models could retrieve the exact words that occurred previously in a text. In our paradigm, language models (transformers and LSTMs) processed English text in which a list of nouns occurred twice. We operationalized memory retrieval as the reduction in surprisal from the first to the second list. We found that the transformers retrieved both the identity and ordering of nouns from the first list. Further, the transformers' retrieval was markedly enhanced when they were trained on a larger corpus and with greater model depth. Lastly, their ability to index prior tokens was dependent on learned attention patterns. In contrast, LSTMs exhibited less precise retrieval, which was limited to list-initial tokens and to short intervening texts. The LSTMs' retrieval was not sensitive to the order of nouns and it improved when the list was semantically coherent. We conclude that large transformer-style language models implement something akin to a working memory system that can flexibly retrieve individual token representations across arbitrary delays; conversely, conventional LSTMs maintain a coarser semantic gist of prior tokens, weighted toward the earliest items.
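A minimal sketch of the surprisal measure behind this paradigm, assuming a HuggingFace GPT-2 model; the example text is a placeholder. In the paradigm, per-token surprisal is averaged separately over the first and second occurrence of the noun list, and the drop from the first to the second quantifies verbatim retrieval.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text):
    """Surprisal (negative log probability) of each token given its left context."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    return -logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

# a noun list occurs twice; averaging surprisal over each occurrence and taking the
# difference (first minus second) operationalizes memory retrieval
text = ("She packed the patience, the onion, the ladder. After a long walk she wrote "
        "them down again: the patience, the onion, the ladder.")
print(token_surprisals(text).mean())
```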
Poster | Informing generative replay for continual learning with long-term memory formation in the fruit fly | Brian Robinson · Justin Joyce · Raphael Norman-Tenazas · Gautam Vallabha · Erik Johnson
Continual learning without catastrophic forgetting is a challenge for artificial systems but it is done naturally across a range of biological systems, including in insects. A recurrent circuit has been identified in the fruit fly mushroom body to consolidate long term memories (LTM), but there is not currently an algorithmic understanding of this LTM formation. We hypothesize that generative replay is occurring to consolidate memories in this recurrent circuit, and find anatomical evidence in synapse-level connectivity that supports this hypothesis. Next, we introduce a computational model which combines a short-term memory (STM) and LTM phase to perform generative replay based continual learning. When evaluated on a CIFAR-100 class-incremental continual learning task, the modeled LTM phase increases classification performance by 20% and approaches within 2% of the performance for a non-incremental upper bound baseline. Unique elements of the proposed generative replay model include: 1) coupling high dimensional sparse activation patterns with generative replay and 2) sampling and reconstructing higher level representations for training generative replay (as opposed to reconstructing sensory-level or processed sensory-level representations). Additionally, we make the experimentally testable prediction that a specific set of synapses would need to undergo experience-dependent plasticity during LTM formation to support our generative replay-based model.
Poster | Learning to Reason and Memorize with Self-Questioning | Jack Lanchantin · Shubham Toshniwal · Jason E Weston · arthur szlam · Sainbayar Sukhbaatar
Large language models have been shown to struggle with limited context memory and multi-step reasoning [1]. We propose a simple method for solving both of these problems by allowing the model to ask questions and answer them. Unlike recent scratchpad approaches, the model can deviate from the input context at any time for self-questioning. This allows the model to recall information and perform reasoning on the fly as it reads the context, thus extending its memory and enabling multi-step reasoning. Our experiments on two synthetic tasks demonstrate that our method successfully generalizes to instances more complicated than those seen during training by performing self-questioning at inference time.
Poster | Recall-gated plasticity as a principle of systems memory consolidation | Jack Lindsey · Ashok Litwin-Kumar
In many species, behaviors, and neural circuits, learning and memory formation involves plasticity in two distinct neural pathways, and a process of consolidation between them. Here, we propose a model that captures common computational principles underlying these phenomena. The key component of our model is recall-gated consolidation, in which a long-term pathway prioritizes the storage of memory traces that are familiar to the short-term pathway. This mechanism shields long-term memory from spurious synaptic changes, enabling it to focus on reliable signal in the environment. We show that this model has significant advantages, substantially amplifying the signal-to-noise ratio with which intermittently reinforced memories are stored. In fact, we demonstrate mathematically that these advantages surpass what is achievable by synapse-local mechanisms alone, providing a normative motivation for systems (as opposed to synaptic) consolidation. We describe neural circuit implementations of our abstract model for different types of learning problems. These implementations involve learning rate modulation by factors such as prediction accuracy, confidence, or familiarity. Our model gives rise to a number of phenomena that are present in biological learning, such as spacing effects, task-dependent rates of consolidation, and different representations in the short and long-term pathways.
Poster | The Opportunistic PFC: Downstream Modulation of a Hippocampus-inspired Network is Optimal for Contextual Memory Recall | Hugo Chateau-Laurent · Frederic Alexandre
Episodic memory serves as a store of individual experiences and allows for flexible adaptation to environment volatility and goal changes. The selection of episodic memories to recall is often considered to be driven by external sensory cues. Experimental studies suggest that this process is also influenced by internal cues, and that projections from the medial prefrontal cortex to the hippocampus play a role in this contextual modulation. In order to make sense of the biological configuration of prefrontal-to-hippocampus connectivity, we investigate the effectiveness of modulating various layers of a hippocampus-inspired neural network in a contextual memory task. Our results reveal that providing context information to the most downstream regions (i.e. last layers) of the model leads to better performance. In addition, the best average performance is obtained when contextual connections target the regions corresponding to the biological subfields that receive information from the prefrontal cortex, which provides a normative account of the biological connectivity. We relate this work to the need for augmenting reinforcement learning with flexible episodic memory.
Poster | Multiple Modes for Continual Learning | Siddhartha Datta · Nigel Shadbolt
Adapting model parameters to incoming streams of data is a crucial factor to deep learning scalability. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, else drift away from the subspace and forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely subpopulation, domain, and task shift.
Poster | The Emergence of Abstract and Episodic Neurons in Episodic Meta-RL | Badr AlKhamissi · Muhammad ElNokrashy · Michael Spranger
In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.
Poster | Self-recovery of memory via generative replay | Zhenglong Zhou · Geshi Yeung · Anna Schapiro
A remarkable capacity of the brain is its ability to autonomously reorganize memories during offline periods. Memory replay, a mechanism hypothesized to underlie biological offline learning, has inspired offline methods for reducing forgetting in artificial neural networks in continual learning settings. A memory-efficient and neurally-plausible method is generative replay, which achieves state of the art performance on continual learning benchmarks. However, unlike the brain, standard generative replay does not self-reorganize memories when trained offline on its own replay samples. We propose a novel architecture that augments generative replay with a brain-like capacity to autonomously recover memories. We demonstrate this capacity of the architecture across several continual learning tasks and environments.
Poster | Associative memory via covariance-learning predictive coding networks | Mufeng Tang · Tommaso Salvatori · Yuhang Song · Beren Millidge · Thomas Lukasiewicz · Rafal Bogacz
Classical models of biological memory assume that associative memory (AM) in the hippocampus is achieved by learning a covariance matrix of simulated neural activities. However, it has been also proposed that AM in the hippocampus could be explained in the predictive coding framework. These two seemingly disparate computational principles pose difficulties for developing a unitary theory of memory storage and recall in the brain. In this work, we address this dichotomy using a family of covariance-learning predictive coding networks (covPCNs). We show that earlier predictive coding networks (PCNs) explicitly learning the covariance matrix perform AM, but their learning rule is non-local and unstable. We propose a novel model that implicitly learns the covariance matrix with Hebbian plasticity and stably converges to the same memory retrieval as the earlier models. We further show that this model can be combined with hierarchical PCNs to model the hippocampo-neocortical interactions. In practice, our models can store a large number of memories of structured images and retrieve them with high fidelity.
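For orientation, here is a sketch of the classical Hebbian, covariance-style associative memory that this line of work takes as its starting point (it is not the proposed predictive-coding model): patterns are stored in a weight matrix of outer products, and a corrupted cue is retrieved with simple recurrent dynamics. Sizes and corruption level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 64
patterns = rng.choice([-1.0, 1.0], size=(N, d))
M = patterns.T @ patterns / d                   # Hebbian weight matrix (covariance-style for zero-mean patterns)
np.fill_diagonal(M, 0.0)

x = patterns[0].copy()
x[:8] *= -1.0                                   # corrupt part of the cue
for _ in range(5):                              # recurrent retrieval dynamics
    x = np.where(M @ x >= 0, 1.0, -1.0)
print((x == patterns[0]).mean())                # close to 1.0 when retrieval succeeds
```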
Poster | Interpolating Compressed Parameter Subspaces | Siddhartha Datta · Nigel Shadbolt
Though distribution shifts have caused growing concern for machine learning scalability, solutions tend to specialize towards a specific type of distribution shift. We learn that constructing a Compressed Parameter Subspace (CPS), a geometric structure representing distance-regularized parameters mapped to a set of train-time distributions, can maximize average accuracy over a broad range of distribution shifts concurrently. We show sampling parameters within a CPS can mitigate backdoor, adversarial, permutation, stylization and rotation perturbations. Regularizing a hypernetwork with CPS can also reduce task forgetting.
Poster | A Universal Abstraction for Hierarchical Hopfield Networks | Benjamin Hoover · Duen Horng Chau · Hendrik Strobelt · Dmitry Krotov
Conceptualized as Associative Memory, Hopfield Networks (HNs) are powerful models which describe neural network dynamics converging to a local minimum of an energy function. HNs are conventionally described by a neural network with two layers connected by a matrix of synaptic weights. However, it is not well known that the Hopfield framework generalizes to systems in which many neuron layers and synapses work together as a unified Hierarchical Associative Memory (HAM) model: a single network described by memory retrieval dynamics (convergence to a fixed point) and governed by a global energy function. In this work we introduce a universal abstraction for HAMs using the building blocks of neuron layers (nodes) and synapses (edges) connected within a hypergraph. We implement this abstraction as a software framework, written in JAX, whose autograd feature removes the need to derive update rules for the complicated energy-based dynamics. Our framework, called HAMUX (HAM User eXperience), enables anyone to build and train hierarchical HNs using familiar operations like convolutions and attention alongside activation functions like Softmaxes, ReLUs, and LayerNorms. HAMUX is a powerful tool to study HNs at scale, something that has never been possible before. We believe that HAMUX lays the groundwork for a new type of AI framework built around dynamical systems and energy-based associative memories.
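For background, here is a generic modern-Hopfield retrieval step of the kind these energy-based associative memories are built from: a softmax-weighted readout of stored patterns, iterated to a fixed point. This is textbook material, not the HAMUX API; the sizes and inverse temperature are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
memories = rng.normal(size=(16, 32))               # stored patterns (one per row)
beta = 4.0                                         # inverse temperature of the energy function

state = memories[3] + 0.5 * rng.normal(size=32)    # noisy query
for _ in range(5):                                 # descend the energy toward a fixed point
    state = softmax(beta * memories @ state) @ memories

print(np.argmax(memories @ state))                 # ideally recovers pattern 3
```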
Poster | Differentiable Neural Computers with Memory Demon | Ari Azarafrooz
A Differentiable Neural Computer (DNC) is a neural network with an external memory which allows for iterative content modification via read, write and delete operations. We show that information theoretic properties of the memory contents play an important role in the performance of such architectures. We introduce a novel concept of memory demon to DNC architectures which modifies the memory contents implicitly via additive input encoding. The goal of the memory demon is to maximize the expected sum of mutual information of the consecutive external memory contents.
Poster | Leveraging Episodic Memory to Improve World Models for Reinforcement Learning | Julian Coda-Forno · Changmin Yu · Qinghai Guo · Zafeirios Fountas · Neil Burgess
Poor sample efficiency plagues the practical applicability of deep reinforcement learning (RL) algorithms, especially compared to biological intelligence. In order to close the gap, previous work has proposed to augment the RL framework with an analogue of biological episodic memory, leading to the emerging field of "episodic control". Episodic memory refers to the ability to recollect individual events independent of the slower process of learning accumulated statistics, and evidence suggests that humans can use episodic memory for planning. Existing attempts to integrate episodic memory components into RL agents have mostly focused on the model-free domain, leaving scope for investigating their roles under the model-based settings. Here we propose the Episodic Memory Module (EMM) to aid learning of world-model transitions, instead of value functions for standard Episodic-RL. The EMM stores latent state transitions that have high prediction-error under the model as memories, and uses linearly interpolated memories when the model shows high epistemic uncertainty. Memories are dynamically forgotten with a timescale reflecting their continuing surprise and uncertainty. Implemented in combination with existing world-model agents, the EMM produces a significant boost in performance over baseline agents on complex Atari games such as Montezuma's Revenge. Our results indicate that the EMM can temporarily fill in gaps while a world model is being learned, giving significant advantages in complex environments where such learning is slow.
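A hedged sketch of the memory logic described above, with assumed data structures and thresholds: transitions whose world-model prediction error exceeds a threshold are stored; when the model is uncertain, a query is answered by distance-weighted interpolation over the nearest stored transitions; entries are forgotten as their surprise-driven strength decays.

```python
import numpy as np

rng = np.random.default_rng(0)

class EpisodicMemoryModule:
    """Toy episodic store for latent transitions (illustrative, not the paper's EMM)."""
    def __init__(self, k=3, decay=0.99):
        self.keys, self.values, self.strength = [], [], []
        self.k, self.decay = k, decay

    def maybe_store(self, z, z_next, pred_error, threshold=1.0):
        if pred_error > threshold:                       # keep only surprising transitions
            self.keys.append(z)
            self.values.append(z_next)
            self.strength.append(pred_error)

    def lookup(self, z):
        if not self.keys:
            return None
        K = np.stack(self.keys)
        d = np.linalg.norm(K - z, axis=1)
        idx = np.argsort(d)[: self.k]
        w = 1.0 / (d[idx] + 1e-6)                        # distance-weighted interpolation
        return np.average(np.stack(self.values)[idx], axis=0, weights=w)

    def forget(self):
        self.strength = [s * self.decay for s in self.strength]
        keep = [i for i, s in enumerate(self.strength) if s > 0.1]
        self.keys = [self.keys[i] for i in keep]
        self.values = [self.values[i] for i in keep]
        self.strength = [self.strength[i] for i in keep]

emm = EpisodicMemoryModule()
for _ in range(50):                                      # toy latent rollout
    z = rng.normal(size=4)
    emm.maybe_store(z, z + 0.1, pred_error=rng.exponential(1.0))
    emm.forget()
prediction = emm.lookup(rng.normal(size=4))              # used when the world model is uncertain
```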
Poster | Biological Neurons vs Deep Reinforcement Learning: Sample efficiency in a simulated game-world | Forough Habibollahi · Moein Khajehnejad · Amitesh Gaurav · Brett J. Kagan
How do synthetic biological systems and artificial neural networks compete in their performance in a game environment? Reinforcement learning has undergone significant advances; however, it remains behind biological neural intelligence in terms of sample efficiency. Yet most biological systems are significantly more complicated than most algorithms. Here we compare the inherent intelligence of in vitro biological neuronal networks to state-of-the-art deep reinforcement learning algorithms in the arcade game 'Pong'. We employed DishBrain, a system that embodies in vitro neural networks with in silico computation using a high-density multielectrode array. We compared the learning curve and the performance of these biological systems against time-matched learning from DQN, A2C, and PPO algorithms. Agents were implemented in a reward-based environment of the 'Pong' game. Key learning characteristics of the deep reinforcement learning agents were compared with those of the biological neuronal cultures in the same game environment. We find that even these very simple biological cultures typically outperform deep reinforcement learning systems in terms of various game performance characteristics, such as the average rally length, implying a higher sample efficiency. Furthermore, the human cell cultures proved to have the overall highest relative improvement in the average number of hits in a rally when comparing the initial 5 minutes and the last 15 minutes of each designed gameplay session.
Poster | Constructing Memory: Consolidation as Teacher-Student Training of a Generative Model | Eleanor Spens · Neil Burgess
Human episodic memories are (re)constructed, share neural substrates with imagination, and show systematic biases that increase as they are consolidated into semantic memory. Here we suggest that these main features of human memory are characteristic of 'teacher-student' training of a neocortical generative model by a one-shot memory system in the hippocampal formation (HF). As we simulate with image datasets, the 'students' (variational autoencoders) in association cortex develop a compressed 'latent variable' representation of experience by learning to reconstruct replayed samples from the 'teacher' (a modern Hopfield network). Recall and imagination require these representations to be decoded into sensory experience by the HF and return projections to sensory cortex, whereas semantic memory and inference rely directly on the latent variables without requiring the HF.