Workshop
Attributing Model Behavior at Scale (ATTRIB)
Tolga Bolukbasi · Logan Engstrom · Kelvin Guu · Andrew Ilyas · Sam Park · Ellie Pavlick · Anders Søgaard
Room 271 - 273
Recently developed algorithmic innovations (e.g., transformers, diffusion models) and large-scale datasets (e.g., Common Crawl, LAION) have given rise to machine learning models with impressive capabilities. However, much remains to be understood about how these different factors combine to give rise to observed behaviors. For example, we still do not fully understand how the composition of training datasets influences downstream model capabilities (e.g., which data sources within LAION-5B are important for training high-quality CLIP embeddings?), how to attribute model capabilities to subcomponents inside the model (e.g., can we identify which subnetwork of an LLM implements addition?), and which algorithmic choices really drive performance (e.g., is RL necessary to align language models?). A common theme underlying all these challenges is model behavior attribution: the need to tie model behavior back to factors in the machine learning pipeline---such as the choice of training dataset or particular training algorithm---that we can control or reason about. This workshop aims to bring together researchers and practitioners who are advancing our understanding of model behavior attribution across three contexts: data, models, and learning algorithms.
Schedule
Fri 6:45 a.m. - 7:00 a.m. | Remarks
Welcome and Opening Remarks

Fri 7:00 a.m. - 7:30 a.m. | In-person presentation
Data attribution for LMMs and beyond (James Zou)
I will discuss DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. In applications to generative models such as Llama-2 and Stable Diffusion, DataInf effectively identifies the most influential fine-tuning examples and is substantially faster than previous methods. Moreover, it can help to identify which data points are mislabeled.

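As a rough illustration of the closed-form inverse-Hessian idea that DataInf-style estimators build on, here is a hedged numpy sketch; the function name, the damping term `lam`, and the sign convention are our own choices, not the paper's exact estimator.

```python
import numpy as np

def datainf_influence(train_grads, val_grad, lam=0.1):
    """Approximate influence of each training point on a validation loss.

    Hedged sketch: each per-sample rank-one Hessian estimate
    lam*I + g g^T is inverted in closed form (Sherman-Morrison) and the
    inverses are averaged, following the general DataInf recipe.
    `train_grads`: (n, d) per-sample gradients (e.g., of LoRA parameters).
    `val_grad`: (d,) gradient of the validation loss.
    """
    dots = train_grads @ val_grad                      # (n,)
    norms = np.sum(train_grads ** 2, axis=1)           # (n,)
    # H^{-1} v, averaged over the per-sample rank-one inverses.
    hinv_v = (val_grad[None, :] - (dots / (lam + norms))[:, None] * train_grads) / lam
    hinv_v = hinv_v.mean(axis=0)                       # (d,)
    # More negative score => removing the point should raise validation loss.
    return -train_grads @ hinv_v
```
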
Fri 7:30 a.m. - 8:00 a.m. | In-person presentation
What does scale give us: Why we are building a ladder to the moon (Sara Hooker)
A talk about what we know about the role of scale in conferring valuable generalization properties. I will present some background, some of our work on understanding the role of scale (of both data and model size), and some thoughts about how we can get away from the painfully inefficient formula of simply scaling capacity.

Fri 8:00 a.m. - 8:30 a.m. | Break
Coffee Break and Posters

Fri 8:30 a.m. - 9:05 a.m. | Contributed Talk
Contributed papers (4 presentations)
Elan Rosenfeld · Rhys Gould · Nicholas Konz · Theodora Worledge

Fri 9:05 a.m. - 9:50 a.m. | Discussion Panel
The Future of Attribution in ML (Panel)

Fri 9:50 a.m. - 11:00 a.m. | Break
Lunch

Fri 11:00 a.m. - 12:00 p.m. | Poster Session
Poster Session #1

Fri 12:00 p.m. - 12:30 p.m. | In-person presentation
What Neural Networks Memorize and Why (Vitaly Feldman)
Deep learning algorithms tend to fit the entire training dataset (nearly) perfectly, including mislabeled examples and outliers. In addition, in extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). We provide a simple conceptual explanation and a theoretical model demonstrating that memorization of labels is necessary for achieving close-to-optimal generalization error when learning from long-tailed data distributions. We also describe natural prediction problems for which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when most of that information is ultimately irrelevant to the task at hand. Finally, we demonstrate the utility of memorization and support our explanation empirically. These results rely on a new technique for efficiently estimating memorization and influence of training data points.

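For reference, the label-memorization quantity estimated in this line of work is the standard leave-one-out definition (Feldman, 2020): writing $A(S)$ for the learning algorithm applied to training set $S$, and $S^{\setminus i}$ for $S$ with example $i$ removed, $\mathrm{mem}(A, S, i) = \Pr_{h \sim A(S)}[h(x_i) = y_i] - \Pr_{h \sim A(S^{\setminus i})}[h(x_i) = y_i]$; influence on a test point is defined analogously. The efficient estimator mentioned in the talk approximates these probabilities without retraining once per example, but its details are not reproduced here.
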
Fri 12:30 p.m. - 1:00 p.m. | In-person presentation
Evaluation Beyond Task Performance (Milad Nasr)
As we increasingly release and productionize machine learning models, we focus primarily on their performance on a suite of downstream benchmarking tasks. However, improved performance on these benchmarks does not equate to universal improvement. In this talk, we discuss evaluations that live on an entirely separate axis. In particular, we show that as models get larger, more memorized training examples appear in model outputs. These issues are not random artifacts: they are neither solved by scaling models nor easily prevented in production models.

Fri 1:00 p.m. - 2:00 p.m. | Poster Session
Poster Session #2

Fri 1:00 p.m. - 1:30 p.m. | Break
Coffee Break and Posters

Fri 2:00 p.m. - 2:30 p.m. | In-person presentation
Understanding LLMs via their Generative Successes and Shortcomings (Swabha Swayamdipta)
Generative capabilities of large language models have grown beyond the wildest imagination of the broader AI research community, leading many to speculate whether these successes may be attributed to the training data or to other factors concerning the model. At the same time, however, LLMs continue to exhibit many shortcomings, which might contain important clues for understanding their behavior as well as for attribution. I will present some work from my group that has revealed unique successes and shortcomings in the generative capabilities of LLMs on knowledge-oriented tasks, tasks with human and social utility, and tasks that reveal more than surface-level understanding of language. I will end with a brief discussion of the implications for attribution in the peculiar domain that natural language occupies.

Fri 2:30 p.m. - 3:00 p.m. | In-person presentation
Talk by Sanjeev Arora
Abstract TBD

Fri 3:00 p.m. - 3:30 p.m. | Poster Session
Poster Session #3 & Closing Remarks

Irreducible Curriculum for Language Model Pretraining | Poster
Automatic data selection and curriculum design for training large language models is challenging, with only a few existing methods showing improvements over standard training. Furthermore, current schemes focus on domain-level selection, overlooking the more fine-grained contributions of each individual training point. It is difficult to apply traditional data-point selection methods to large language models: most online batch selection methods perform forward or backward passes twice, which introduces considerable extra cost with large-scale models. To mitigate these obstacles, we propose the irreducible curriculum, a curriculum learning algorithm for language model pretraining that prioritizes samples with higher learnability. Specifically, to avoid prohibitive extra computation overhead, we simulate the sample loss along the main model's training trajectory using a small-scale proxy model. Our experiments on the RedPajama-1B dataset demonstrate a consistent improvement in validation perplexity across all 7 domains compared to a random uniform baseline and the anti-curriculum strategy. Our method also reduces the sharpness of the network and achieves better 5-shot accuracy on the MMLU benchmark.
Simin Fan · Martin Jaggi

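As a loose illustration of the proxy-model scoring idea (not the paper's exact rule), the sketch below ranks candidate sequences by a small proxy model's loss, standing in for the main model's loss along its trajectory; the HuggingFace-style `.logits` interface and the top-k selection rule are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_sample_lm_loss(model, ids):
    # Mean next-token cross-entropy per sequence (HF-style `.logits` assumed).
    logits = model(ids).logits[:, :-1]
    targets = ids[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    return loss.view(ids.size(0), -1).mean(dim=1)

def select_batch(proxy_model, candidate_ids, k):
    """Pick the k candidates the proxy marks as most learnable.

    Illustrative only: "learnability" here is simply the cheap proxy
    model's current loss; the paper's exact score and schedule may differ.
    """
    scores = per_sample_lm_loss(proxy_model, candidate_ids)
    return candidate_ids[torch.topk(scores, k).indices]
```
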
Evaluating the Utility of Model Explanations for Model Development | Poster
One of the motivations for explainable AI is to allow humans to make better and more informed decisions regarding the use and deployment of AI models. But careful evaluations are needed to assess whether this expectation has been fulfilled. Current evaluations mainly focus on algorithmic properties of explanations, and those that involve human subjects often employ subjective questions to test humans' perception of explanation usefulness, without being grounded in objective metrics and measurements. In this work, we evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development. We conduct a mixed-methods user study involving image data to evaluate saliency maps generated by SmoothGrad, GradCAM, and an oracle explanation on two tasks: model selection and counterfactual simulation. To our surprise, we did not find evidence of significant improvement on these tasks when users were provided with any of the saliency maps, even the synthetic oracle explanation designed to be simple to understand and highly indicative of the answer. Nonetheless, explanations did help users more accurately describe the models. These findings suggest caution regarding the usefulness of, and the potential for misunderstanding in, saliency-based explanations.
Shawn Im · Jacob Andreas · Yilun Zhou

Why do landscape diagnostics matter? Pinpointing the failure mode of generalization | Poster
Conventional validation-based and learning-curve-based methods are widely applied for model selection and hyperparameter tuning. In this paper, we consider a novel framework of "model diagnostics" to extend these approaches, where a practitioner wants to determine the best way of using a given budget to either collect more data, purchase a larger model, or conduct more careful hyperparameter tuning. We apply our framework to multiple transfer learning scenarios, including tuning on models trained with small data while transferring the tuning decisions to large data, and tuning on clean data while transferring the decisions to noisy data. We experimentally demonstrate that generalization measures, especially those motivated by studying the loss landscape of neural networks, play a crucial role in improving model diagnostic performance compared to classical validation-based and learning-curve-based methods.
Yefan Zhou · Jianlong Chen · Qinxue Cao · Konstantin Schürholt · Yaoqing Yang

The Importance of Prompt Tuning for Automated Neuron Explanations | Poster
Recent advances have greatly increased the capabilities of large language models (LLMs), but our understanding of the models and their safety has not progressed as fast. In this paper we aim to understand LLMs more deeply by studying their individual neurons. We build upon previous work showing that large language models such as GPT-4 can be useful in explaining what each neuron in a language model does. Specifically, we analyze the effect of the prompt used to generate explanations and show that reformatting the explanation prompt in a more natural way can significantly improve neuron explanation quality and greatly reduce computational cost. We demonstrate the effects of our new prompts in three different ways, incorporating both automated and human evaluations.
Justin Lee · Tuomas Oikarinen · Arjun Chatha · Keng-Chi Chang · Yilan Chen · Lily Weng

Copy Suppression: Comprehensively Understanding an Attention Head | Poster
We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7) suppresses naive copying behavior, which improves overall model calibration. This explains why multiple prior works studying certain narrow tasks found negative heads that systematically favored the wrong answer. We uncover the mechanism that the negative heads use for copy suppression with weights-based evidence and are able to explain 76.9% of the impact of L10H7 in GPT-2 Small. To the best of our knowledge, this is the most comprehensive description of the complete role of a component in a language model to date. Interactive visualizations of the copy suppression phenomenon are available at our web app: https://copy-suppression.streamlit.app/
Callum McDougall · Arthur Conmy · Cody Rushing · Tom McGrath · Neel Nanda

Does It Know?: Probing and Benchmarking Uncertainty in Language Model Latent Beliefs | Poster
Understanding a language model's beliefs about its own truthfulness is crucial for building more trustworthy, factually accurate large language models. The recent method of Contrast-Consistent Search (CCS) measures this "latent belief" via a linear probe on intermediate activations of a language model, trained in an unsupervised manner to classify inputs as true or false. As an extension of CCS, we propose Uncertainty-detecting CCS (UCCS), which captures finer-grained notions of truth, such as uncertainty or ambiguity. Concretely, UCCS teaches a probe, using only unlabeled data, to classify a model's latent belief about input text as true, false, or uncertain. We find that UCCS is an effective unsupervised selective classifier, using its uncertainty class to filter out low-confidence truth predictions, leading to improved accuracy across a diverse set of models and tasks. To properly evaluate UCCS predictions of truth and uncertainty, we introduce a toy dataset, named Temporally Measured Events (TYMES), which comprises true or falsified facts, paired with timestamps, extracted from recent news articles from the past several years. TYMES can be combined with any language model's training cutoff date to systematically produce a subset of data beyond (literally, occurring after) the knowledge limitations of the model. TYMES serves as a valuable proof of concept for how we can benchmark uncertainty or time-sensitive world knowledge in language models, a setting which includes but extends beyond our UCCS evaluations.
Brian Huang · Joe Kwon

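For context, the base CCS objective that UCCS extends can be sketched in a few lines (following Burns et al., 2022); the uncertain third class that UCCS introduces is the paper's contribution and is deliberately not reproduced here.

```python
import torch

def ccs_loss(probe, acts_pos, acts_neg):
    """Unsupervised CCS objective over contrast pairs: activations for
    "X is true" vs. "X is false" phrasings of the same statements.
    `probe` maps activations to probabilities in (0, 1).
    """
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = (p_pos - (1 - p_neg)) ** 2        # p(true) + p(false) should be ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the trivial 0.5 answer
    return (consistency + confidence).mean()
```
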
Attribution Patching Outperforms Automated Circuit Discovery | Poster
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that, averaged over all tasks, our method achieves greater AUC on circuit recovery than other methods.
Aaquib Syed · Can Rager · Arthur Conmy

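A self-contained toy version of the linear approximation described here, with a two-layer MLP standing in for a transformer and a sum-of-outputs stand-in metric; real uses would cache per-head or per-edge activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)
acts = {}

def save(name):
    def hook(mod, inp, out):
        if out.requires_grad:
            out.retain_grad()  # so .grad is populated on the clean run
        acts[name] = out
    return hook

handles = [model[i].register_forward_hook(save(f"layer{i}")) for i in (0, 2)]

# Forward pass 1 (clean) plus the single backward pass on the metric.
model(clean_x).sum().backward()
clean_acts = {k: v.detach() for k, v in acts.items()}
grads = {k: v.grad for k, v in acts.items()}

# Forward pass 2 (corrupted); no backward pass needed.
with torch.no_grad():
    model(corrupt_x)
corrupt_acts = {k: v.detach() for k, v in acts.items()}

# First-order estimate of patching each activation:
# delta_metric ~= (a_corrupt - a_clean) . (d metric / d a)
scores = {k: ((corrupt_acts[k] - clean_acts[k]) * grads[k]).sum().item()
          for k in grads}
for h in handles:
    h.remove()
print(scores)
```
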
On the Support Vector Effect in DNNs: Rethinking Last Layer Sensitivity-based Instance Attribution | Poster
As complex predictive models gain popularity, the need for effective explanation techniques has also increased. A line of research is dedicated to instance attribution, which attempts to select training samples that the model capitalized on to make a given test prediction. Many existing methods employing sensitivity-based techniques have been shown to be unreliable on large deep networks, and are often costly at runtime. We rigorously uncover SVM-like behavior in DNNs, which we term the support vector effect (SVE). We use SVE to analyze the limitations of sensitivity-based instance attribution methods, revealing their propensity to behave as class-level methods rather than fulfilling their intended role as instance-level ones. We thus advocate for reconsidering similarity-based methods, and propose a simple yet highly effective alternative: using the prediction itself as the explanation.
Syed Hasan Amin Mahmood · Rajiv Khanna

Training Dynamics of Contextual N-Grams in Language Models | Poster
Prior work has shown the existence of contextual neurons in language models, including a neuron which activates on text that is in German. We show that one role of this neuron is to unlock what we call contextual n-grams: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active. We investigate the formation of this circuit throughout training and find that it is an example of what we call a hierarchical feature. Both the n-grams and the context neuron form independently early in training---the German neuron partially through boosting German unigram statistics, and the n-grams by boosting relevant tokens. Only after both features have already been formed do they fit together in the circuit. Contrary to the hypotheses presented in prior work, we find that the circuits of contextual n-grams and of the contextual neuron itself form gradually rather than in a sudden phase transition. We further present a range of anomalous observations such as a simultaneous phase transition in many tasks coinciding with the learning rate warmup, and evidence that many context neurons form simultaneously early in training, with most later unlearned.
Lucia Quirke · Lovis Heindrich · Wes Gurnee · Neel Nanda

SPADE: Sparsity-Guided Debugging for Deep Neural Networks | Poster
Interpretability, broadly defined as mechanisms for understanding why and how machine learning models reach their decisions, is one of the key open goals at the intersection of deep learning theory and practice. Towards this goal, multiple tools have been proposed to aid a human examiner in reasoning about a network's behavior in general or on a set of instances. However, the outputs of these tools---such as input saliency maps or neuron visualizations---are frequently difficult for a human to interpret, or even misleading, due, in particular, to the fact that neurons can be multifaceted, i.e., a single neuron can be associated with multiple distinct feature combinations. In this paper, we present a new general approach to address this problem, called SPADE, which, given a trained model and a target sample, uses sample-targeted pruning to provide a "trace" of the network's execution on the sample, reducing the network to the connections that are most relevant to the specific prediction. We demonstrate that preprocessing with SPADE significantly increases both the accuracy of image saliency maps across several interpretability methods and the usefulness of neuron visualizations, aiding humans in reasoning about network behavior. Our findings show that sample-specific pruning of connections can disentangle multifaceted neurons, leading to consistently improved interpretability.
Arshia Soltani Moakhar · Eugenia Iofinova · Dan Alistarh

In Search of a Data Transformation that Accelerates Neural Field Training | Poster
A neural field is a special type of neural network that represents a single datum. We study whether we can speed up the training of such networks by fitting a transformed version of the target datum; one can recover the original signal by inverting the signal represented by the trained neural field. We empirically find that very simple data transformations, such as color inversion or random pixel shuffling, can substantially speed up or slow down training. In particular, to our surprise, we observe that an image with randomly shuffled pixels can be fit much faster, despite having very high frequency content.
Junwon Seo · Sangyoon Lee · Jaeho Lee

Automatic Discovery of Visual Circuits | Poster
To date, most discoveries of subnetworks that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies a particular capability. In this paper, we formulate capabilities as mappings of human-interpretable visual concepts to intermediate feature representations. We introduce a new method for identifying these subnetworks: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
Achyuta Rajaram · Neil Chowdhury · Antonio Torralba · Jacob Andreas · Sarah Schwettmann

Mining the Diamond Miner: Mechanistic Interpretability on the Video PreTraining Agent | Poster
Although decision-making systems based on reinforcement learning (RL) can be widely used in a variety of applications, their lack of interpretability raises concerns, especially in high-stakes scenarios. In contrast, Mechanistic Interpretability (MI) has shown potential in breaking down complex deep neural networks into understandable components in language and vision tasks. Accordingly, in this study, we apply MI to understand the behavior of a Video PreTraining (VPT) agent, exhibiting human-level proficiency in numerous Minecraft tasks. Our exploration is centered on the task of diamond mining and its associated subtasks, such as crafting wooden logs and iron pickaxes. By employing circuit analysis, we aim to decode the network's representation of these tasks and subtasks. We find a significant head in the VPT model encoding for an attacking action, although its ablation doesn't markedly affect the agent's performance. Our findings indicate that this approach can provide useful insights into the agent's behavior.
Sonia Joseph · Artem Zholus · Mohammad Reza Samsami · Blake Richards

Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation (Workshop Version) | Poster
Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods today. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate a DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves performance comparable to KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.
Jiachen (Tianhao) Wang · Yuqing Zhu · Yu-Xiang Wang · Ruoxi Jia · Prateek Mittal

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study | Poster
We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.
Karolis Ramanauskas · Özgür Şimşek

Adversarial Attacks on Neuron Interpretation via Activation Maximization | Poster
Feature visualization is one of the most popular techniques used to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, these techniques consist of finding synthetic or natural inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of fine-tuning a pre-trained model with a specifically introduced loss that aims to maintain model performance while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the ImageNet classification task.
Alex Fulleringer · Geraldin Nanfack · Jonathan Marty · Michael Eickenberg · Eugene Belilovsky

Divergence at the Interpolation Threshold: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle | Poster
Machine learning models misbehave, often in unexpected ways. One prominent misbehavior is when the test loss diverges at the interpolation threshold, perhaps best known from its distinctive appearance in double descent. While considerable theoretical effort has gone into understanding generalization of overparameterized models, less effort has been made at understanding why the test loss misbehaves at the interpolation threshold. Moreover, analytically solvable models in this area employ a range of assumptions and use complex techniques from random matrix theory, statistical mechanics, and kernel methods, making it difficult to assess when and why test error might diverge. In this work, we analytically study the simplest supervised model - ordinary linear regression - and show intuitively and rigorously when and why a divergence occurs at the interpolation threshold using basic linear algebra. We identify three interpretable factors that, when all present, cause the divergence. We demonstrate on real data that linear models' test losses diverge at the interpolation threshold and that the divergence disappears when we ablate any one of the three identified factors. We conclude with insights on recent discoveries in nonlinear models regarding superposition and double descent.
Rylan Schaeffer · Zachary Robertson · Akhilan Boopathy · Mikail Khona · Ila Fiete · Andrey Gromov · Sanmi Koyejo

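A minimal numpy illustration of the phenomenon described: the test loss of minimum-norm ordinary least squares spikes as the sample count n crosses the parameter count d (dimensions and noise level below are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_test, sigma = 40, 2000, 0.5
w_star = rng.normal(size=d)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_star

for n in [10, 20, 35, 40, 45, 80, 200]:
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.pinv(X) @ y            # minimum-norm least squares
    err = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"n={n:4d}  test MSE={err:10.3f}")  # peaks sharply near n = d
```
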
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" | Poster
We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?". Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if "A is B" occurs, "B is A" is more likely to occur). We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.
Lukas Berglund · Meg Tong · Maximilian Kaufmann · Mikita Balesni · Asa Cooper Stickland · Tomasz Korbak · Owain Evans

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Poster
Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: (1) visualizations of LLM true/false statement representations, which reveal clear linear structure; (2) transfer experiments in which probes trained on one dataset generalize to different datasets; and (3) causal evidence obtained by surgically intervening in an LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
Samuel Marks · Max Tegmark

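A minimal sketch of the mass-mean probe introduced here, as we read it: the probe direction is the difference of class means over curated true/false statements, and a statement is scored by projection (the bias/threshold handling below is our own simplification).

```python
import torch

def mass_mean_direction(acts_true, acts_false):
    """Difference of class means over activations of labeled statements.

    `acts_true`/`acts_false`: (n, d) residual-stream activations for
    statements known to be true/false. Returns a (d,) probe direction.
    """
    return acts_true.mean(dim=0) - acts_false.mean(dim=0)

def predict_true(acts, theta, bias=0.0):
    # Positive projection onto the direction => classified as "true".
    return acts @ theta > bias
```
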
Language Models Linearly Represent Sentiment | Poster
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks, with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real-world datasets such as the Stanford Sentiment Treebank. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.
Curt Tigges · Oskar John Hollinsworth · Atticus Geiger · Neel Nanda

Efficient Data Valuation for Weighted Nearest Neighbor Algorithms | Poster
Data Shapley is a principled way to assess the importance of individual training data sources for machine learning (ML) applications. However, it often comes with computational challenges in calculating exact Data Shapley scores. KNN-Shapley (Jia et al., 2019), which assigns data value leveraging the efficiently computable Data Shapley score of $K$ nearest neighbors (KNN), has gained popularity as a viable alternative due to its computationally efficient nature. However, Jia et al. (2019) only give a practical algorithm for computing Data Shapley for unweighted KNN, while weighted KNN is more prevalently used in practice. This work addresses the computational challenges of calculating the exact Data Shapley for weighted KNN classifiers (WKNN-Shapley). By making small adjustments to KNN configurations, we recast the computation of WKNN-Shapley into a counting problem and introduce an $O(K^2 N^2)$ algorithm, a notable improvement over the naive, impractical $O(N^K)$ algorithm. We also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. These advancements position WKNN-Shapley as a compelling alternative to KNN-Shapley. In particular, WKNN-Shapley can select high-quality data points and improve the performance of retrieval-augmented language models.
Jiachen (Tianhao) Wang · Ruoxi Jia

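For background, the unweighted KNN-Shapley recursion from Jia et al. (2019) that this work generalizes can be written directly; a hedged numpy sketch for a single validation point, with indexing conventions adapted from the paper as we recall them (average over a validation set in practice).

```python
import numpy as np

def unweighted_knn_shapley(dist_to_val, y_train, y_val, K):
    """Exact Data Shapley values for an unweighted KNN classifier."""
    N = len(y_train)
    order = np.argsort(dist_to_val)           # alpha_1 (closest) ... alpha_N
    match = (y_train[order] == y_val).astype(float)
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N               # farthest point
    for i in range(N - 2, -1, -1):            # recurse toward the nearest point
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    values = np.empty(N)
    values[order] = s                         # undo the distance sort
    return values
```
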
How do language models bind entities in context? | Poster
Language models (LMs) can recall facts mentioned in context, as shown by their performance on reading comprehension tasks. When the context describes facts about more than one entity, the LM has to correctly bind attributes to their corresponding entity. We show, via causal experiments, that LMs' internal activations represent binding information by exhibiting appropriate binding ID vectors at the entity and attribute positions. We further show that binding ID vectors form a subspace and often transfer across tasks. Our results demonstrate that LMs learn interpretable strategies for representing symbolic knowledge in context, and that studying context activations is a fruitful direction for understanding LM cognition.
Jiahai Feng · Jacob Steinhardt

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | Poster
Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to both manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention modifies end-to-end model behavior in the desired way, this effect may be achieved by activating a dormant parallel pathway leveraging a component that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example and in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. Finally, we remark on what a success case of subspace activation patching looks like.
Aleksandar Makelov · Georg Lange · Atticus Geiger · Neel Nanda

Object Detection in Deep Neural Networks Differs from Humans in the Periphery | Poster
To understand how strategies used by object detection models compare to those in human vision, we simulate peripheral vision in object detection models at the input stage. We collect human data on object change detection in the periphery and compare it to detection models with a simulated periphery. We find that unlike humans, models are highly sensitive to the texture-like transformation in peripheral vision. Not only do models under-perform compared to humans, they do not follow the same clutter effects as humans even when fixing the model task to closely mimic the human one. Training on peripheral input boosts performance on the change detection task, but appears to aid object localization in the periphery much more than object identification. This suggests that human-like performance is not attributable to input data alone, and to fully address the differences we see in human and model detection, farther downstream changes may be necessary. In the future, improving alignment between object detection models and human representations could help us build models with more human-explainable detection strategies.
Anne Harrington · Vasha DuTell · Mark Hamilton · Ayush Tewari · Simon Stent · Bill Freeman · Ruth Rosenholtz

Risk Aversion of Online Learning Algorithms | Poster
We study a novel bias in online decision-making: emergent risk aversion. When presented with actions of the same expectation, $\varepsilon$-greedy chooses the lower-variance action with probability approaching one. The Upper Confidence Bound algorithm avoids this by debiasing its estimates of arm rewards. Risk aversion shapes arm choices in finite time, as we show in experiments.
Andreas Haupt · Aroon Narayanan

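An illustrative simulation of the setting studied (two arms with equal means but unequal variances); the horizon, epsilon, and variances below are arbitrary choices, and the fraction of low-variance pulls is the quantity the paper's claim concerns.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(T=5000, eps=0.05):
    """Epsilon-greedy on two arms: equal mean 0, std 0.1 vs. std 1.0."""
    sds = np.array([0.1, 1.0])
    counts, sums = np.zeros(2), np.zeros(2)
    pulls_low = 0
    for _ in range(T):
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(2))          # explore (or force first pulls)
        else:
            a = int(np.argmax(sums / counts)) # exploit empirical means
        r = rng.normal(0.0, sds[a])
        counts[a] += 1
        sums[a] += r
        pulls_low += (a == 0)
    return pulls_low / T

# Averaged over runs; a value well above 0.5 reflects the risk aversion described.
print(np.mean([run() for _ in range(20)]))
```
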
Tell, Don't Show: Internalized Reasoning influences how LLMs generalize | Poster
We explore how declarative statements in training data influence a language model's generalization. For example, suppose a model is trained on both weather reports up to 2023 and declarative statements about climate change. When prompted to generate weather reports for 2050, will this model incorporate the facts about climate change or simply match the statistics of the previous reports? To investigate this question, we finetune language models on a mix of declarative and non-declarative information and test how the former affects generalization. We find that declarative information has a clear and systematic effect on model predictions, consistent across model families (GPT-3 and Llama-2) and across two domains: predicting weather and demographic features. Through a series of ablations, we show that this effect cannot be explained by simple associative learning (i.e. matching words in the prompt to words in declarative statements).
Alexander Meinke · Owain Evans

Formal Definition of Fingerprints Improves Attribution of Generative Models | Poster
Recent works have shown that generative models leave traces of their underlying generative process on the generated samples, broadly referred to as fingerprints of a generative model, and have studied their utility in detecting synthetic images from real ones. However, the extent to which these fingerprints can distinguish between various types of synthetic images and help identify the underlying generative process remains under-explored. In particular, the very definition of a fingerprint remains unclear, to our knowledge. To that end, in this work, we formalize the definition of artifact and fingerprint in generative models, propose an algorithm for computing them in practice, and finally study how different design parameters affect the model fingerprints and their attributability. We find that using our proposed definition can significantly improve performance on the task of identifying the underlying generative process from samples (model attribution) compared to existing methods. Additionally, we study the structure of the fingerprints and observe that it is very predictive of the effect of different design choices on the generative process.
Hae Jin Song · Mahyar Khayatkhoei · Wael Abd-Almageed

Attributing Learned Concepts in Neural Networks to Training Data | Oral
By now there is substantial evidence that deep learning models learn certain human-interpretable features as part of their internal representations of data. As having the right (or wrong) concepts is critical to trustworthy machine learning systems, it is natural to ask which inputs from the model's original training set were most important for learning a concept at a given layer. To answer this, we combine data attribution methods with methods for probing the concepts learned by a model. Training network and probe ensembles for two concept datasets on a range of network layers, we use the recently developed TRAK method for large-scale data attribution. We find some evidence for convergence, where removing the 10,000 top attributing images for a concept and retraining the model does not change the location of the concept in the network nor the probing sparsity of the concept. This suggests that rather than being highly dependent on a few specific examples, the features that inform the development of a concept are spread in a more diffuse manner across its exemplars, implying robustness in concept formation.
Nicholas Konz · Charles Godfrey · Madelyn Shapiro · Jonathan Tu · Henry Kvinge · Davis Brown

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale | Poster
Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. To date, efforts to prune these datasets down to higher-quality subsets have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data, namely perplexity, the Error L2-Norm, and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. We find that perplexity outperforms the other scoring methods and improves over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets a foundation for strategies in automatically curating high-quality corpora and suggests that large amounts of pretraining data can be removed while retaining performance.
Max Marion · Ahmet Üstün · Luiza A Pozzobon · Alex Wang · Marzieh Fadaee · Sara Hooker

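A hedged sketch of the perplexity-based ranking described, with GPT-2 standing in for the reference model; which quantile of the ranking to keep is itself a design choice the paper studies, and the low-perplexity cut below is just one option.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    # Mean next-token negative log-likelihood, exponentiated.
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    return float(torch.exp(lm(ids, labels=ids).loss))

def prune(corpus, keep_frac=0.3):
    """Rank documents by reference-model perplexity and keep a fraction.

    Keeping the lowest-perplexity documents is one of several selection
    criteria one could apply; it is not claimed to be the paper's best.
    """
    scored = sorted(corpus, key=perplexity)
    return scored[: int(len(scored) * keep_frac)]
```
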
A Simple and Efficient Baseline for Data Attribution on Images | Poster
Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline that relies on the image features from a pretrained self-supervised backbone to retrieve images from the dataset. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.
Vasu Singla · Pedro Sandoval-Segura · Micah Goldblum · Jonas Geiping · Tom Goldstein

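The baseline as described reduces to nearest-neighbor retrieval in a frozen feature space; a minimal numpy sketch (which self-supervised backbone produces the features is left open, as it is in the abstract).

```python
import numpy as np

def attribute_by_retrieval(train_feats, query_feat, k=10):
    """Attribute a prediction to the k most similar training images.

    `train_feats`: (n, d) features of training images from a frozen
    pretrained backbone; `query_feat`: (d,) feature of the test image.
    Returns the indices and cosine similarities of the top-k matches.
    """
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feat / np.linalg.norm(query_feat)
    sims = a @ b                              # cosine similarity per training image
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```
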
Shapley Interactions for Complex Feature Attribution | Poster
Feature interaction is an established approach to understanding complex patterns of attribution in many models. In this paper, we use Shapley Taylor interaction indices (STII) to analyze how linguistic structure influences language model output in masked and auto-regressive language models (MLMs and ALMs). We find that ALMs, and to a lesser degree MLMs, tend to combine pairs of tokens with more nonlinear interactions if they co-occur in the same idiomatic multiword expression. We also find that while ALMs tend to become more linear in their interactions at greater positional distances, in MLMs this linearity is scaled by syntactic distance, implying that the learned structure in MLMs relies more on syntax than on the recency-based structure favored natively by ALMs.
Divyansh Singhvi · Andrej Erkelens · Raghav Jain · Diganta Misra · Naomi Saphra

Sparse Autoencoders Find Highly Interpretable Features in Language Models | Poster
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
Hoagy Cunningham · Aidan Ewart · Logan Smith · Robert Huben · Lee Sharkey

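A minimal sketch of the sparse autoencoder setup described; details such as weight tying, decoder normalization, and dead-feature resampling vary across implementations and are omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose model activations into sparsely activating features."""

    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)   # overcomplete dictionary (d_dict > d_act)
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.enc(x))           # sparse, non-negative feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```
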
Successor Heads: Recurring, Interpretable Attention Heads In The Wild | Oral
In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment ‘Monday’ into ‘Tuesday’. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of ‘mod 10’ features that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
Rhys Gould · Euan Ong · George Ogden · Arthur Conmy

Exploring Dataset-Scale Indicators of Data Quality | Poster
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.
Benjamin Feuer · Chinmay Hegde

Self-Select: Optimizing Instruction Selection for Large Language Models | Poster
The same question can often be presented in different ways, depending on the audience and the intent with which it is being posed. To determine whether large language models (LLMs) demonstrate preferences for one phrasing over another regardless of semantic content, we introduce Self-Select, a method for selecting a preferred instruction template and generating high-quality synthetic data samples. This algorithm makes use of a meta-prompt to decide on an instruction template, given a task and candidate templates, and then generates $n$ new samples using the chosen template. We evaluate Self-Select on numerical reasoning and sentiment classification tasks, using a variety of instruction-tuned and base models, providing insights into their abilities and biases. We find that permuting the instruction template ordering in the prompt leads to vastly different choice distributions, suggesting that selections of a specific template can be attributed to inductive biases rather than semantic understanding, even after instruction tuning.
Alexander Kyimpopkin · Keshav Ramji

Speculative Behavior: An Approach to Large Language Model Evaluation and Optimization | Poster
Trained Large Language Models (LLMs) have gained significant interest due to their ability to interpret natural language instructions and address a wide range of tasks with high proficiency. However, in practice, these models pose multiple challenges. On one hand, it is exceedingly difficult to control and ensure that a model's behavior remains consistent, harmless, and safe. On the other hand, the most advanced models are delivered via APIs as black-box services, making it challenging to guarantee their proper behavior. Addressing these challenges has become an urgent concern, especially in environments where a model's response can impact safety and trustworthiness. Many recent studies focus on the evaluation of models using benchmarks based on community-curated datasets. However, this form of evaluation is prone to data leakage and premature dataset obsolescence. Moreover, it doesn't necessarily align with all the specific goals that may be desired. One alternative for aligning specific objectives with model behavior is fine-tuning, but this process is time-consuming and might be prohibitively expensive for many organizations. In this study, we propose measuring a model's behavior towards specific objectives through the concept of Speculative Behavior Equivalence (SBE). We introduce a general, model-agnostic approach that can be adapted to various models and tailored to the unique metrics of individual cases while remaining constrained to specific budgets. Additionally, we formulate the Speculative Behavior-Based Optimization problem (CSBO), which presents an opportunity to leverage AutoML techniques in the field of LLMs for optimizing behavior.
Hernan C. Vazquez · Jorge Sánchez · Rafael Carrascosa

Unifying Corroborative and Contributive Attributions in Large Language Models | Oral
As businesses, products, and services spring up around large language models, the trustworthiness of these models hinges on the verifiability of their outputs. However, methods for explaining language model outputs largely fall across two distinct fields of study which both use the term "attribution" to refer to entirely separate techniques: citation generation and training data attribution. In many modern applications, such as legal document generation and medical question answering, both types of attributions are important. In this work, we argue for and present a unified framework of large language model attributions. We show how existing methods of different types of attribution fall under the unified framework. We also use the framework to discuss real-world use cases where one or both types of attributions are required. We believe that this unified framework will guide the use case driven development of systems that leverage both types of attribution, as well as the standardization of their evaluation.
Theodora Worledge · Judy Hanwen Shen · Nicole Meister · Caleb Winston · Carlos Guestrin

Algorithm Selection with Priority Order for Instances | Poster
Reliability in medical image diagnostics is a required trait for any artificial system. Currently, most approaches rely on highly trained and specific models to leverage the feature quality learned from a particular type of medium, such as X-rays, NMR, PET scans, and others. While this approach aligns with the standard human expert perspective, it also limits artificial systems to the representations learned from the dataset distribution. To gain a better understanding of how different media affect specific tasks, we explore task-specific feature transfer between domains. In this work, we propose the possibility of merging features from various areas to harness feature transfer in outlier cases. For this purpose, we develop an Algorithm Selection (AS) method that chooses among algorithms trained on different sets of medical images and for different classification tasks; the AS system is then applied to a different classification task. AS refers to the family of methods that, given a problem and a range of existing algorithms, selects the best algorithm on a case-by-case basis. The results demonstrate the advantages of incorporating algorithms from different tasks and datasets in a supervised manner. By considering algorithms trained on diverse datasets, we can effectively capture outliers that might otherwise be neglected by more specific algorithms.
Zhamilya Saparova · Martin Lukac

Better than Balancing: Debiasing through Data Attribution | Poster
Spurious correlations in the training data can cause serious problems for machine learning deployment. However, common debiasing approaches which intervene on the training procedure (e.g., by adjusting the loss) can be especially sensitive to regularization and hyperparameter selection. In this paper, we advocate for a data-based perspective on model debiasing by directly targeting the root causes of the bias within the training data itself. Specifically, we leverage data attribution techniques to isolate specific examples that disproportionately drive reliance on the spurious correlation. We find that removing these training examples can efficiently debias the final classifier. Moreover, our method requires no additional hyperparameters, and does not require group annotations for the training data.
Saachi Jain · Kimia Hamidieh · Kristian Georgiev · Marzyeh Ghassemi · Aleksander Madry

Prototype Generation: Robust Feature Visualisation for Data Independent Interpretability | Poster
We introduce Prototype Generation, a stricter and more robust form of feature visualisation for model-agnostic, data-independent interpretability of image classification models. We demonstrate its ability to generate inputs that result in natural activation paths, countering previous claims that feature visualisation algorithms are untrustworthy due to unnatural internal activations. We substantiate these claims by quantitatively measuring the similarity between the internal activations of our generated prototypes and natural images. We also demonstrate how the interpretation of generated prototypes yields important insights, highlighting spurious correlations and biases learned by models which quantitative methods over test sets cannot identify.
Arush Tagade · Jessica Rumbelow

-
|
Backtracking Mathematical Reasoning of Language Models to the Pretraining Data
(
Poster
)
>
link
In-context learning and chain-of-thought prompting have demonstrated surprising performance improvements on mathematical reasoning benchmarks.Therefore, understanding the underlying factors enabling these capabilities is crucial.However, the specific aspects of pretraining data that equip models with mathematical reasoning capabilities remain largely unexplored and are less studied systematically.In this study, we identify subsets of model pretraining data that contribute to math reasoning ability of the model, and evaluate it on several mathematical operations (e.g. addition, multiplication) and tasks (e.g. the asdiv dataset).We measure the importance of such subsets by continual training of the model on pretraining data subsets, and then we quantify the change in performance on the mathematical benchmark to assess their importance. If a subset results in an improved performance, we conjecture that such subset contributes to a model's overall mathematical ability.Our results unveil that while training on math-only data contributes to simple arithmetic abilities, it does not solely explain performance on more complex reasoning abilities like chain-of-thought reasoning. We also find that code data contributes to chain-of-thought reasoning while reducing the arithmetic performance. |
Yasaman Razeghi · Hamish Ivison · Sameer Singh · Yanai Elazar 🔗 |
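A minimal sketch of the subset-importance protocol: continue pretraining on one candidate data subset at a time, then measure the change on a mathematical benchmark relative to the base model. continue_pretraining and eval_math_benchmark are hypothetical helpers standing in for the authors' training and evaluation setup.

def subset_importance(base_model, subsets, continue_pretraining, eval_math_benchmark):
    baseline = eval_math_benchmark(base_model)
    importance = {}
    for name, subset in subsets.items():
        model = continue_pretraining(base_model, subset)
        # A positive delta suggests this subset contributes to the math capability.
        importance[name] = eval_math_benchmark(model) - baseline
    return importance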
-
|
Intriguing Properties of Data Attribution on Diffusion Models
(
Poster
)
>
link
Data attribution seeks to trace model outputs back to training data. With the recent development of diffusion models, data attribution has become a desired module to properly assign valuations for high-quality or copyrighted training samples, ensuring that data contributors are fairly compensated or credited. Several theoretically motivated methods have been proposed to implement data attribution, in an effort to improve the trade-off between computational scalability and effectiveness. In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin, in terms of both linear datamodeling score and counterfactual evaluation. Our work presents a significantly more efficient approach for attributing diffusion models, while the unexpected findings suggest that at least in non-convex settings, constructions guided by theoretical assumptions may lead to inferior attribution performance. |
Xiaosen Zheng · Tianyu Pang · Chao Du · Jing Jiang · Min Lin 🔗 |
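A minimal sketch of the linear datamodeling score used as an evaluation above: attribution scores for a target output are summed over random training subsets and rank-correlated with the measured output of models actually retrained on those subsets. retrain_and_measure is a hypothetical (and expensive) stand-in for retraining a diffusion model on a subset and evaluating the target quantity.

import numpy as np
from scipy.stats import spearmanr

def linear_datamodeling_score(attributions, n_train, retrain_and_measure,
                              n_subsets=50, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    predicted, actual = [], []
    for _ in range(n_subsets):
        subset = rng.choice(n_train, size=int(frac * n_train), replace=False)
        predicted.append(attributions[subset].sum())   # linear prediction from attributions
        actual.append(retrain_and_measure(subset))     # e.g. diffusion loss on the target sample
    return spearmanr(predicted, actual)[0]             # rank correlation = LDS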
-
|
Forbidden Facts: An Investigation of Competing Objectives in Llama 2
(
Poster
)
>
link
SlidesLive Video LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-7b-chat on the forbidden fact task. Specifically, we instruct Llama 2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama 2 into 1057 different components and rank each one with respect to how useful it is for forbidding the correct answer. We find that, in aggregate, 41 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous, and many operate using faulty heuristics. We find that one of these heuristics can be exploited via manually designed adversarial attacks, which we call California Attacks. Our results highlight some roadblocks standing in the way of successfully interpreting advanced ML systems. |
Tony Wang · Miles Wang · Kaivalya Hariharan · Nir Shavit 🔗 |
-
|
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
(
Poster
)
>
link
Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization (identifying the important model components) is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters can lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for best practices of activation patching going forward. |
Fred Zhang · Neel Nanda 🔗 |
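A minimal sketch of the basic activation-patching operation on GPT-2 using plain forward hooks: cache the residual stream from a clean prompt, re-run on a corrupted prompt while splicing in one block's clean activation, and report a logit-difference metric. The prompts, metric, and block-level granularity are illustrative choices, not the paper's settings.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("When John and Mary went to the store, John gave a drink to", return_tensors="pt")
corrupt = tok("When John and Mary went to the store, Mary gave a drink to", return_tensors="pt")
mary = tok(" Mary")["input_ids"][0]
john = tok(" John")["input_ids"][0]

def logit_diff(logits):
    last = logits[0, -1]
    return (last[mary] - last[john]).item()   # metric: correct-name minus wrong-name logit

# Cache the clean residual stream at the output of every transformer block.
clean_acts = {}
hooks = [blk.register_forward_hook(
             lambda mod, inp, out, i=i: clean_acts.__setitem__(i, out[0].detach()))
         for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    clean_ld = logit_diff(model(**clean).logits)
for h in hooks:
    h.remove()
with torch.no_grad():
    corrupt_ld = logit_diff(model(**corrupt).logits)

def patch_block(i):
    # Run the corrupted prompt, but replace block i's output with the clean activation.
    def hook(mod, inp, out):
        return (clean_acts[i],) + out[1:]
    h = model.transformer.h[i].register_forward_hook(hook)
    with torch.no_grad():
        patched_ld = logit_diff(model(**corrupt).logits)
    h.remove()
    return patched_ld

print(f"clean {clean_ld:+.2f}  corrupt {corrupt_ld:+.2f}")
for i in range(len(model.transformer.h)):
    print(f"block {i}: patched logit diff {patch_block(i):+.2f}")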
-
|
Meta- (out-of-context) learning in neural networks
(
Poster
)
>
link
Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of a phenomenon we call meta-out-of-context learning (meta-OCL) via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code is available at https://github.com/krasheninnikov/internalization. |
Dmitrii Krasheninnikov · Egor Krasheninnikov · Bruno Mlodozeniec · David Krueger 🔗 |
-
|
Transformer-based Causal Language Models from a Meta-Learning Perspective
(
Poster
)
>
link
The Transformer architecture has become prominent for developing large causal language models. However, the mechanisms that explain its capabilities are not well understood. Here we establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task, by explicating an inner optimization process that may happen within the Transformer. Building on this inner optimization, we discover a special characteristic of the norms of learned token representations within Transformer-based causal language models. Our analysis is supported by experiments conducted on pre-trained large language models and real-world data. |
Xinbo Wu · Lav Varshney 🔗 |
-
|
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
(
Oral
)
>
link
We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics and demonstrates how a small number of training points can have an unusually large effect on a network's optimization trajectory and predictions. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals: consistent, large-magnitude features which dominate the network output and occur in both groups with similar frequency. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We complement these experiments with a theoretical analysis of a two-layer linear network on a simple model of opposing signals. Our finding enables new qualitative predictions of behavior during and after training, which we confirm experimentally. It also provides a new lens through which to study how specific data influence the learned parameters. |
Elan Rosenfeld · Andrej Risteski 🔗 |
-
|
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
(
Poster
)
>
link
Transformer-based Large Language Models (LLMs) are the state of the art for natural language tasks. Recent work has attempted to decode the internal mechanisms by which LLMs arrive at their final predictions for text completion tasks, for example by reverse engineering the role of linear layers. Yet little is known about the specific role of attention heads in producing the final token prediction. We propose Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens via learned attention-head-specific transformations called lenses. Preliminary findings from our trained lenses indicate that attention heads play highly specialized roles in language models. The code for Attention Lens is available at github.com/anonymized-for-review. |
Mansi Sakarvadia · Arham Khan · Aswathy Ajith · Daniel Grzenda · Nathaniel Hudson · André Bauer · Kyle Chard · Ian Foster 🔗 |
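A minimal sketch of a per-head "lens": a learned linear map from one attention head's output to vocabulary logits, trained here against next-token targets. The architecture and objective below are illustrative assumptions, not necessarily the transformations Attention Lens learns.

import torch
import torch.nn as nn

class HeadLens(nn.Module):
    """Learned projection from one attention head's output space to the vocabulary."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, head_out):            # head_out: [seq, d_model]
        return self.proj(head_out)          # per-position vocabulary logits

def train_lens(lens, head_activations, next_token_ids, epochs=3, lr=1e-3):
    # head_activations: cached [seq, d_model] tensors for one head;
    # next_token_ids: matching [seq] target ids (both assumed precomputed).
    opt = torch.optim.Adam(lens.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for acts, targets in zip(head_activations, next_token_ids):
            opt.zero_grad()
            loss = loss_fn(lens(acts), targets)
            loss.backward()
            opt.step()
    return lens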
-
|
Estimating the Generalization in Deep Neural Networks via Sparsity
(
Poster
)
>
link
Generalization is the key capability of deep neural networks (DNNs). However, it is challenging to give a reliable measure of a DNN's generalization ability from properties of the trained network alone. In this paper, we propose a novel method for estimating the generalization gap based on network sparsity. Two key sparsity quantities are extracted from the training results alone, and they exhibit a close relationship with model generalization. A simple linear model involving these two quantities is then constructed to give an accurate estimate of the generalization gap. By training DNNs with a wide range of generalization gaps on popular datasets, we show that our key quantities and linear model can be efficient tools for estimating the generalization gap of DNNs. |
Yang Zhao · Hao Zhang · Xiuyuan Hu 🔗 |
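A minimal sketch of the overall recipe: extract sparsity statistics from each trained network's activations, then fit a linear model from those statistics to the observed train-test gap across many networks. The simple zero-activation fraction below is a stand-in for the paper's two specific sparsity quantities.

import numpy as np
import torch
from sklearn.linear_model import LinearRegression

def activation_sparsity(model, layer, loader):
    """Fraction of non-positive (post-ReLU-zero) activations at `layer` over `loader`."""
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        for x, _ in loader:
            model(x)
    handle.remove()
    flat = torch.cat([a.flatten() for a in acts])
    return (flat <= 0).float().mean().item()

def fit_gap_estimator(sparsity_features, observed_gaps):
    # sparsity_features[i]: statistics of trained network i; observed_gaps[i]: its gap.
    return LinearRegression().fit(np.asarray(sparsity_features), np.asarray(observed_gaps))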
-
|
Data Attribution for Segmentation Models
(
Poster
)
>
link
The quality of segmentation models is driven by their training datasets labeled with detailed segmentation masks. How does the composition of such a training dataset contribute to the performance of the resulting segmentation model? In this work, we take a step towards attaining such an understanding by applying the lens of data attribution. To this end, we first identify specific behaviors of these models to attribute, and then provide a method for computing such attributions efficiently. We validate the resulting attributions, and leverage them to both identify harmful labeling errors and curate a 50% subset of the MS COCO training dataset that leads to a 2.79% ± 0.49% increase in mIoU over the full dataset.
|
Albert Tam · Joshua Vendrow · Aleksander Madry 🔗 |
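A minimal sketch of the curation step described above, assuming a hypothetical attribution matrix scores of shape (n_train, n_val) whose entries estimate each training image's effect on validation mIoU: keep the half of the training set with the largest aggregate positive influence and retrain on it.

import numpy as np

def curate_half(scores):
    total_influence = scores.sum(axis=1)         # net estimated effect of each training image
    order = np.argsort(total_influence)[::-1]    # most helpful first
    return order[: len(order) // 2]              # indices of the 50% subset to retrain on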
-
|
Summing Up the Facts: Additive Mechanisms behind Factual Recall in LLMs
(
Poster
)
>
link
How do large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task, factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form "Fact: The Colosseum is in the country of". We find that the mechanistic story behind factual recall is more complex than previously thought: we show there exist four distinct and independent mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute correct answers by adding together multiple independent contributions; the contributions from each mechanism may be insufficient alone, but together they constructively interfere on the correct attribute when summed. In addition, we extend the method of direct logit attribution to attribute a head's output to individual source tokens. We use this technique to unpack what we call 'mixed heads', which are themselves pairs of two separate additive updates from different source tokens. |
Bilal Chughtai · Alan Cooney · Neel Nanda 🔗 |
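A minimal sketch of the direct-logit-attribution bookkeeping behind the additive motif: each component's contribution to the final residual stream is projected onto the unembedding direction of the correct attribute token, so the (approximately) additive contributions can be compared. component_outputs (name -> [d_model] vector at the final position) and ln_scale are hypothetical stand-ins for cached activations; the per-source-token extension from the abstract is not shown.

import torch

def direct_logit_attribution(component_outputs, W_U, answer_id, ln_scale):
    direction = W_U[:, answer_id]                 # unembedding column for the correct attribute
    contribs = {name: torch.dot(out / ln_scale, direction).item()
                for name, out in component_outputs.items()}
    # Contributions sum (up to the final LayerNorm) to the answer logit; individually
    # small contributions can still constructively interfere when added together.
    return dict(sorted(contribs.items(), key=lambda kv: -kv[1]))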