Despite their wide adoption, the underlying training and memorization dynamics of very large language models are not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process. We also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; we hypothesize, and provide empirical evidence, that nouns and numbers act as unique identifiers for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of understanding what actually improves as models get bigger.
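The abstract does not spell out how exact memorization is scored, so below is a minimal sketch of one plausible per-token measure: a position counts as memorized if the model's argmax prediction there matches the ground-truth training token. The model name ("gpt2"), the Hugging Face API calls, and the token-level averaging are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch: per-token "exact memorization" as argmax-match accuracy
# on a training sequence. Assumes torch + transformers are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def exact_memorization_fraction(text: str) -> float:
    """Fraction of next-token predictions that exactly reproduce the text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, seq_len, vocab_size)
    # Predict token t+1 from the logits at position t, compare to the true token.
    preds = logits[:, :-1, :].argmax(dim=-1)
    targets = ids[:, 1:]
    return (preds == targets).float().mean().item()

print(exact_memorization_fraction("The quick brown fox jumps over the lazy dog."))
```

In this framing, tracking the fraction of memorized tokens (or of fully memorized examples) over training steps and across model scales would yield memorization curves of the kind the abstract describes.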
Author Information
Kushal Tirumala (California Institute of Technology)
Aram Markosyan (Facebook)
Luke Zettlemoyer (University of Washington and Facebook)
Armen Aghajanyan (Facebook)
More from the Same Authors
- 2021: A Granular Method for Finding Anomalous Light Curves and their Analogs (Kushal Tirumala)
- 2022 Poster: GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Tim Dettmers · Mike Lewis · Younes Belkada · Luke Zettlemoyer)
- 2022 Poster: Improving Policy Learning via Language Dynamics Distillation (Victor Zhong · Jesse Mu · Luke Zettlemoyer · Edward Grefenstette · Tim Rocktäschel)
- 2021: Panel Discussion (Pascal Poupart · Ali Ghodsi · Luke Zettlemoyer · Sameer Singh · Kevin Duh · Yejin Choi · Lu Hou)
- 2021: Toward Efficient Training of Large Language Models with Balanced Conditional Compute (Luke Zettlemoyer)
- 2021 Poster: Luna: Linear Unified Nested Attention (Xuezhe Ma · Xiang Kong · Sinong Wang · Chunting Zhou · Jonathan May · Hao Ma · Luke Zettlemoyer)
- 2021 Poster: SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark (Victor Zhong · Austin W. Hanjie · Sida Wang · Karthik Narasimhan · Luke Zettlemoyer)
- 2020: Invited talk - De-noising Sequence-to-Sequence Pre-training (Luke Zettlemoyer)
- 2020 Poster: Pre-training via Paraphrasing (Mike Lewis · Marjan Ghazvininejad · Gargi Ghosh · Armen Aghajanyan · Sida Wang · Luke Zettlemoyer)
- 2017: End-to-end Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond (Luke Zettlemoyer)
- 2008 Poster: Multi-Agent Filtering with Infinitely Nested Beliefs (Luke Zettlemoyer · Brian Milch · Leslie Kaelbling)
- 2008 Spotlight: Multi-Agent Filtering with Infinitely Nested Beliefs (Luke Zettlemoyer · Brian Milch · Leslie Kaelbling)