Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different positional encoding approaches, including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms the explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that a scratchpad does not always help with length generalization and that its format strongly affects the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
Author Information
Amirhossein Kazemnejad (Mila / McGill University)
Inkit Padhi (IBM Research)
Karthikeyan Natesan Ramamurthy (IBM Research)
Payel Das (IBM Research)
Siva Reddy (McGill University / Mila)
More from the Same Authors
-
2021 : Accurate Multi-Endpoint Molecular Toxicity Predictions in Humans with Contrastive Explanations »
Bhanushee Sharma · Vijil Chenthamarakshan · Amit Dhurandhar · James Hendler · Jonathan S. Dordick · Payel Das -
2021 : Visually Grounded Reasoning across Languages and Cultures »
Fangyu Liu · Emanuele Bugliarello · Edoardo Ponti · Siva Reddy · Desmond Elliott -
2021 : Sample-Efficient Generation of Novel Photo-acid Generator Molecules using a Deep Generative Model »
Samuel Hoffman · Vijil Chenthamarakshan · Dmitry Zubarev · Daniel Sanders · Payel Das -
2021 : Grapher: Multi-Stage Knowledge Graph Construction using Pretrained Language Models »
Igor Melnyk · Pierre Dognin · Payel Das -
2022 : Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators »
Jenna A Bilbrey · Kristina Herman · Henry Sprueill · Sotiris Xantheas · Payel Das · Manuel Lopez Roldan · Mike Kraus · Hatem Helal · Sutanay Choudhury -
2022 : Consistent Training via Energy-Based GFlowNets for Modeling Discrete Joint Distributions »
Chanakya Ekbote · Moksh Jain · Payel Das · Yoshua Bengio -
2023 : Influence Based Approaches to Algorithmic Fairness: A Closer Look »
Soumya Ghosh · Prasanna Sattigeri · Inkit Padhi · Manish Nagireddy · Jie Chen -
2023 : AlphaFold Distillation for Protein Design »
Igor Melnyk · Aurelie Lozano · Payel Das · Vijil Chenthamarakshan -
2023 : Characterizing pre-trained and task-adapted molecular representations »
Celia Cintas · Payel Das · Jarret Ross · Brian Belgodere · Girmaw Abebe Tadesse · Vijil Chenthamarakshan · Jannis Born · Skyler D. Speakman -
2023 Poster: Are Diffusion Models Vision-And-Language Reasoners? »
Benno Krojer · Elinor Poole-Dayan · Vikram Voleti · Chris Pal · Siva Reddy -
2023 Poster: Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning »
Amit Dhurandhar · Karthikeyan Natesan Ramamurthy · Kartik Ahuja · Vijay Arya -
2023 Poster: Cookie Consent Has Disparate Impact on Estimation Accuracy »
Erik Miehling · Rahul Nair · Elizabeth Daly · Karthikeyan Natesan Ramamurthy · Robert Redmond -
2023 Poster: Efficient Equivariant Transfer Learning from Pretrained Models »
Sourya Basu · Pulkit Katdare · Prasanna Sattigeri · Vijil Chenthamarakshan · Katherine Driggs-Campbell · Payel Das · Lav Varshney -
2023 Poster: Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction »
Zuobai Zhang · Minghao Xu · Aurelie Lozano · Vijil Chenthamarakshan · Payel Das · Jian Tang -
2022 : Panel »
Pin-Yu Chen · Alex Gittens · Bo Li · Celia Cintas · Hilde Kuehne · Payel Das -
2022 : Do we still need inductive biases after Transformer language models? »
Siva Reddy -
2022 : SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data »
Ching-Yun Ko · Pin-Yu Chen · Jeet Mohapatra · Payel Das · Luca Daniel -
2022 Poster: Is this the Right Neighborhood? Accurate and Query Efficient Model Agnostic Explanations »
Amit Dhurandhar · Karthikeyan Natesan Ramamurthy · Karthikeyan Shanmugam -
2022 Poster: Fair Infinitesimal Jackknife: Mitigating the Influence of Biased Training Data Points Without Refitting »
Prasanna Sattigeri · Soumya Ghosh · Inkit Padhi · Pierre Dognin · Kush Varshney -
2022 Expo Demonstration: Real-time Navigation of Chemical Space with Cloud-Based Inference from MoLFormer »
Payel Das · Brian Belgodere -
2021 Poster: Predicting Deep Neural Network Generalization with Perturbation Response Curves »
Yair Schiff · Brian Quanz · Payel Das · Pin-Yu Chen -
2021 Poster: End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering »
Devendra Singh · Siva Reddy · Will Hamilton · Chris Dyer · Dani Yogatama -
2021 Poster: Mean-based Best Arm Identification in Stochastic Bandits under Reward Contamination »
Arpan Mukherjee · Ali Tajer · Pin-Yu Chen · Payel Das -
2020 : Closing Remarks »
Frederic Chazal · Smita Krishnaswamy · Roland Kwitt · Karthikeyan Natesan Ramamurthy · Bastian Rieck · Yuhei Umeda · Guy Wolf -
2020 : Spotlight: Characterizing the Latent Space of Molecular Generative Models with Persistent Homology Metrics »
Yair Schiff · Payel Das · Vijil Chenthamarakshan · Karthikeyan Natesan Ramamurthy -
2020 Workshop: Topological Data Analysis and Beyond »
Bastian Rieck · Frederic Chazal · Smita Krishnaswamy · Roland Kwitt · Karthikeyan Natesan Ramamurthy · Yuhei Umeda · Guy Wolf -
2020 : Opening Remarks »
Frederic Chazal · Smita Krishnaswamy · Roland Kwitt · Karthikeyan Natesan Ramamurthy · Bastian Rieck · Yuhei Umeda · Guy Wolf -
2020 Poster: Finding the Homology of Decision Boundaries with Active Learning »
Weizhi Li · Gautam Dasarathy · Karthikeyan Natesan Ramamurthy · Visar Berisha -
2020 Poster: Model Agnostic Multilevel Explanations »
Karthikeyan Natesan Ramamurthy · Bhanukiran Vinzamuri · Yunfeng Zhang · Amit Dhurandhar -
2020 Poster: A Decentralized Parallel Algorithm for Training Generative Adversarial Nets »
Mingrui Liu · Wei Zhang · Youssef Mroueh · Xiaodong Cui · Jarret Ross · Tianbao Yang · Payel Das -
2020 : Spotlight on women at IBM Research »
Lisa Amini · Francesca Rossi · Celia Cintas · Payel Das -
2020 Poster: CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models »
Vijil Chenthamarakshan · Payel Das · Samuel Hoffman · Hendrik Strobelt · Inkit Padhi · Kar Wai Lim · Benjamin Hoover · Matteo Manica · Jannis Born · Teodoro Laino · Aleksandra Mojsilovic -
2020 : CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models »
Payel Das -
2020 Poster: Optimizing Mode Connectivity via Neuron Alignment »
Norman J Tatro · Pin-Yu Chen · Payel Das · Igor Melnyk · Prasanna Sattigeri · Rongjie Lai -
2020 Expo Talk Panel: AI against COVID-19 at IBM Research »
Divya Pathak · Payel Das · Michal Rosen-Zvi · Salim Roukos -
2019 : Coffee Break and Poster Session »
Rameswar Panda · Prasanna Sattigeri · Kush Varshney · Karthikeyan Natesan Ramamurthy · Harvineet Singh · Vishwali Mhasawade · Shalmali Joshi · Laleh Seyyed-Kalantari · Matthew McDermott · Gal Yona · James Atwood · Hansa Srinivasan · Yonatan Halpern · D. Sculley · Behrouz Babaki · Margarida Carvalho · Josie Williams · Narges Razavian · Haoran Zhang · Amy Lu · Irene Y Chen · Xiaojie Mao · Angela Zhou · Nathan Kallus -
2018 : Contributed Work »
Thaer Moustafa Dieb · Aditya Balu · Amir H. Khasahmadi · Viraj Shah · Boris Knyazev · Payel Das · Garrett Goh · Georgy Derevyanko · Gianni De Fabritiis · Reiko Hagawa · John Ingraham · David Belanger · Jialin Song · Kim Nicoli · Miha Skalic · Michelle Wu · Niklas Gebauer · Peter Bjørn Jørgensen · Ryan-Rhys Griffiths · Shengchao Liu · Sheshera Mysore · Hai Leong Chieu · Philippe Schwaller · Bart Olsthoorn · Bianca-Cristina Cristescu · Wei-Cheng Tseng · Seongok Ryu · Iddo Drori · Kevin Yang · Soumya Sanyal · Zois Boukouvalas · Rishi Bedi · Arindam Paul · Sambuddha Ghosal · Daniil Bash · Clyde Fare · Zekun Ren · Ali Oskooei · Minn Xuan Wong · Paul Sinz · Théophile Gaudin · Wengong Jin · Paul Leu -
2018 Demonstration: PatentAI: IP Infringement Detection with Enhanced Paraphrase Identification »
Youssef Drissi · Karthikeyan Natesan Ramamurthy · Prasanna Sattigeri -
2018 Poster: Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives »
Amit Dhurandhar · Pin-Yu Chen · Ronny Luss · Chun-Chen Tu · Paishun Ting · Karthikeyan Shanmugam · Payel Das -
2017 Poster: Optimized Pre-Processing for Discrimination Prevention »
Flavio Calmon · Dennis Wei · Bhanukiran Vinzamuri · Karthikeyan Natesan Ramamurthy · Kush Varshney