

Poster

Selective Attention: Enhancing Transformer through Principled Context Control

Xuechen Zhang · Xiangyu Chang · Mingchen Li · Amit Roy-Chowdhury · Jiasi Chen · Samet Oymak

East Exhibit Hall A-C #1803
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: The attention mechanism is the central component of the transformer architecture as it enables the model to create learnable weighted combinations of the tokens that are relevant to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same way by applying the mapping $V^\top\text{softmax}(Kq)$, where $V,K$ are the value and key matrices, respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. To overcome this, we introduce a Selective Self-Attention (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. SSA utilizes a query-temperature to adapt the contextual sparsity of the softmax map to the specific query and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to assign distinct sparsity levels across queries. To enhance relevance control, we also introduce a value-temperature and show that it boosts the model's ability to suppress irrelevant/noisy tokens. Extensive empirical evaluations corroborate that SSA noticeably improves the language modeling performance: SSA-equipped Pythia and Llama models achieve a respectable and consistent perplexity improvement on language modeling benchmarks while introducing only about 5\% more parameters.
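The abstract describes two modifications to standard softmax attention: a query-dependent temperature applied inside the softmax and a value-temperature that gates token contributions. The exact parameterization is not specified on this page, so the sketch below is only illustrative: it assumes a single head, a scalar temperature per query produced by a small linear layer (`query_temp`), a per-token value gate (`value_temp`), and omits the positional component of the temperature mentioned in the abstract. The class and parameter names are hypothetical, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveSelfAttention(nn.Module):
    """Illustrative single-head attention with query- and value-temperature scaling."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Small heads mapping each token to scalar temperatures (assumed parameterization).
        self.query_temp = nn.Linear(dim, 1)  # controls softmax sparsity per query
        self.value_temp = nn.Linear(dim, 1)  # down-weights irrelevant/noisy value tokens
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Per-query inverse temperature; softplus keeps it positive so it can
        # sharpen (sparse) or flatten (dense) the attention distribution.
        tau_q = F.softplus(self.query_temp(x))        # (batch, seq_len, 1)
        # Per-token value gate in (0, 1) for relevance control.
        tau_v = torch.sigmoid(self.value_temp(x))     # (batch, seq_len, 1)

        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        scores = scores * tau_q                       # temperature scaling inside softmax

        # Causal mask for language modeling.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(mask, float("-inf"))

        attn = torch.softmax(scores, dim=-1)
        return torch.matmul(attn, v * tau_v)          # temperature-scaled values
```

Because each temperature head maps a `dim`-dimensional token to a scalar, the added parameters are a small fraction of the layer, consistent with the roughly 5% overhead reported in the abstract.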
