Poster
D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models
Yikun Jiang · Huanyu Wang · Lei Xie · Hanbin Zhao · Zhang Chao · Hui Qian · John C.S. Lui
East Exhibit Hall A-C #3105
Large language models have shown an impressive societal impact owing to their excellent understanding and logical reasoning skills. However, such strong ability relies on a huge amount of computing resources, which makes it difficult to deploy LLMs on resource-constrained platforms. Current LLMs process every token identically, but we argue that not every word is equally important: some words should not be allocated excessive computing resources, particularly dispensable terms in simple questions. In this paper, we propose a novel dynamic inference paradigm for LLMs, namely D-LLMs, which adaptively allocate computing resources during token processing. We design a dynamic decision module for each transformer layer that decides whether a network unit should be executed or skipped. Moreover, we tackle the issue of adapting D-LLMs to real-world applications, specifically the missing KV-cache entries of skipped layers. To overcome this, we propose a simple yet effective eviction policy that excludes skipped layers from subsequent attention calculations. This policy not only makes D-LLMs compatible with prevalent applications but also saves considerable storage. Experimentally, D-LLMs show superior performance in terms of computational cost and KV-cache storage: they reduce both by up to 45% on Q&A, summarization, and math-solving tasks, and by up to 50% on commonsense reasoning tasks.
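The following is a minimal sketch (in PyTorch-style Python, not the authors' released code) of the two ideas above: a per-layer decision head that gates whether a transformer layer runs, and a decoding loop in which skipped layers store no KV entries and are therefore excluded from later attention. The class names, the 0.5 threshold, the layer call signature, and the per-layer list cache are illustrative assumptions.

import torch
import torch.nn as nn

class SkipDecision(nn.Module):
    # Hypothetical per-layer decision head: maps the token's hidden state
    # to a keep/skip choice (sizes and threshold are illustrative).
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (1, hidden_size) for single-sequence decoding.
        # Returns True to execute the layer, False to skip it.
        return torch.sigmoid(self.proj(hidden)).squeeze(-1) > 0.5

def decode_step(layers, decisions, hidden, kv_cache):
    # One decoding step over the layer stack. kv_cache is a per-layer list;
    # skipped layers append nothing, so their missing keys/values are never
    # attended to later -- a stand-in for the eviction policy in the abstract.
    for i, (layer, decide) in enumerate(zip(layers, decisions)):
        if decide(hidden).item():
            hidden, kv = layer(hidden, past_kv=kv_cache[i])  # assumed layer API
            kv_cache[i].append(kv)  # store KV only for executed layers
        # else: no compute and no KV entry for this token at layer i
    return hidden, kv_cache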