LOGCA: Layer-Optimized GPU-CPU Allocation for Efficient Resource Management in Large-Scale Models
Abstract
Efficient deployment of large-scale models in resource-limited environments requires intelligent allocation of compute and memory. While prior methods such as PowerInfer offload less important neurons to the CPU, they overlook the varying importance of model layers. We propose LOGCA (Layer-Optimized GPU-CPU Allocation), which dynamically assigns layers to the GPU or CPU according to their importance, measured by a weighted angular distance that incorporates neuron activation strength. Critical layers are executed on the GPU for speed, while less important layers are offloaded to the CPU to conserve GPU memory. LOGCA further introduces an adaptive thresholding mechanism that adjusts in real time to system load, improving scalability. Our method improves inference speed and memory efficiency, making it well suited for deploying large-scale models in constrained settings.
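To make the importance metric concrete, the sketch below shows one plausible formulation, assuming the angular distance is taken between a layer's input and output hidden states, weighted by per-token activation magnitude, and that the adaptive threshold is modeled as a load-dependent fraction of layers kept on the GPU. The function names, tensor shapes, and weighting scheme are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def layer_importance(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """Weighted angular distance between a layer's input and output hidden
    states (shape: [seq_len, hidden]); positions with stronger activations
    contribute more. Illustrative only -- the paper's metric may differ."""
    cos = F.cosine_similarity(h_in, h_out, dim=-1)          # per-token cosine similarity
    ang = torch.arccos(cos.clamp(-1.0, 1.0)) / torch.pi     # angular distance in [0, 1]
    w = h_out.norm(dim=-1)                                   # activation strength as weight
    return float((w * ang).sum() / w.sum())

def assign_devices(importances: list[float], gpu_fraction: float = 0.5) -> list[str]:
    """Keep the most important layers on the GPU; the adaptive threshold is
    approximated here by a load-dependent fraction of layers retained on GPU."""
    k = max(1, round(len(importances) * gpu_fraction))
    threshold = sorted(importances, reverse=True)[k - 1]
    return ["cuda" if s >= threshold else "cpu" for s in importances]
```

Under this reading, lowering `gpu_fraction` when system load rises would offload more low-importance layers to the CPU, which is one way the real-time adjustment described above could be realized.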