

Poster

Scala: Scalable Representation Learning for Vision Transformer

Yitian Zhang · Huseyin Coskun · Xu Ma · Huan Wang · Ke Ma · Stephen Chen · Derek Hu · Yun Fu

East Exhibit Hall A-C #1511
Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit in an environment with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically sub-networks of a larger ViT with different widths. Thus, we propose a general framework, named Scala, that enables a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViT to vary in width. Concretely, Scala activates several sub-networks during training, introduces Isolated Activation to disentangle the smallest sub-network from the others, and leverages Scale Coordination to ensure each sub-network receives simplified, steady, and accurate learning objectives. Comprehensive empirical validation on different tasks demonstrates that with only one-shot training, Scala learns scalable representations without modifying the original ViT structure and matches the performance of Separate Training. Compared with the prior art, Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
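The abstract describes a width-scalable training scheme in which several sub-networks are activated per step, the smallest sub-network is handled separately, and narrower widths receive coordinated targets. Below is a minimal sketch of what one such training step could look like; it assumes a model exposing a hypothetical `set_width` method and uses full-width logits as detached soft targets, which is an illustrative guess at the mechanism rather than the authors' actual implementation of Isolated Activation or Scale Coordination.

```python
import random
import torch.nn.functional as F

def scalable_train_step(model, images, labels, optimizer,
                        width_mults=(0.25, 0.5, 0.75, 1.0)):
    """One training step activating several width sub-networks (sketch)."""
    optimizer.zero_grad()

    # Full-width forward pass: supervised by the labels; its logits also
    # serve as a detached soft target for the narrower sub-networks.
    model.set_width(1.0)                      # hypothetical API
    full_logits = model(images)
    loss = F.cross_entropy(full_logits, labels)
    soft_target = full_logits.detach()

    # Smallest sub-network: trained on its own objective so it is not
    # entangled with the intermediate widths (an assumption about how
    # Isolated Activation behaves).
    model.set_width(min(width_mults))
    small_logits = model(images)
    loss = loss + F.cross_entropy(small_logits, labels)

    # A randomly sampled intermediate width distils from the full model,
    # giving the sub-network a simplified, steady target.
    w = random.choice([m for m in width_mults if min(width_mults) < m < 1.0])
    model.set_width(w)
    mid_logits = model(images)
    loss = loss + F.kl_div(F.log_softmax(mid_logits, dim=-1),
                           F.softmax(soft_target, dim=-1),
                           reduction="batchmean")

    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the same weights would then serve any of the trained widths by calling `set_width` once, which is what gives the single network its flexible inference capability.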
