
Spatial-Temporal Gated Transformers for Efficient Video Processing
Yawei Li · Babak Ehteshami Bejnordi · Bert Moons · Tijmen Blankevoort · Amirhossein Habibian · Radu Timofte · Luc V Gool

We focus on the problem of efficient video stream processing with fully transformer-based architectures. Recent advances brought by transformers to image-based tasks have inspired interest in applying transformers to videos. Yet, when image-based transformer solutions are applied to videos, the computation becomes inefficient due to the redundant information in adjacent video frames. An analysis of the computation cost of the video object detection framework DETR identifies the linear layers as the major computation bottleneck. Thus, we propose dynamic gating layers to conduct conditional computation. With the generated binary or ternary gates, the computation for stable background tokens in the video frames can be skipped. The effectiveness of the dynamic gating mechanism for transformers is validated by experimental results. For video object detection, FLOPs are reduced by 48.3% without a significant drop in accuracy.
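The conditional-computation idea above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function and parameter names are hypothetical. A binary gate decides, per token, whether the linear layer is recomputed for the current frame or whether the cached output from the previous frame is reused, so stable background tokens incur no linear-layer FLOPs.

```python
import numpy as np

def gated_linear(tokens, cached_out, W, b, gate_w, gate_b):
    """Hypothetical sketch of a token-gated linear layer.

    tokens:     (num_tokens, dim_in)  current-frame token features
    cached_out: (num_tokens, dim_out) linear-layer outputs from the previous frame
    W, b:       linear-layer weight (dim_in, dim_out) and bias (dim_out,)
    gate_w, gate_b: per-token gate parameters (dim_in,), scalar

    Returns the gated outputs and the boolean keep mask.
    """
    # Binary gate: positive logit means "recompute this token".
    # (During training such hard gates are typically relaxed, e.g. with
    # a straight-through estimator; that part is omitted here.)
    logits = tokens @ gate_w + gate_b
    keep = logits > 0

    # Reuse cached outputs for inactive (stable background) tokens,
    # and recompute the linear layer only for active tokens.
    out = cached_out.copy()
    if keep.any():
        out[keep] = tokens[keep] @ W + b
    return out, keep
```

The saving comes from the boolean-masked matrix multiply: only the rows selected by `keep` pass through the dense layer, so per-frame cost scales with the number of changed tokens rather than the full token count.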

Author Information

Yawei Li (Swiss Federal Institute of Technology)
Babak Ehteshami Bejnordi (Qualcomm AI Research)
Bert Moons (Synopsys)
Tijmen Blankevoort (Qualcomm)
Amirhossein Habibian (Qualcomm AI Research)
Radu Timofte (ETH Zurich)
Luc V Gool (Computer Vision Lab, ETH Zurich)
