Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
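As a rough illustration of the idea in the abstract, the sketch below first aligns patches between neighboring frames and then applies 1-dimensional attention along the time axis for each aligned patch position. It is not the authors' implementation. The greedy cosine-similarity matching, the absence of learned query/key/value projections, and the function names `align_to_previous` and `aligned_temporal_attention` are all assumptions made for illustration; the abstract only states that the alignment is parameter-free and operates at patch level.

```python
import numpy as np


def align_to_previous(prev, curr):
    """Reorder the patches of `curr` so that row i best matches row i of `prev`.

    Assumption: greedy nearest-neighbor matching by cosine similarity.
    It has no learned parameters; the paper's matching rule may differ.
    prev, curr: arrays of shape (N, D) = (patches, channels).
    """
    p = prev / (np.linalg.norm(prev, axis=-1, keepdims=True) + 1e-8)
    c = curr / (np.linalg.norm(curr, axis=-1, keepdims=True) + 1e-8)
    sim = p @ c.T                  # (N, N) cosine similarities
    idx = sim.argmax(axis=1)       # best patch in `curr` for each patch in `prev`
    return curr[idx]


def aligned_temporal_attention(x):
    """Temporal self-attention over patch tracks built by frame-by-frame alignment.

    x: array of shape (T, N, D) = (frames, patches, channels).
    Returns an array of shape (T, N, D) in the aligned patch order.
    Assumption: queries, keys and values are the raw features (no projections).
    """
    T, N, D = x.shape
    aligned = [x[0]]
    for t in range(1, T):
        # Align each frame to the already-aligned previous frame.
        aligned.append(align_to_previous(aligned[-1], x[t]))
    tracks = np.stack(aligned).transpose(1, 0, 2)             # (N, T, D)

    # Scaled dot-product attention along the time axis, one track at a time.
    scores = tracks @ tracks.transpose(0, 2, 1) / np.sqrt(D)  # (N, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ tracks                                    # (N, T, D)
    return out.transpose(1, 0, 2)                             # (T, N, D)
```

In this sketch the attention cost per patch position grows with the square of the number of frames rather than with all positions in all frames, which is the efficiency argument for factorized temporal attention. A greedy match can assign the same patch twice; a one-to-one assignment would be a natural alternative under the same parameter-free constraint.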
Author Information
Yizhou Zhao (Carnegie Mellon University)
Zhenyang Li
Xun Guo (Microsoft Research Asia)
Yan Lu (Microsoft Research Asia)
More from the Same Authors
- 2022 Spotlight: Lightning Talks 5B-4
  Yuezhi Yang · Zeyu Yang · Yong Lin · Yi.shi Xu · Linan Yue · Tao Yang · Weixin Chen · Qi Liu · Jiaqi Chen · Dongsheng Wang · Baoyuan Wu · Yuwang Wang · Hao Pan · Shengyu Zhu · Zhenwei Miao · Yan Lu · Lu Tan · Bo Chen · Yichao Du · Haoqian Wang · Wei Li · Yanqing An · Ruiying Lu · Peng Cui · Nanning Zheng · Li Wang · Zhibin Duan · Xiatian Zhu · Mingyuan Zhou · Enhong Chen · Li Zhang
- 2022 Spotlight: Visual Concepts Tokenization
  Tao Yang · Yuwang Wang · Yan Lu · Nanning Zheng
- 2022 Poster: Visual Concepts Tokenization
  Tao Yang · Yuwang Wang · Yan Lu · Nanning Zheng
- 2022 Poster: Mask-based Latent Reconstruction for Reinforcement Learning
  Tao Yu · Zhizheng Zhang · Cuiling Lan · Yan Lu · Zhibo Chen
- 2021 Poster: Deep Contextual Video Compression
  Jiahao Li · Bin Li · Yan Lu