

Revealing the unseen: Benchmarking video action recognition under occlusion

Shresth Grover · Vibhav Vineet · Yogesh Rawat

Great Hall & Hall B1+B2 (level 1) #520
Thu 14 Dec, 3 p.m. to 5 p.m. PST


In this work, we study the effect of occlusion on video action recognition. To facilitate this study, we propose three benchmark datasets and experiment with seven different video action recognition models. These datasets include two synthetic benchmarks, UCF-101-O and K-400-O, which enable understanding of the effects of fundamental properties of occlusion via controlled experiments. We also propose a real-world occlusion dataset, UCF-101-Y-OCC, which helps further validate the findings of this study. We find several interesting insights, such as 1) transformers are more robust than their CNN counterparts, 2) pretraining makes models robust against occlusions, and 3) augmentation helps, but does not generalize well to real-world occlusions. In addition, we propose a simple transformer-based compositional model, termed CTx-Net, which generalizes well under this distribution shift. We observe that CTx-Net outperforms models which are trained using occlusions as augmentation, performing significantly better under natural occlusions. We believe this benchmark will open up interesting future research in robust video action recognition.
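The synthetic benchmarks rely on overlaying controlled occlusions on existing clips. The snippet below is a minimal sketch, assuming a NumPy video tensor of shape (T, H, W, C), of how a controlled occluder with a fixed area ratio and either a static or per-frame position might be applied; the function name and parameters are illustrative and are not the authors' dataset-generation code.

```python
# Illustrative sketch (not the authors' code): overlay a rectangular occluder
# on a video clip, with controlled size and placement, in the spirit of the
# UCF-101-O / K-400-O synthetic occlusion benchmarks.
import numpy as np

def occlude_clip(clip: np.ndarray, area_ratio: float = 0.25, static: bool = True,
                 fill_value: int = 127, seed: int = 0) -> np.ndarray:
    """Overlay a square occluder covering `area_ratio` of each frame.

    clip: uint8 array of shape (T, H, W, C).
    static=True keeps the occluder position fixed across frames; False
    re-samples the position per frame, loosely mimicking a moving occluder.
    """
    rng = np.random.default_rng(seed)
    t, h, w, _ = clip.shape
    # Side length of a square whose area is `area_ratio` of the frame.
    side = int(np.sqrt(area_ratio * h * w))
    out = clip.copy()
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    for i in range(t):
        if not static:
            y = rng.integers(0, h - side + 1)
            x = rng.integers(0, w - side + 1)
        out[i, y:y + side, x:x + side, :] = fill_value
    return out

# Example: occlude 25% of every frame of a dummy 16-frame clip.
dummy_clip = np.zeros((16, 224, 224, 3), dtype=np.uint8)
occluded_clip = occlude_clip(dummy_clip, area_ratio=0.25, static=True)
```

The same routine, with the occluder sampled per training batch rather than fixed per benchmark, is one plausible way to realize the occlusion-as-augmentation baseline referred to in the abstract.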
