
Temporal Transductive Inference for Few-Shot Video Object Segmentation
Mennatullah Siam · Richard Wildes

Few-shot object segmentation has focused on segmenting static images in the query set. Recently, few-shot video object segmentation (FS-VOS), where the query images to be segmented belong to a video, has been introduced but is still under-explored. We propose a simple but effective temporal transductive inference (TTI) approach that uses the temporal continuity in videos to improve segmentation with a few-shot support set. We use both global and local cues. Global cues focus on learning a consistent prototype at the sequence level, whereas local cues focus on a consistent foreground/background region proportion within a local temporal window. Our model outperforms its state-of-the-art attention-based counterpart on few-shot YouTube-VIS by 2% in mean intersection over union (mIoU). Finally, we propose a more realistic FS-VOS setup that operates cross-domain. Our method outperforms the transductive inference baseline that uses static images, with a 1.3% improvement on two different benchmarks. This demonstrates that our method is a promising direction and opens the door to a label-efficient approach to annotating video datasets with rare classes that occur in different robotics settings, such as autonomous driving.
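
As a rough illustration of the local cue described in the abstract, the sketch below measures how consistent the foreground region proportion is within a local temporal window. It is not the authors' loss: the function names `foreground_proportion` and `local_window_penalty`, the squared-deviation penalty, and the default window size are all assumptions made for illustration only.

```python
import numpy as np


def foreground_proportion(probs):
    """Per-frame foreground fraction.

    probs: array of shape (T, H, W) holding foreground probabilities in [0, 1].
    Returns an array of shape (T,).
    """
    return probs.reshape(probs.shape[0], -1).mean(axis=1)


def local_window_penalty(proportions, window=3):
    """Hypothetical consistency penalty (an assumption, not the paper's formulation).

    For every frame, take the squared deviation of its foreground proportion
    from the mean proportion over a temporal window centred on that frame,
    then average over all frames. Windows are truncated at sequence borders.
    """
    num_frames = len(proportions)
    half = window // 2
    total = 0.0
    for t in range(num_frames):
        start = max(0, t - half)
        stop = min(num_frames, t + half + 1)
        total += (proportions[t] - proportions[start:stop].mean()) ** 2
    return total / num_frames


if __name__ == "__main__":
    # Toy sequence: 4 frames of 2x2 soft masks with identical foreground share.
    steady = np.full((4, 2, 2), 0.5)
    print(local_window_penalty(foreground_proportion(steady)))
```

A sequence whose foreground share stays the same across frames gets a penalty of zero, while a frame whose predicted object suddenly grows or shrinks relative to its neighbours raises the penalty.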

Author Information

Mennatullah Siam (University of Alberta)
Richard Wildes (York University)
