As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension / segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require designing complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture in which the two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
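The idea of decoding one contextualized query into both a box and a mask can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration and not the authors' implementation: the helper names (`box_head`, `mask_head`, `multitask_loss`), the random stand-in for the decoder query and per-pixel features, the normalized (cx, cy, w, h) box format, and the choice of an L1 box loss plus a binary cross-entropy mask loss.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    # weights holds one row per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def box_head(query, weights, bias):
    # Hypothetical head: regress a normalized (cx, cy, w, h) box from one query.
    return [sigmoid(z) for z in linear(weights, bias, query)]


def mask_head(query, pixel_features):
    # Hypothetical head: score each pixel by its dot product with the query.
    return [[sigmoid(sum(q * f for q, f in zip(query, feat))) for feat in row]
            for row in pixel_features]


def multitask_loss(pred_box, gt_box, pred_mask, gt_mask, eps=1e-7):
    # Assumed objective: L1 on the box plus mean binary cross-entropy on the mask.
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    bce_terms = [
        -(g * math.log(p + eps) + (1 - g) * math.log(1 - p + eps))
        for pred_row, gt_row in zip(pred_mask, gt_mask)
        for p, g in zip(pred_row, gt_row)
    ]
    return l1 + sum(bce_terms) / len(bce_terms)


# Random stand-ins for a decoder query and encoder pixel features (toy sizes).
rng = random.Random(0)
dim, height, width = 8, 4, 4
query = [rng.uniform(-1, 1) for _ in range(dim)]
box_w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
box_b = [0.0] * 4
pixels = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
          for _ in range(height)]

pred_box = box_head(query, box_w, box_b)
pred_mask = mask_head(query, pixels)

# Toy ground truth: a box and a 2x2 foreground square in the mask.
gt_box = [0.5, 0.5, 0.2, 0.3]
gt_mask = [[1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(width)]
           for r in range(height)]
loss = multitask_loss(pred_box, gt_box, pred_mask, gt_mask)
```

Because both heads read the same query, a gradient from either loss term would update the shared representation, which is one plausible reading of why multi-task training helps.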
Author Information
Muchen Li (University of British Columbia)
Leonid Sigal (University of British Columbia)
More from the Same Authors
- 2023 Poster: Mitigating the Effect of Incidental Correlations on Part-based Learning
  Gaurav Bhatt · Deepayan Das · Leonid Sigal · Vineeth N Balasubramanian
- 2022 Poster: Iterative Scene Graph Generation
  Siddhesh Khandelwal · Leonid Sigal
- 2021 Poster: TriBERT: Human-centric Audio-visual Representation Learning
  Tanzila Rahman · Mengyu Yang · Leonid Sigal
- 2020 Session: Orals & Spotlights Track 22: Vision Applications
  Leonid Sigal · Alex Schwing
- 2019: Traffic4cast -- Traffic Map Movie Forecasting
  Sepp Hochreiter · Leonid Sigal · Moritz Neun · David Jonietz · Sungbin Choi · Henry Martin · Wei Yu · Zhichen Liu · Tu Nguyen · Pedro Herruzo Sánchez · Xiaoxia Shi · Aleksandra Gruca · Alastair Sutherland · David Kreil · Michael Kopp
- 2018 Poster: Middle-Out Decoding
  Shikib Mehri · Leonid Sigal