Workshop

Integrating Language and Vision

Raymond Mooney ⋅ Trevor Darrell ⋅ Kate Saenko


Abstract

A growing number of researchers in computer vision have started to explore how language accompanying images and video can be used to aid interpretation and retrieval, as well as train object and activity recognizers. Simultaneously, an increasing number of computational linguists have begun to investigate how visual information can be used to aid language learning and interpretation, and to ground the meaning of words and sentences in perception. However, there has been very little direct interaction between researchers in these two distinct disciplines. Consequently, researchers in each area have a quite limited understanding of the methods in the other area, and do not optimally exploit the latest ideas and techniques from both disciplines when developing systems that integrate language and vision. Therefore, we believe the time is particularly opportune for a workshop that brings together researchers in both computer vision and natural-language processing (NLP) to discuss issues and ideas in developing systems that combine language and vision.

Traditional machine learning for both computer vision and NLP requires manually annotating images, video, text, or speech with detailed labels, parse trees, segmentations, etc. Methods that integrate language and vision hold the promise of greatly reducing such manual supervision by using naturally co-occurring text and images/video to supervise each other.

There is also a wide range of important real-world applications that require integrating vision and language, including but not limited to: image and video retrieval, human-robot interaction, medical image processing, human-computer interaction in virtual worlds, and computer graphics generation.

More than any other major conference, NIPS attracts researchers in both computer vision and computational linguistics. Therefore, we believe it is the best venue for holding a workshop that brings these two communities together for the very first time to interact, collaborate, and discuss issues and future directions in integrating language and vision.
