Fri Dec 08 08:00 AM -- 06:30 PM (PST) @ 101 B
Visually grounded interaction and language
Florian Strub · Harm de Vries · Abhishek Das · Satwik Kottur · Stefan Lee · Mateusz Malinowski · Olivier Pietquin · Devi Parikh · Dhruv Batra · Aaron Courville · Jeremie Mary

Workshop Home Page

Everyday interactions require a common understanding of language, i.e. for people to communicate effectively, words (for example ‘cat’) should invoke similar beliefs over physical concepts (what cats look like, the sounds they make, how they behave, what their skin feels like etc.). However, how this ‘common understanding’ emerges is still unclear.

One appealing hypothesis is that language is tied to how we interact with the environment. As a result, meaning emerges by ‘grounding’ language in modalities in our environment (images, sounds, actions, etc.).

Recent work in machine learning has focused on bridging visual and natural language understanding through visually-grounded language learning tasks, e.g. through natural images (Visual Question Answering, Visual Dialog), or through interactions with virtual physical environments. In cognitive science, progress in fMRI has made it possible to create a semantic atlas of the cerebral cortex and to decode semantic information from visual input. And in psychology, recent studies show that a baby’s most likely first words are based on their visual experience, laying the foundation for a new theory of infant language acquisition and learning.

As the grounding problem requires an interdisciplinary approach, this workshop aims to gather researchers with broad expertise in various fields (machine learning, computer vision, natural language, neuroscience, and psychology) to discuss their cutting-edge work as well as perspectives on future directions in this exciting space of grounding and interactions.

We will accept papers related to:
— language acquisition or learning through interactions
— visual captioning, dialog, and question-answering
— reasoning in language and vision
— visual synthesis from language
— transfer learning in language and vision tasks
— navigation in virtual worlds with natural-language instructions
— machine translation with visual cues
— novel tasks that combine language, vision and actions
— understanding and modeling the relationship between language and vision in humans
— semantic systems and modeling of natural language and visual stimuli representations in the human brain

Important dates
Submission deadline: 3rd November 2017
Extended Submission deadline: 17th November 2017

Acceptance notification (First deadline): 10th November 2017
Acceptance notification (Second deadline): 24th November 2017

Workshop: 8th December 2017

Paper details
— Contributed papers may include novel research, preliminary results, extended abstracts, position papers or surveys
— Papers are limited to 4 pages, excluding references, in the latest camera-ready NIPS format:
— Papers published at the main conference can be submitted without reformatting
— Please submit via email:

Accepted papers
— All accepted papers will be presented during 2 poster sessions
— Up to 5 accepted papers will be invited to deliver short talks
— Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences and journals

Invited Speakers
Raymond J. Mooney - University of Texas at Austin
Sanja Fidler - University of Toronto
Olivier Pietquin - DeepMind
Jack Gallant - University of California, Berkeley
Devi Parikh - Georgia Tech / FAIR
Felix Hill - DeepMind
Chen Yu - Indiana University

Welcome! (Workshop Introduction)
Visually Grounded Language: Past, Present, and Future... (Presentation)
Connecting high-level semantics with low-level vision (Presentation)
Break + Poster (1) (Break + Poster)
The interface between vision and language in the human brain? (Presentation)
Towards Embodied Question Answering (Presentation)
Dialogue systems and RL: interconnecting language, vision and rewards (Presentation)
Accepted Papers (Presentation)
Break + Poster (2) (Break + Poster)
Grounded Language Learning in a Simulated 3D World (Presentation)
How infants learn to speak by interacting with the visual world? (Presentation)
Panel Discussion