Fri Dec 8th 08:00 AM -- 06:30 PM @ 101 B
Visually grounded interaction and language
Florian Strub · Harm de Vries · Abhishek Das · Satwik Kottur · Stefan Lee · Olivier Pietquin · Mateusz Malinowski · Devi Parikh · Dhruv Batra · Aaron C Courville · Jeremie Mary


Everyday interactions require a common understanding of language: for people to communicate effectively, words (for example ‘cat’) should evoke similar beliefs about physical concepts (what cats look like, the sounds they make, how they behave, what their skin feels like, etc.). However, how this ‘common understanding’ emerges is still unclear.

One appealing hypothesis is that language is tied to how we interact with the environment. As a result, meaning emerges by ‘grounding’ language in modalities in our environment (images, sounds, actions, etc.).

Recent work in machine learning has focused on bridging visual and natural language understanding through visually-grounded language learning tasks, e.g. through natural images (Visual Question Answering, Visual Dialog), or through interactions with virtual physical environments. In cognitive science, progress in fMRI has made it possible to create a semantic atlas of the cerebral cortex and to decode semantic information from visual input. And in psychology, recent studies show that a baby’s most likely first words are based on their visual experience, laying the foundation for a new theory of infant language acquisition and learning.

As the grounding problem requires an interdisciplinary approach, this workshop aims to gather researchers with broad expertise in various fields (machine learning, computer vision, natural language, neuroscience, and psychology) to discuss their cutting-edge work as well as perspectives on future directions in this exciting space of grounding and interactions.

We will accept papers related to:
— language acquisition or learning through interactions
— visual captioning, dialog, and question-answering
— reasoning in language and vision
— visual synthesis from language
— transfer learning in language and vision tasks
— navigation in virtual worlds with natural-language instructions
— machine translation with visual cues
— novel tasks that combine language, vision and actions
— understanding and modeling the relationship between language and vision in humans
— semantic systems and modeling of natural language and visual stimuli representations in the human brain

Important dates
Submission deadline: 3rd November 2017
Extended Submission deadline: 17th November 2017

Acceptance notification (First deadline): 10th November 2017
Acceptance notification (Second deadline): 24th November 2017

Workshop: 8th December 2017

Paper details
— Contributed papers may include novel research, preliminary results, extended abstracts, position papers or surveys
— Papers are limited to 4 pages, excluding references, in the latest camera-ready NIPS format:
— Papers published at the main conference can be submitted without reformatting
— Please submit via email:

Accepted papers
— All accepted papers will be presented during 2 poster sessions
— Up to 5 accepted papers will be invited to deliver short talks
— Accepted papers will be made publicly available as non-archival reports, allowing future submissions to archival conferences and journals

Invited Speakers
Raymond J. Mooney - University of Texas at Austin
Sanja Fidler - University of Toronto
Olivier Pietquin - DeepMind
Jack Gallant - University of California, Berkeley
Devi Parikh - Georgia Tech / FAIR
Felix Hill - DeepMind
Chen Yu - Indiana University

08:30 AM Welcome! (Workshop Introduction)
08:45 AM Visually Grounded Language: Past, Present, and Future... (Presentation)
Ray Mooney
09:30 AM Connecting high-level semantics with low-level vision (Presentation)
Sanja Fidler
10:15 AM Break + Poster (1) (Break + Poster)
Devendra Singh Chaplot, Chih-Yao Ma, Simon Brodeur, Eri Matsuo, Ichiro Kobayashi, Seitaro Shinagawa, Koichiro Yoshino, Yuhong Guo, Ben Murdoch, Kanthashree Mysore Sathyendra, Daniel Ricks, Haichao Zhang, Joshua Peterson, Li Zhang, Mircea Mironenco, Peter Anderson, Mark Johnson, Kang Min Yoo, Guntis Barzdins, Ahmed H Zaidi, Martin Andrews, Sam Witteveen, Subbareddy Oota, Prashanth Vijayaraghavan, Ke Wang, Yan Zhu, Renars Liepins, Max Quinn, Amit Raj, Vincent Cartillier, Eric Chu, Ethan Caballero, Fritz H Obermeyer
10:40 AM The interface between vision and language in the human brain? (Presentation)
Jack Gallant
11:25 AM Towards Embodied Question Answering (Presentation)
Devi Parikh
02:00 PM Dialogue systems and RL: interconnecting language, vision and rewards (Presentation)
Olivier Pietquin
02:45 PM Accepted Papers (Presentation)
03:15 PM Break + Poster (2) (Break + Poster)
03:40 PM Grounded Language Learning in a Simulated 3D World (Presentation)
Felix Hill
04:25 PM How infants learn to speak by interacting with the visual world? (Presentation)
Chen Yu
05:10 PM Panel Discussion
Felix Hill, Olivier Pietquin, Jack Gallant, Ray Mooney, Sanja Fidler, Chen Yu, Devi Parikh