NIPS 2018


Visually grounded interaction and language

Florian Strub · Harm de Vries · Erik Wijmans · Samyak Datta · Ethan Perez · Mateusz Malinowski · Stefan Lee · Peter Anderson · Aaron Courville · Jeremie Mary · Dhruv Batra · Devi Parikh · Olivier Pietquin · Chiori Hori · Tim Marks · Anoop Cherian

Room 512 CDGH

The dominant paradigm in modern natural language understanding is learning statistical language models from text-only corpora. This approach is founded on a distributional notion of semantics, i.e. that the "meaning" of a word is based only on its relationship to other words. While effective for many applications, methods in this family suffer from limited semantic understanding, as they miss learning from the multimodal and interactive environment in which communication often takes place; the symbols of language are thus not grounded in anything concrete. The symbol grounding problem first highlighted this limitation: that "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols" [18].

Humans, on the other hand, acquire language by communicating about and interacting within a rich perceptual environment. This behavior provides the necessary grounding for symbols by tying them to concrete objects or concepts, whether physical or psychological. Thus, recent work has aimed to bridge vision, interactive learning, and natural language understanding through language learning tasks based on natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4,5,6], Visual Dialog [7], Captioning [8]) or through embodied agents performing interactive tasks [13,14,17,22,23,24,26] in physically simulated environments (DeepMind Lab [9], Baidu XWorld [10], OpenAI Universe [11], House3D [20], Matterport3D [21], GIBSON [24], MINOS [25], AI2-THOR [19], StreetLearn [17]), often drawing on the recent successes of deep learning and reinforcement learning. We believe this line of research offers a promising long-term solution to the grounding problem faced by current, popular language understanding models.

While machine learning research exploring visually-grounded language learning may be in its early stages, it may be possible to draw insights from the rich research literature on human language acquisition. In neuroscience, recent progress in fMRI technology has made it possible to better understand the interplay between language, vision and other modalities [15,16], suggesting that the brain shares neural representations of concepts across vision and language. Separately, developmental cognitive scientists have argued that children's acquisition of words is closely linked to their learning the underlying concepts in the real world [12].

This workshop thus aims to gather people from various backgrounds - machine learning, computer vision, natural language processing, neuroscience, cognitive science, psychology, and philosophy - to share and debate their perspectives on why grounding may (or may not) be important in building machines that truly understand natural language.

We invite you to submit papers related to the following topics:
- language acquisition or learning through interactions
- visual captioning, dialog, and question-answering
- reasoning in language and vision
- visual synthesis from language
- transfer learning in language and vision tasks
- navigation in virtual worlds via natural-language instructions or multi-agent communication
- machine translation with visual cues
- novel tasks that combine language, vision and actions
- modeling of natural language and visual stimuli representations in the human brain
- position papers on grounded language learning
- audio-visual scene-aware dialog
- audio-visual fusion

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be anonymous and in NIPS format. The review process is double-blind.

We also welcome published papers that are within the scope of the workshop (without re-formatting). These papers do not have to be anonymous. They are not eligible for the oral session and will only go through a very light review process.

Please submit your paper to the following address:

Accepted workshop papers are eligible for the pool of reserved conference tickets (one ticket per accepted paper).

If you have any questions, send an email to:

[1] Sahar Kazemzadeh et al. "ReferItGame: Referring to Objects in Photographs of Natural Scenes." EMNLP, 2014.
[2] Harm de Vries et al. "GuessWhat?! Visual object discovery through multi-modal dialogue." CVPR, 2017.
[3] Stanislaw Antol et al. "VQA: Visual Question Answering." ICCV, 2015.
[4] Mateusz Malinowski et al. "Ask Your Neurons: A Neural-based Approach to Answering Questions about Images." ICCV, 2015.
[5] Mateusz Malinowski et al. "A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input." NIPS, 2014.
[6] Donald Geman et al. "Visual Turing test for computer vision systems." PNAS, 2015.
[7] Abhishek Das et al. "Visual dialog." CVPR, 2017.
[8] Anna Rohrbach et al. "Generating Descriptions with Grounded and Co-Referenced People." CVPR, 2017.
[9] Charles Beattie et al. "DeepMind Lab." arXiv, 2016.
[10] Haonan Yu et al. "Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents." arXiv, 2018.
[11] "OpenAI Universe." 2016.
[12] Alison Gopnik et al. "Semantic and cognitive development in 15- to 21-month-old children." Journal of Child Language, 1984.
[13] Abhishek Das et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning." ICCV, 2017.
[14] Karl Moritz Hermann et al. "Grounded Language Learning in a Simulated 3D World." arXiv, 2017.
[15] Alexander G. Huth et al. "Natural speech reveals the semantic maps that tile human cerebral cortex." Nature, 2016.
[16] Alexander G. Huth et al. "Decoding the semantic content of natural movies from human brain activity." Frontiers in Systems Neuroscience, 2016.
[17] Piotr Mirowski et al. "Learning to Navigate in Cities Without a Map." arXiv, 2018.
[18] Stevan Harnad. "The symbol grounding problem." CNLS, 1989.
[19] Eric Kolve et al. "AI2-THOR: An Interactive 3D Environment for Visual AI." arXiv, 2017.
[20] Yi Wu et al. "House3D: A Rich and Realistic 3D Environment." arXiv, 2017.
[21] Angel Chang et al. "Matterport3D: Learning from RGB-D Data in Indoor Environments." arXiv, 2017.
[22] Abhishek Das et al. "Embodied Question Answering." CVPR, 2018.
[23] Peter Anderson et al. "Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments." CVPR, 2018.
[24] Fei Xia et al. "Gibson Env: Real-World Perception for Embodied Agents." CVPR, 2018.
[25] Manolis Savva et al. "MINOS: Multimodal indoor simulator for navigation in complex environments." arXiv, 2017.
[26] Daniel Gordon et al. "IQA: Visual Question Answering in Interactive Environments." CVPR, 2018.
