Visually grounded interaction and language
Florian Strub · Harm de Vries · Erik Wijmans · Samyak Datta · Ethan Perez · Mateusz Malinowski · Stefan Lee · Peter Anderson · Aaron Courville · Jeremie MARY · Dhruv Batra · Devi Parikh · Olivier Pietquin · Chiori HORI · Tim Marks · Anoop Cherian

Fri Dec 07 05:00 AM -- 03:30 PM (PST) @ Room 512 CDGH
Event URL: https://nips2018vigil.github.io/

The dominant paradigm in modern natural language understanding is learning statistical language models from text-only corpora. This approach is founded on a distributional notion of semantics, i.e. that the "meaning" of a word is based only on its relationship to other words. While effective for many applications, methods in this family suffer from limited semantic understanding, as they miss learning from the multimodal and interactive environment in which communication often takes place - the symbols of language are thus not grounded in anything concrete. The symbol grounding problem first highlighted this limitation, that “meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols” [18].

On the other hand, humans acquire language by communicating about and interacting within a rich, perceptual environment. This behavior provides the necessary grounding for symbols, i.e. it ties them to concrete objects or concepts (physical or psychological). Thus, recent work has aimed to bridge vision, interactive learning, and natural language understanding through language learning tasks based on natural images (ReferIt [1], GuessWhat?! [2], Visual Question Answering [3,4,5,6], Visual Dialog [7], Captioning [8]) or through embodied agents performing interactive tasks [13,14,17,22,23,24,26] in physically simulated environments (DeepMind Lab [9], Baidu XWorld [10], OpenAI Universe [11], House3D [20], Matterport3D [21], GIBSON [24], MINOS [25], AI2-THOR [19], StreetLearn [17]), often drawing on the recent successes of deep learning and reinforcement learning. We believe this line of research offers a promising, long-term solution to the grounding problem faced by current, popular language understanding models.

While machine learning research exploring visually-grounded language learning may be in its early stages, it may be possible to draw insights from the rich research literature on human language acquisition. In neuroscience, recent progress in fMRI technology has enabled a better understanding of the interplay between language, vision and other modalities [15,16], suggesting that the brain shares neural representations of concepts across vision and language. Separately, developmental cognitive scientists have argued that children's acquisition of words is closely linked to their learning of the underlying concepts in the real world [12].

This workshop thus aims to gather people from various backgrounds - machine learning, computer vision, natural language processing, neuroscience, cognitive science, psychology, and philosophy - to share and debate their perspectives on why grounding may (or may not) be important in building machines that truly understand natural language.

We invite you to submit papers related to the following topics:
- language acquisition or learning through interactions
- visual captioning, dialog, and question-answering
- reasoning in language and vision
- visual synthesis from language
- transfer learning in language and vision tasks
- navigation in virtual worlds via natural-language instructions or multi-agent communication
- machine translation with visual cues
- novel tasks that combine language, vision and actions
- modeling of natural language and visual stimuli representations in the human brain
- position papers on grounded language learning
- audio visual scene-aware dialog
- audio-visual fusion

Submissions should be up to 4 pages excluding references, acknowledgements, and supplementary material, and should be anonymous and in NIPS format. The review process is double-blind.

We also welcome published papers that are within the scope of the workshop (without re-formatting). These papers do not have to be anonymous. They are not eligible for the oral session and will only receive a very light review.

Please submit your paper to the following address: https://cmt3.research.microsoft.com/VIGIL2018

Accepted workshop papers are eligible for the pool of reserved conference tickets (one ticket per accepted paper).

If you have any questions, send an email to: vigilworkshop2018@gmail.com

[1] Sahar Kazemzadeh et al. "ReferItGame: Referring to Objects in Photographs of Natural Scenes." EMNLP, 2014.
[2] Harm de Vries et al. "GuessWhat?! Visual object discovery through multi-modal dialogue." CVPR, 2017.
[3] Stanislaw Antol et al. "VQA: Visual Question Answering." ICCV, 2015.
[4] Mateusz Malinowski et al. “Ask Your Neurons: A Neural-based Approach to Answering Questions about Images.” ICCV, 2015.
[5] Mateusz Malinowski et al. “A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input.” NIPS, 2014.
[6] Donald Geman et al. “Visual Turing test for computer vision systems.” PNAS, 2015.
[7] Abhishek Das et al. "Visual dialog." CVPR, 2017.
[8] Anna Rohrbach et al. “Generating Descriptions with Grounded and Co-Referenced People.” CVPR, 2017.
[9] Charles Beattie et al. "DeepMind Lab." arXiv, 2016.
[10] Haonan Yu et al. “Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents.” arXiv, 2018.
[11] OpenAI Universe. https://universe.openai.com, 2016.
[12] Alison Gopnik et al. “Semantic and cognitive development in 15- to 21-month-old children.” Journal of Child Language, 1984.
[13] Abhishek Das et al. "Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning." ICCV, 2017.
[14] Karl Moritz Hermann et al. "Grounded Language Learning in a Simulated 3D World." arXiv, 2017.
[15] Alexander G. Huth et al. "Natural speech reveals the semantic maps that tile human cerebral cortex." Nature, 2016.
[16] Alexander G. Huth et al. "Decoding the semantic content of natural movies from human brain activity." Frontiers in Systems Neuroscience, 2016.
[17] Piotr Mirowski et al. “Learning to Navigate in Cities Without a Map.” arXiv, 2018.
[18] Stevan Harnad. “The symbol grounding problem.” CNLS, 1989.
[19] E Kolve, R Mottaghi, D Gordon, Y Zhu, A Gupta, A Farhadi. “AI2-THOR: An Interactive 3D Environment for Visual AI.” arXiv, 2017.
[20] Yi Wu et al. “House3D: A Rich and Realistic 3D Environment.” arXiv, 2017.
[21] Angel Chang et al. “Matterport3D: Learning from RGB-D Data in Indoor Environments.” arXiv, 2017.
[22] Abhishek Das et al. “Embodied Question Answering.” CVPR, 2018.
[23] Peter Anderson et al. “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments.” CVPR, 2018.
[24] Fei Xia et al. “Gibson Env: Real-World Perception for Embodied Agents.” CVPR, 2018.
[25] Manolis Savva et al. “MINOS: Multimodal indoor simulator for navigation in complex environments.” arXiv, 2017.
[26] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi. “IQA: Visual Question Answering in Interactive Environments.” CVPR, 2018.

Author Information

Florian Strub (Univ Lille1, CRIStAL, Inria - SequeL Team)
Harm de Vries (Université de Montréal)
Erik Wijmans (Georgia Institute of Technology)
Samyak Datta (Georgia Institute of Technology)

I am a PhD student advised by Prof. Devi Parikh in the School of Interactive Computing within the College of Computing at Georgia Tech. I also work closely with Prof. Dhruv Batra. My research interests lie at the intersection of vision, language and actions. I am interested in training embodied agents to solve high-level AI tasks such as visual navigation and question-answering in simulation environments. I have also worked on problems in the space of weakly supervised learning.

Ethan Perez (New York University)

My research focuses on developing question-answering methods that generalize to harder questions than we have supervision for. Learning from human examples (supervised learning) won't scale to these kinds of questions, so I am investigating other paradigms that recursively break down harder questions into simpler ones.

Mateusz Malinowski (DeepMind)

Mateusz Malinowski is a research scientist at DeepMind, where he works at the intersection of computer vision, natural language understanding, and deep learning. He received his PhD (Dr.-Ing.) in computer vision with the highest honor (summa cum laude) from the Max Planck Institute for Informatics in 2017 for his pioneering work on visual question answering, where he proposed the task and developed methods that answer questions about the content of images. Prior to this, he graduated with honors from Saarland University in computer science. Before that, he studied computer science at Wroclaw University in Poland.

Stefan Lee (Georgia Tech)
Peter Anderson (Georgia Tech)

Research Scientist in Computer Vision / Deep Learning at Georgia Tech. I like to work on problems involving vision, language and embodied agents, e.g. image captioning, visual question answering (VQA), vision-and-language navigation (VLN), etc.

Aaron Courville (U. Montreal)
Dhruv Batra (FAIR (Meta) / Georgia Tech)
Devi Parikh (Georgia Tech / Facebook AI Research (FAIR))
Olivier Pietquin (Google Research Brain Team)
Chiori HORI (Mitsubishi Electric Research Laboratories (MERL))

Dr. Chiori Hori has worked on spoken language processing technologies since 1998. In 2002, she worked on spoken interactive Q&A using real-time Automatic Speech Recognition (ASR) based on a Weighted Finite-State Transducer (WFST) with a vocabulary of over a million words, at NTT. She joined CMU in 2004 and then moved to ATR/NICT in 2007. She led the NICT ASR research group and their system to first place in the English TED talk recognition at IWSLT for three consecutive years from 2012. She invented a WFST-based dialog technology which was implemented on a humanoid robot, Honda's ASIMO, at NICT. She has been working on neural network based technologies for Human-Robot communication at MERL since 2015. She is leading the 7th Dialog System Technology Challenge (DSTC6 and DSTC7) and the track of Audio Visual Scene Aware Dialog (AVSD). She has been on the editorial board of "Computer Speech and Language" since 2016 and is a member of the IEEE Speech and Language Processing Technical Committee.

Tim Marks (Mitsubishi Electric Research Laboratories (MERL))
Anoop Cherian (MERL)