
Workshop
Visually Grounded Interaction and Language
Florian Strub · Abhishek Das · Erik Wijmans · Harm de Vries · Stefan Lee · Alane Suhr · Drew Arad Hudson

Fri Dec 13 08:00 AM -- 06:15 PM (PST) @ West 202 - 204

The dominant paradigm in modern natural language understanding is learning statistical language models from text-only corpora. This approach is founded on a distributional notion of semantics, i.e. that the "meaning" of a word is based only on its relationship to other words. While effective for many applications, this approach suffers from limited semantic understanding: symbols learned this way lack any concrete grounding in the multimodal, interactive environment in which communication takes place. The symbol grounding problem first highlighted this limitation: "meaningless symbols (i.e. words) cannot be grounded in anything but other meaningless symbols".

On the other hand, humans acquire language by communicating about, and interacting within, a rich perceptual environment that provides concrete groundings, e.g. to objects or concepts, whether physical or psychological. Recent work has therefore aimed to bridge computer vision, interactive learning, and natural language understanding through language learning tasks based on natural images, or through embodied agents performing interactive tasks in physically simulated environments, often drawing on the recent successes of deep learning and reinforcement learning. We believe these lines of research offer a promising approach for building models that do grasp the world's underlying complexity.

The goal of this third ViGIL workshop is to bring together scientists from various backgrounds - machine learning, computer vision, natural language processing, neuroscience, cognitive science, psychology, and philosophy - to share their perspectives on grounding, embodiment, and interaction. By providing this opportunity for cross-disciplinary discussion, we hope to foster new ideas about how to learn and leverage grounding in machines, as well as build new bridges between the science of human cognition and machine learning.

#### Schedule

Note that the schedule is not final and may change.

- **Fri 8:20 a.m. - 8:30 a.m.** Opening Remarks: Florian Strub · Harm de Vries · Abhishek Das · Stefan Lee · Erik Wijmans · Drew Arad Hudson · Alane Suhr
- **Fri 8:30 a.m. - 9:10 a.m.** Grasping Language (Talk): Jason Baldridge
- **Fri 9:10 a.m. - 9:50 a.m.** From Human Language to Agent Action (Talk): Jesse Thomason
  There is a usability gap between manipulation-capable robots and helpful in-home digital agents. Dialog-enabled smart assistants have recently seen widespread adoption, but these cannot move or manipulate objects. By contrast, manipulation-capable and mobile robots are still largely deployed in industrial settings and do not interact with human users. Language-enabled robots can bridge this gap---natural language interfaces help robots and non-experts collaborate to achieve their goals. Navigation in unexplored environments to high-level targets like "Go to the room with a plant" can be facilitated by enabling agents to ask questions and react to human clarifications on-the-fly. Further, high-level instructions like "Put a plate of toast on the table" require inferring many steps, from finding a knife to operating a toaster. Low-level instructions can serve to clarify these individual steps. Through two new datasets and accompanying models, we study human-human dialog for cooperative navigation, and high- and low-level language instructions for cooking, cleaning, and tidying in interactive home environments. These datasets are a first step towards collaborative, dialog-enabled robots helpful in human spaces.
- **Fri 9:50 a.m. - 10:30 a.m.** Coffee Break
- **Fri 10:30 a.m. - 10:50 a.m.** Spotlight
- **Fri 10:50 a.m. - 11:30 a.m.** Why language understanding is not a solved problem (Talk): Jay McClelland
  Over the years, periods of intense excitement about the prospects of machine intelligence and language understanding have alternated with periods of skepticism, to say the least. It is possible to look back over the ~70-year history of this effort and see great progress, and I for one am pleased to see how far we have come. Yet from where I sit we still have a long way to go, and language understanding may be one of those parts of intelligence that will be the hardest to solve. In spite of recent breakthroughs, humans create and comprehend more structured discourse than our current machines. At the same time, psycholinguistic research suggests that humans suffer from some of the same limitations as these machines. How can humans create and comprehend structured arguments given these limitations? Will it be possible for machines to emulate these aspects of human achievement as well?
- **Fri 11:30 a.m. - 12:10 p.m.** Talk: Louis-Philippe Morency
- **Fri 12:10 p.m. - 2:00 p.m.** Poster Session: Candace Ross · Yassine Mrabet · Sanjay Subramanian · Geoffrey Cideron · Jesse Mu · Suvrat Bhooshan · Eda Okur Kavil · Jean-Benoit Delbrouck · Yen-Ling Kuo · Nicolas Lair · Gabriel Ilharco · T.S. Jayram · Alba María Herrera Palacio · Chihiro Fujiyama · Olivier Tieleman · Anna Potapenko · Guan-Lin Chao · Thomas Sutter · Olga Kovaleva · Farley Lai · Xin Wang · Vasu Sharma · Catalina Cangea · Nikhil Krishnaswamy · Yuta Tsuboi · Alexander Kuhnle · Khanh Nguyen · Dian Yu · Homagni Saha · Jiannan Xiang · Vijay Venkataraman · Ankita Kalra · Ning Xie · Derek Doran · Travis Goodwin · Asim Kadav · Shabnam Daghaghi · Jason Baldridge · Jialin Wu · Jingxiang Lin · Unnat Jain
- **Fri 1:50 p.m. - 2:30 p.m.** Talk: Lisa Anne Hendricks
- **Fri 2:30 p.m. - 3:10 p.m.** Talk: Linda Smith
- **Fri 3:10 p.m. - 4:00 p.m.** Poster Session (Coffee Break)
- **Fri 4:00 p.m. - 4:40 p.m.** Talk: Timothy Lillicrap
- **Fri 4:40 p.m. - 5:20 p.m.** Talk: Josh Tenenbaum
- **Fri 5:20 p.m. - 6:00 p.m.** Panel Discussion: Linda Smith · Josh Tenenbaum · Lisa Anne Hendricks · Jay McClelland · Timothy Lillicrap · Jesse Thomason · Jason Baldridge · Louis-Philippe Morency
- **Fri 6:00 p.m. - 6:05 p.m.** Closing Remarks

#### Author Information

##### Abhishek Das (Georgia Tech)

CS PhD student at Georgia Tech. Learning to build machines that can see, think, and talk. Interested in deep learning and computer vision.