Learning by Instruction
Shashank Srivastava · Igor Labutov · Bishan Yang · Amos Azaria · Tom Mitchell

Sat Dec 08 05:00 AM -- 03:30 PM (PST) @ Room 516 AB
Event URL: https://sites.google.com/view/lbi2018/

Today, machine learning is largely about pattern discovery and function approximation. But as computing devices that interact with us in natural language become ubiquitous (e.g., Siri, Alexa, Google Now), and as computer perceptual abilities become more accurate, they open an exciting possibility of enabling end-users to teach machines in much the same way that humans teach one another. Natural language conversation, gesturing, demonstrating, teleoperating and other modes of communication offer a new paradigm for machine learning through instruction from humans. This builds on several existing machine learning paradigms (e.g., active learning, supervised learning, reinforcement learning), but also brings a new set of advantages and research challenges that lie at the intersection of several fields including machine learning, natural language understanding, computer perception, and HCI.

The aim of this workshop is to engage researchers from these diverse fields to explore fundamental research questions in this new area, such as:
How do people interact with machines when teaching them new learning tasks and knowledge?
What novel machine learning models and algorithms are needed to learn from human instruction?
What are the practical considerations in building systems that can learn from instruction?

Sat 5:30 a.m. - 5:35 a.m.
Introduction (Welcome)
Sat 5:35 a.m. - 6:00 a.m.
Teaching Machines like we Teach People (Talk from Organizers)
Sat 6:00 a.m. - 6:30 a.m.

Natural language understanding in grounded interactive scenarios is tightly coupled with the actions the system generates. The action space used determines much of the complexity of the problem and the type of reasoning required. In this talk, I will describe our approach to learning to map instructions and observations to continuous control of a realistic quadcopter drone. This scenario raises challenging new questions, including: how can we use demonstrations to learn to bridge the gap between the high-level concepts of language and low-level robot controls? And how do we design models that continuously observe, control, and react to a rapidly changing environment? This work uses a new publicly available evaluation benchmark.

Yoav Artzi
Sat 6:30 a.m. - 7:00 a.m.
A Cognitive Architecture Approach to Interactive Task Learning (Invited Talk)
John Laird
Sat 7:00 a.m. - 7:15 a.m.

We introduce a framework for Compositional Imitation Learning and Execution (CompILE) of hierarchically-structured behavior. CompILE learns reusable, variable-length segments of behavior from demonstration data using a novel unsupervised, fully-differentiable sequence segmentation module. These learned behaviors can then be re-composed and executed to perform new tasks. At training time, CompILE auto-encodes observed behavior into a sequence of latent codes, each corresponding to a variable-length segment in the input sequence. Once trained, our model generalizes to sequences of longer length and from environment instances not seen during training. We evaluate our model in a challenging 2D multi-task environment and show that CompILE can find correct task boundaries and event encodings in an unsupervised manner without requiring annotated demonstration data. We demonstrate that latent codes and associated behavior policies discovered by CompILE can be used by a hierarchical agent, where the high-level policy selects actions in the latent code space, and the low-level, task-specific policies are simply the learned decoders. We found that our agent could learn given only sparse rewards, where agents without task-specific policies struggle.

Thomas Kipf
Sat 7:15 a.m. - 7:30 a.m.

In the standard formulation of imitation learning, the agent starts from scratch without the means to take advantage of an informative prior. As a result, the expert's demonstrations have to either be optimal, or contain a known mode of sub-optimality that could be modeled. In this work, we consider instead the problem of imitation learning from imperfect demonstrations where a small number of demonstrations containing unstructured imperfections is available. In particular, these demonstrations contain large systematic biases, or fail to complete the task in unspecified ways. Our Learning to Learn From Imperfect Demonstrations (LID) framework casts this problem as a meta-learning problem, where the agent meta-learns a robust imitation algorithm that is able to infer the correct policy despite these imperfections, by taking advantage of an informative prior. We demonstrate the robustness of this algorithm on 2D reaching tasks and on multitask door opening and picking tasks with a simulated robot arm, where the demonstration merely gestures for the intended target. Despite not seeing a demonstration that completes the task, the agent is able to draw lessons from its prior experience, correctly inferring a policy that accomplishes the task where the demonstration fails to.

Ge Yang, Chelsea Finn
Sat 8:00 a.m. - 8:30 a.m.
Natural Language Supervision (Invited Talk)
Percy Liang
Sat 8:30 a.m. - 9:00 a.m.
Control Algorithms for Imitation Learning from Observation (Invited Talk)
Peter Stone
Sat 9:00 a.m. - 9:15 a.m.

Reinforcement learning is a promising framework for solving control problems, but its use in practical situations is hampered by the fact that reward functions are often difficult to engineer. Specifying goals and tasks for autonomous machines, such as robots, is a significant challenge: conventionally, reward functions and goal states have been used to communicate objectives. But people can communicate objectives to each other simply by describing or demonstrating them. How can we build learning algorithms that will allow us to tell machines what we want them to do? In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments. We propose language-conditioned reward learning (LC-RL), which grounds language commands as a reward function represented by a deep neural network. We demonstrate that our model learns rewards that transfer to novel tasks and environments on realistic, high-dimensional visual environments with natural language commands, whereas directly learning a language-conditioned policy leads to poor performance.

Justin Fu
Sat 9:15 a.m. - 9:30 a.m.

This paper examines the problem of how to teach multiple tasks to a Reinforcement Learning (RL) agent. To this end, we use Linear Temporal Logic (LTL) as a language for specifying multiple tasks in a manner that supports the composition of learned skills. We also propose a novel algorithm that exploits LTL progression and off-policy RL to speed up learning without compromising convergence guarantees, and show that our method outperforms the state of the art.

Rodrigo Toro Icarte, Sheila McIlraith
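The LTL progression mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the tuple encoding of formulas, the helper names (`progress`, `_and`, `_or`) and the "coffee then office" task are all assumptions chosen for illustration. Progression rewrites a task formula after each observation so that the result describes what still remains to be done.

```python
# Minimal sketch of LTL progression (illustrative encoding, not the paper's code).
# Formulas are nested tuples; negation is assumed to apply only to propositions.
TRUE, FALSE = ("true",), ("false",)

def _and(a, b):
    """Conjunction with constant simplification."""
    if a == FALSE or b == FALSE:
        return FALSE
    if a == TRUE:
        return b
    if b == TRUE:
        return a
    return ("and", a, b)

def _or(a, b):
    """Disjunction with constant simplification."""
    if a == TRUE or b == TRUE:
        return TRUE
    if a == FALSE:
        return b
    if b == FALSE:
        return a
    return ("or", a, b)

def progress(formula, true_props):
    """Return the formula that must hold from the next step onward,
    given the set of propositions that are true in the current step."""
    op = formula[0]
    if op in ("true", "false"):
        return formula
    if op == "prop":
        return TRUE if formula[1] in true_props else FALSE
    if op == "not":  # formula[1] is assumed to be a ("prop", name) tuple
        return FALSE if formula[1][1] in true_props else TRUE
    if op == "and":
        return _and(progress(formula[1], true_props), progress(formula[2], true_props))
    if op == "or":
        return _or(progress(formula[1], true_props), progress(formula[2], true_props))
    if op == "next":
        return formula[1]
    if op == "eventually":
        return _or(progress(formula[1], true_props), formula)
    if op == "until":
        return _or(progress(formula[2], true_props),
                   _and(progress(formula[1], true_props), formula))
    raise ValueError(f"unknown operator: {op}")

# Hypothetical task: eventually get coffee, and afterwards eventually reach the office.
task = ("eventually", ("and", ("prop", "coffee"),
                       ("next", ("eventually", ("prop", "office")))))

after_coffee = progress(task, {"coffee"})
after_office = progress(after_coffee, {"office"})
```

In this sketch, a task that progresses to `TRUE` has been completed and one that progresses to `FALSE` can no longer be satisfied, so an RL agent could in principle use the progressed formula as a compact summary of the remaining task.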
Sat 10:30 a.m. - 11:00 a.m.
Meta-Learning to Follow Instructions, Examples, and Demonstrations (Invited Talk)
Sergey Levine
Sat 11:00 a.m. - 11:30 a.m.
Learning to Understand Natural Language Instructions through Human-Robot Dialog (Invited Talk)
Ray Mooney
Sat 11:30 a.m. - 11:45 a.m.

Reinforcement learning (RL) agents optimize only the specified features and are indifferent to anything left out inadvertently. This means that we must not only tell a household robot what to do, but also the much larger space of what not to do. It is easy to forget these preferences, since we are so used to having them satisfied. Our key insight is that when a robot is deployed in an environment that humans act in, the state of the environment is already optimized for what humans want. We can therefore use this implicit information from the state to fill in the blanks. We develop an algorithm based on Maximum Causal Entropy IRL and use it to evaluate the idea in a suite of proof-of-concept environments designed to show its properties. We find that information from the initial state can be used to infer both side effects that should be avoided and preferences for how the environment should be organized.

Rohin Shah
Sat 11:45 a.m. - 12:00 p.m.

Many interactive intelligent systems, such as recommendation and information retrieval systems, treat users as a passive data source. Yet users form mental models of systems and, instead of passively providing feedback to the queries of the system, they will strategically plan their actions within the constraints of the mental model to steer the system and achieve their goals faster. We propose to explicitly account for the user's theory of the AI's mind in the user model: the intelligent system has a model of the user having a model of the intelligent system. We study a case where the system is a contextual bandit and the user model is a Markov decision process that plans based on a simpler model of the bandit. Inference in the model can be reduced to probabilistic inverse reinforcement learning, with the nested bandit model defining the transition dynamics, and is implemented using probabilistic programming. Our results show that improved performance is achieved if users can form accurate mental models that the system can capture, implying that predictability of the interactive intelligent system is important not only for the user experience but also for the design of the system's statistical models.

Tomi Peltola
Sat 12:30 p.m. - 1:15 p.m.
Poster Session
Carl Trimbach, Mennatullah Siam, Rodrigo Toro Icarte, Falcon Dai, Sheila McIlraith, Matthew Rahtz, Rob Sheline, Chris MacLellan, Carolin Lawrence, Stefan Riezler, Dylan Hadfield-Menell, Fang-I Hsiao
Sat 1:15 p.m. - 1:30 p.m.

We study the problem of inverse reinforcement learning (IRL) with the added twist that the learner is assisted by a helpful teacher. More formally, we tackle the following algorithmic question: How could a teacher provide an informative sequence of demonstrations to an IRL agent to speed up the learning process? We prove rigorous convergence guarantees for a new iterative teaching algorithm that adaptively chooses demonstrations based on the learner’s current performance. Extensive experiments with a car driving simulator environment show that learning can be sped up drastically compared to an uninformative teacher.

Adish Singla, Rati Devidze
Sat 1:30 p.m. - 2:00 p.m.
Teaching through Dialogue and Games (Invited Talk)
Jason E Weston
Sat 2:00 p.m. - 2:45 p.m.
Panel Discussion (Discussion Panel)

Author Information

Shashank Srivastava (Microsoft Research)
Igor Labutov (Cornell University)
Bishan Yang (Cornell University)
Amos Azaria (Ariel University)
Tom Mitchell (Carnegie Mellon University)
