As situated agents begin to cohabit with humans in semi-structured environments, they need to understand human instructions, conveyed through a combination of natural language utterances and physical actions. Understanding an instruction means decoding the speaker's intended message from their signal, which requires learning how to ground the symbols in the physical world. This grounding can be ambiguous due to variability in the physical instantiations of concepts: different people may refer to the same colour as turquoise, sky blue, light blue or simply blue, and building blocks that one person considers small may be judged medium-sized by another. Realistically, symbol grounding is a task that must cope with small datasets consisting of a particular user's contextual assignment of meaning to terms.
Our demonstration shows the use of 3D eye-tracking for interactive multi-modal symbol grounding. We propose a setup in which a human can teach an observing agent both the meaning of abstract labels (e.g., "blue", "squidgy") and a sequence of actions that uses the groundings of those labels to achieve a goal, all while executing the task. The human describes the actions they take and the objects they manipulate in natural language, which is parsed into an abstract representation. The GLIDE framework is used to temporally align the parsed natural language utterances with eye-tracking fixations in the world. This allows us to seamlessly gather and label a task-specific dataset in which each data point is an image of an object in the world, labelled with the abstract symbols from the instruction that was aligned with the slice of the eye-tracking trace in which that object was fixated. The agent assumes some prior knowledge in the form of low-level features it can extract from the visual input, such as intensities in the primary colour channels and the areas of pixel patches of a given colour. Probabilistic models are fitted over the features extracted from the resulting dataset in order to teach the agent the groundings of the symbols in that feature space.
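To make the grounding step concrete, the sketch below fits a simple Gaussian model over mean colour-channel intensities extracted from the image crops that the alignment associates with a label such as "blue". The feature choice, class names and data layout here are illustrative assumptions, not the exact models or implementation used in the demonstration.

import numpy as np

def extract_features(crop):
    # Mean intensity per colour channel of an object crop (H x W x 3 array).
    return crop.reshape(-1, 3).mean(axis=0)

class GaussianGrounding:
    # Grounds one symbol (e.g. "blue") as a Gaussian over low-level features.
    def __init__(self):
        self.mean = None
        self.cov = None

    def fit(self, crops):
        X = np.stack([extract_features(c) for c in crops])
        self.mean = X.mean(axis=0)
        # Small diagonal term keeps the covariance invertible for tiny datasets.
        self.cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        return self

    def log_likelihood(self, crop):
        x = extract_features(crop) - self.mean
        inv = np.linalg.inv(self.cov)
        _, logdet = np.linalg.slogdet(self.cov)
        return -0.5 * (x @ inv @ x + logdet + len(x) * np.log(2 * np.pi))

# Hypothetical usage: crops_by_label maps each aligned symbol to its image crops.
# groundings = {label: GaussianGrounding().fit(crops)
#               for label, crops in crops_by_label.items()}
# A new object is then described by the label maximising
# groundings[label].log_likelihood(new_crop).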
Furthermore, once symbols are grounded in their physical instances through the analysis of eye-tracking traces, we will demonstrate how the executed task itself can also be learnt by inducing, in real time, a program that describes the actions the person performs. The induced program is a high-level, LISP-like functional program which can capture abstract concepts such as spatial relations.
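As an illustration of the kind of output this produces, the fragment below shows a hypothetical induced program in a LISP-like surface form together with a minimal interpreter sketch. The operator names (filter, pick, place, left-of) and the program itself are assumptions for exposition only, not the actual DSL or induction procedure used in the demonstration.

# Hypothetical induced program: place the squidgy object to the left of the blue one.
PROGRAM = "(place (pick (filter squidgy scene)) (left-of (filter blue scene)))"

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    # Turn a flat token list into nested lists representing one S-expression.
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return expr
    return token

def evaluate(expr, primitives, env):
    # Evaluate a parsed expression against the agent's primitives and the scene.
    if isinstance(expr, str):
        return env.get(expr, expr)  # symbols resolve to scene objects or stay as labels
    op, *args = expr
    return primitives[op](*[evaluate(a, primitives, env) for a in args])

# Hypothetical usage: the agent supplies primitives backed by its grounded models
# and motor routines, e.g.
# primitives = {"filter": filter_by_grounding, "pick": pick_up,
#               "place": place_at, "left-of": left_of_region}
# evaluate(parse(tokenize(PROGRAM)), primitives, {"scene": current_scene})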