This demonstration illustrates how research results in grounded language learning and understanding can be applied in a cooperative task between an intelligent agent and a human. The task, undertaken by a robot, is the question-answering game GuessWhat?! [1][2]
Providing human-robot interaction in the real world requires interfacing GuessWhat?! with speech recognition and synthesis modules, video processing and recognition algorithms, and the robot’s control module. One main challenge is adapting GuessWhat?! to work with images outside of MSCOCO’s domain. This required implementing a ROS pipeline that takes images from a Kinect, ensures image quality with blur detection, extracts VGG-16 feature vectors, segments objects using Mask R-CNN, and extracts position information from the segmented objects. Images from the pipeline are used by GuessWhat?! in tandem with utterances from the player. The Snips voice assistant recognizes whether the player answers “Yes”, “No”, or “Not Applicable”, and also provides speech synthesis, converting questions generated by GuessWhat?! into speech for the player. OpenPose is used to detect potential players, allowing IRL-1 to interact with them throughout the game.
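As one illustration of the image-quality step, the sketch below shows a minimal ROS node that discards blurry Kinect frames before they reach feature extraction. This is not the demonstration's actual code: the topic names, node name, and threshold are assumptions, and the variance-of-the-Laplacian heuristic is simply one common way to implement blur detection.

```python
# Minimal sketch (illustrative, not the authors' implementation): filter blurry
# Kinect frames so only sharp images are forwarded to the rest of the pipeline
# (VGG-16 feature extraction, Mask R-CNN segmentation, position extraction).
import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

BLUR_THRESHOLD = 100.0  # assumed value; tune for the camera and scene
bridge = CvBridge()
sharp_pub = None

def on_image(msg):
    # Convert the ROS image to OpenCV format and measure sharpness with the
    # variance of the Laplacian (low variance suggests a blurry frame).
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    if sharpness >= BLUR_THRESHOLD:
        # Forward sharp frames to downstream nodes in the pipeline.
        sharp_pub.publish(msg)

if __name__ == "__main__":
    rospy.init_node("blur_filter")
    sharp_pub = rospy.Publisher("/pipeline/sharp_image", Image, queue_size=1)
    rospy.Subscriber("/camera/rgb/image_color", Image, on_image, queue_size=1)
    rospy.spin()
```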
Our open-source code could prove useful as intelligent agents become commonplace and the ability to communicate with people in a given context, such as the home or workplace, becomes imperative. The various functionalities implemented on IRL-1 [3] would benefit any agent assisting a person in a cooperative task.
More details can be found at: https://devine.gel.usherbrooke.ca/abstract_devine.pdf
[1] https://www.guesswhat.ai
[2] https://iglu-chistera.github.io
[3] http://humanrobotinteraction.org/journal/index.php/HRI/article/view/65