Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild (for example, as assistants for people with impaired vision), they must understand a much larger number and variety of visual concepts. To address this problem, we teach image captioning models new visual concepts from labeled images and object detection datasets. Since image labels and object classes can be interpreted as partial captions, we formulate this problem as learning from partially-specified sequence data. We then propose a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences, which we represent using finite state automata. In the context of image captioning, our method lifts the restriction that previously required image captioning models to be trained only on paired image-sentence corpora, or otherwise required specialized model architectures to take advantage of alternative data modalities. Applying our approach to an existing neural captioning model, we achieve state-of-the-art results on the novel object captioning task using the COCO dataset. We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.
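To make the idea of an image label acting as a partial caption more concrete, here is a minimal sketch of one way to encode such a constraint as a finite state automaton. It assumes that a partial caption is a list of keyword sets (for example, a class name and its plural). Under that assumption, the automaton accepts a complete caption only if the caption contains at least one word from every set. The class name `PartialCaptionFSA` and its methods are hypothetical and used only for illustration. This is not the authors' implementation, and it does not show how the sequence model is trained against the automaton.

```python
from itertools import combinations
from typing import FrozenSet, List, Set

# A state records which keyword sets (by index) have been satisfied so far.
State = FrozenSet[int]


class PartialCaptionFSA:
    """Hypothetical FSA accepting any token sequence that mentions at
    least one word from every keyword set (an assumed constraint form)."""

    def __init__(self, label_sets: List[Set[str]]):
        self.label_sets = label_sets
        k = len(label_sets)
        # One state per subset of satisfied keyword sets: 2**k states total.
        self.states: List[State] = [
            frozenset(c) for r in range(k + 1) for c in combinations(range(k), r)
        ]
        self.start: State = frozenset()          # nothing satisfied yet
        self.accept: State = frozenset(range(k))  # every set satisfied

    def step(self, state: State, token: str) -> State:
        # Transition: mark every keyword set containing this token as satisfied.
        hit = {i for i, words in enumerate(self.label_sets) if token in words}
        return state | hit

    def accepts(self, tokens: List[str]) -> bool:
        # Run the automaton over the caption and test for the accepting state.
        state = self.start
        for tok in tokens:
            state = self.step(state, tok)
        return state == self.accept


if __name__ == "__main__":
    fsa = PartialCaptionFSA([{"zebra", "zebras"}, {"grass"}])
    print(len(fsa.states))  # 4
    print(fsa.accepts("two zebras grazing on the grass".split()))  # True
    print(fsa.accepts("a zebra standing in a field".split()))  # False
```

The automaton deliberately leaves every other word in the caption unconstrained. This matches the sense in which a label only partially specifies the caption. With this subset-based construction, the number of states grows as 2^k in the number of keyword sets k, so the sketch is practical only for a handful of labels per image.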
Author Information
Peter Anderson (Georgia Tech)
Research Scientist in Computer Vision / Deep Learning at Georgia Tech. I like to work on problems involving vision, language and embodied agents, such as image captioning, visual question answering (VQA), and vision-and-language navigation (VLN).
Stephen Gould (ANU)
Mark Johnson (Macquarie University)