We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.
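As a rough illustration of the kind of max-margin ranking described above, the sketch below scores an image-sentence pair from its fragment embeddings and penalizes mismatched pairs that score within a margin of the true pair, in both retrieval directions. This is a minimal sketch and not the authors' exact objective: the function names, the averaging of positive fragment similarities, and the margin value are all assumptions made for illustration, and the fragment-level alignment term mentioned in the abstract is not shown.

```python
import numpy as np


def image_sentence_score(img_frags, sent_frags):
    """Score one image-sentence pair from fragment embeddings.

    img_frags:  array of shape (n_img_fragments, d)
    sent_frags: array of shape (n_sent_fragments, d)
    Assumption: the pair score is the mean of the positive
    fragment-to-fragment inner products.
    """
    sims = img_frags @ sent_frags.T           # all pairwise inner products
    return float(np.maximum(sims, 0.0).mean())


def ranking_loss(scores, margin=1.0):
    """Bidirectional max-margin ranking loss (hypothetical helper).

    scores[k, l] is the score of image k with sentence l;
    diagonal entries are the true image-sentence pairs.
    """
    diag = np.diag(scores)
    # Image -> sentence: the true sentence should outscore every other
    # sentence for the same image by at least `margin`.
    cost_sent = np.maximum(0.0, margin + scores - diag[:, None])
    # Sentence -> image: the true image should outscore every other
    # image for the same sentence by at least `margin`.
    cost_img = np.maximum(0.0, margin + scores - diag[None, :])
    # True pairs do not compete with themselves.
    np.fill_diagonal(cost_sent, 0.0)
    np.fill_diagonal(cost_img, 0.0)
    return float(cost_sent.sum() + cost_img.sum())
```

With well-separated scores such as `np.array([[2.0, 0.0], [0.0, 2.0]])` and a margin of 1, every mismatched pair already trails the true pair by more than the margin, so the loss is zero.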
Author Information
Andrej Karpathy (Tesla)
Armand Joulin (Stanford University)
Li Fei-Fei (Stanford University)
More from the Same Authors
- 2021: Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning
  Kaylee Burns · Christopher D Manning · Li Fei-Fei
- 2021: What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
  Ajay Mandlekar · Danfei Xu · Josiah Wong · Chen Wang · Li Fei-Fei · Silvio Savarese · Yuke Zhu · Roberto Martín-Martín
- 2020: Closing remarks from Fei-Fei Li, Sequoia Professor of Computer Science, Stanford University & Co-Director of Stanford’s Human-Centered AI Institute
  Li Fei-Fei
- 2020: Q/A for invited talk #5
  Li Fei-Fei
- 2020: Creating diverse tasks to catalyze robot learning
  Li Fei-Fei
- 2019 Poster: Regression Planning Networks
  Danfei Xu · Roberto Martín-Martín · De-An Huang · Yuke Zhu · Silvio Savarese · Li Fei-Fei
- 2019 Poster: HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
  Sharon Zhou · Mitchell Gordon · Ranjay Krishna · Austin Narcomey · Li Fei-Fei · Michael Bernstein
- 2019 Oral: HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
  Sharon Zhou · Mitchell Gordon · Ranjay Krishna · Austin Narcomey · Li Fei-Fei · Michael Bernstein
- 2018 Poster: Learning to Play With Intrinsically-Motivated, Self-Aware Agents
  Nick Haber · Damian Mrowca · Stephanie Wang · Li Fei-Fei · Daniel Yamins
- 2018 Poster: Learning to Decompose and Disentangle Representations for Video Prediction
  Jun-Ting Hsieh · Bingbin Liu · De-An Huang · Li Fei-Fei · Juan Carlos Niebles
- 2018 Poster: Flexible neural representation for physics prediction
  Damian Mrowca · Chengxu Zhuang · Elias Wang · Nick Haber · Li Fei-Fei · Josh Tenenbaum · Daniel Yamins
- 2017: Keynote II: Fei-Fei Li, Stanford
  Li Fei-Fei
- 2017 Poster: Label Efficient Learning of Transferable Representations across Domains and Tasks
  Zelun Luo · Yuliang Zou · Judy Hoffman · Li Fei-Fei
- 2016: Knowledge Acquisition for Visual Question Answering via Iterative Querying
  Yuke Zhu · Joseph Lim · Li Fei-Fei
- 2012 Workshop: Big Data Meets Computer Vision: First International Workshop on Large Scale Visual Recognition and Retrieval
  Jia Deng · Samy Bengio · Yuanqing Lin · Li Fei-Fei
- 2012 Poster: Shifting Weights: Adapting Object Detectors from Image to Video
  Kevin Tang · Vignesh Ramanathan · Li Fei-Fei · Daphne Koller
- 2012 Poster: Emergence of Object-Selective Features in Unsupervised Feature Learning
  Adam Coates · Andrej Karpathy · Andrew Y Ng
- 2012 Demonstration: EVA: Engine for Visual Annotation
  Jia Deng · Jonathan Krause · Zhiheng Huang · Alexander C Berg · Li Fei-Fei
- 2011 Poster: Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
  Jia Deng · Sanjeev Satheesh · Alexander C Berg · Li Fei-Fei
- 2011 Poster: Large-Scale Category Structure Aware Image Categorization
  Bin Zhao · Li Fei-Fei · Eric Xing
- 2010 Session: Oral Session 10
  Li Fei-Fei
- 2010 Poster: Large Margin Learning of Upstream Scene Understanding Models
  Jun Zhu · Li-Jia Li · Li Fei-Fei · Eric Xing
- 2010 Poster: Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification
  Li-Jia Li · Hao Su · Eric Xing · Li Fei-Fei