Video-language models (VLMs), large models pre-trained on vast numbers of noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub-activity, and atomic action levels. We redefine activity parsing as the overarching task of activity graph generation, which requires understanding human activities at all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences. Lastly, we present a model-agnostic and lightweight approach to adapting and evaluating VLMs by incorporating structured knowledge from activity graphs into VLMs, addressing the individual limitations of language and graphical models. We demonstrate strong performance on few-shot activity parsing, and our framework is intended to foster future research in the joint modeling of videos, graphs, and language.
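The abstract describes activity graphs as a three-level hierarchy (activity, sub-activity, atomic action) whose annotations can be turned into natural language sentences. The following is a minimal sketch of that idea under stated assumptions: the class names `Activity`, `SubActivity`, and `AtomicAction`, the source/predicate/target encoding of atomic actions, and the sentence templates are all hypothetical illustrations, not the MOMA-LRG schema, its released tooling, or the authors' conversion procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AtomicAction:
    # Hypothetical encoding (not the MOMA-LRG schema): an actor (source)
    # performs a predicate, optionally on another entity (target).
    source: str
    predicate: str
    target: Optional[str] = None


@dataclass
class SubActivity:
    # A step within the activity, holding its atomic actions.
    description: str
    atomic_actions: List[AtomicAction] = field(default_factory=list)


@dataclass
class Activity:
    # Root of the hypothetical activity graph: the video-level label.
    name: str
    sub_activities: List[SubActivity] = field(default_factory=list)


def _sentence(text: str) -> str:
    # Capitalize only the first character (unlike str.capitalize,
    # which lowercases the rest) and terminate with a period.
    return text[0].upper() + text[1:] + "."


def atomic_to_sentence(action: AtomicAction) -> str:
    # Assumed template: "<source> <predicate> [<target>]."
    parts = [action.source, action.predicate]
    if action.target is not None:
        parts.append(action.target)
    return _sentence(" ".join(parts))


def graph_to_sentences(activity: Activity) -> List[str]:
    # Walk the hierarchy top-down: the activity first, then each
    # sub-activity followed by its atomic actions, one sentence per node.
    sentences = [_sentence(f"the activity is {activity.name}")]
    for sub in activity.sub_activities:
        sentences.append(_sentence(f"sub-activity: {sub.description}"))
        sentences.extend(atomic_to_sentence(a) for a in sub.atomic_actions)
    return sentences


if __name__ == "__main__":
    graph = Activity(
        name="dining",
        sub_activities=[
            SubActivity(
                description="the waiter serves food",
                atomic_actions=[
                    AtomicAction("the waiter", "is carrying", "a plate"),
                    AtomicAction("a customer", "is sitting"),
                ],
            )
        ],
    )
    for line in graph_to_sentences(graph):
        print(line)
```

Running the example prints one sentence per graph node: "The activity is dining.", "Sub-activity: the waiter serves food.", "The waiter is carrying a plate.", and "A customer is sitting." Fixed templates keep every sentence traceable to a single node, which is one plausible way to make graph annotations consumable by a VLM's text encoder; the templates actually used for MOMA-LRG may differ.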
Author Information
Zelun Luo (Stanford University)
Zane Durante (Stanford University)
Linden Li (Computer Science Department, Stanford University)
Wanze Xie (Stanford University)
Ruochen Liu
Emily Jin (Stanford University)
Zhuoyi Huang (Stanford University)
I am a second-year master's student in computer science at Stanford. My research and industry interests lie in computer vision, reinforcement learning, full-stack development, and large-scale data analysis related to health care or other social goods.
Lun Yu Li (Stanford University)
Jiajun Wu (Stanford University)
Juan Carlos Niebles (Stanford University)
Ehsan Adeli (Stanford University)
Fei-Fei Li (Stanford University)
More from the Same Authors
2021: Physion: Evaluating Physical Prediction from Vision in Humans and Machines
Daniel Bear · Elias Wang · Damian Mrowca · Felix Binder · Hsiao-Yu Tung · Pramod RT · Cameron Holdaway · Sirui Tao · Kevin Smith · Fan-Yun Sun · Fei-Fei Li · Nancy Kanwisher · Josh Tenenbaum · Dan Yamins · Judith Fan
2022: What Makes Certain Pre-Trained Visual Representations Better for Robotic Learning?
Kyle Hsu · Tyler Lum · Ruohan Gao · Shixiang (Shane) Gu · Jiajun Wu · Chelsea Finn
2022: A Control-Centric Benchmark for Video Prediction
Stephen Tian · Chelsea Finn · Jiajun Wu
2022: VIMA: General Robot Manipulation with Multimodal Prompts
Yunfan Jiang · Agrim Gupta · Zichen Zhang · Guanzhi Wang · Yongqiang Dou · Yanjun Chen · Fei-Fei Li · Anima Anandkumar · Yuke Zhu · Linxi Fan
2022: Giving Robots a Hand: Broadening Generalization via Hand-Centric Human Video Demonstrations
Moo J Kim · Jiajun Wu · Chelsea Finn
2022 Spotlight: Lightning Talks 5A-3
Minting Pan · Xiang Chen · Wenhan Huang · Can Chang · Zhecheng Yuan · Jianzhun Shao · Yushi Cao · Peihao Chen · Ke Xue · Zhengrong Xue · Zhiqiang Lou · Xiangming Zhu · Lei Li · Zhiming Li · Kai Li · Jiacheng Xu · Dongyu Ji · Ni Mu · Kun Shao · Tianpei Yang · Kunyang Lin · Ningyu Zhang · Yunbo Wang · Lei Yuan · Bo Yuan · Hongchang Zhang · Jiajun Wu · Tianze Zhou · Xueqian Wang · Ling Pan · Yuhang Jiang · Xiaokang Yang · Xiaozhuan Liang · Hao Zhang · Weiwen Hu · Miqing Li · YAN ZHENG · Matthew Taylor · Huazhe Xu · Shumin Deng · Chao Qian · YI WU · Shuncheng He · Wenbing Huang · Chuanqi Tan · Zongzhang Zhang · Yang Gao · Jun Luo · Yi Li · Xiangyang Ji · Thomas Li · Mingkui Tan · Fei Huang · Yang Yu · Huazhe Xu · Dongge Wang · Jianye Hao · Chuang Gan · Yang Liu · Luo Si · Hangyu Mao · Huajun Chen · Jianye Hao · Jun Wang · Xiaotie Deng
2022 Spotlight: E-MAPP: Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance
Can Chang · Ni Mu · Jiajun Wu · Ling Pan · Huazhe Xu
2022 Poster: ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward
Zixian Ma · Rose Wang · Fei-Fei Li · Michael Bernstein · Ranjay Krishna
2022 Poster: E-MAPP: Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance
Can Chang · Ni Mu · Jiajun Wu · Ling Pan · Huazhe Xu
2022 Poster: CLEVRER-Humans: Describing Physical and Causal Events the Human Way
Jiayuan Mao · Xuelin Yang · Xikun Zhang · Noah Goodman · Jiajun Wu
2022 Poster: Interaction Modeling with Multiplex Attention
Fan-Yun Sun · Isaac Kauvar · Ruohan Zhang · Jiachen Li · Mykel J Kochenderfer · Jiajun Wu · Nick Haber
2022 Poster: Geoclidean: Few-Shot Generalization in Euclidean Geometry
Joy Hsu · Jiajun Wu · Noah Goodman
2022 Poster: IKEA-Manual: Seeing Shape Assembly Step by Step
Ruocheng Wang · Yunzhi Zhang · Jiayuan Mao · Ran Zhang · Chin-Yi Cheng · Jiajun Wu
2022 Poster: Unsupervised Learning of Shape Programs with Repeatable Implicit Parts
Boyang Deng · Sumith Kulal · Zhengyang Dong · Congyue Deng · Yonglong Tian · Jiajun Wu
2021 Poster: Grammar-Based Grounded Lexicon Learning
Jiayuan Mao · Freda Shi · Jiajun Wu · Roger Levy · Josh Tenenbaum
2021 Poster: MOMA: Multi-Object Multi-Actor Activity Parsing
Zelun Luo · Wanze Xie · Siddharth Kapoor · Yiyun Liang · Michael Cooper · Juan Carlos Niebles · Ehsan Adeli · Fei-Fei Li
2020: Panel #2
Oren Etzioni · Heng Ji · Subbarao Kambhampati · Victoria Lin · Jiajun Wu
2020: Q&A #2
Heng Ji · Jure Leskovec · Jiajun Wu
2020: Invited Talk #6
Jiajun Wu
2020 Poster: Multi-Plane Program Induction with 3D Box Priors
Yikai Li · Jiayuan Mao · Xiuming Zhang · Bill Freeman · Josh Tenenbaum · Noah Snavely · Jiajun Wu
2020 Poster: Learning Physical Graph Representations from Visual Scenes
Daniel Bear · Chaofei Fan · Damian Mrowca · Yunzhu Li · Seth Alter · Aran Nayebi · Jeremy Schwartz · Li Fei-Fei · Jiajun Wu · Josh Tenenbaum · Daniel Yamins
2020 Oral: Learning Physical Graph Representations from Visual Scenes
Daniel Bear · Chaofei Fan · Damian Mrowca · Yunzhu Li · Seth Alter · Aran Nayebi · Jeremy Schwartz · Li Fei-Fei · Jiajun Wu · Josh Tenenbaum · Daniel Yamins
2020: Neuro-Symbolic Visual Concept Learning
Jiajun Wu
2019: Poster Session
Ethan Harris · Tom White · Oh Hyeon Choung · Takashi Shinozaki · Dipan Pal · Katherine L. Hermann · Judy Borowski · Camilo Fosco · Chaz Firestone · Vijay Veerabadran · Benjamin Lahner · Chaitanya Ryali · Fenil Doshi · Pulkit Singh · Sharon Zhou · Michel Besserve · Michael Chang · Anelise Newman · Mahesan Niranjan · Jonathon Hare · Daniela Mihai · Marios Savvides · Simon Kornblith · Christina M Funke · Aude Oliva · Virginia de Sa · Dmitry Krotov · Colin Conwell · George Alvarez · Alex Kolchinski · Shengjia Zhao · Mitchell Gordon · Michael Bernstein · Stefano Ermon · Arash Mehrjou · Bernhard Schölkopf · John Co-Reyes · Michael Janner · Jiajun Wu · Josh Tenenbaum · Sergey Levine · Yalda Mohsenzadeh · Zhenglong Zhou
2019 Poster: Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations
Kevin Smith · Lingjie Mei · Shunyu Yao · Jiajun Wu · Elizabeth Spelke · Josh Tenenbaum · Tomer Ullman
2019 Poster: Visual Concept-Metaconcept Learning
Chi Han · Jiayuan Mao · Chuang Gan · Josh Tenenbaum · Jiajun Wu
2018 Workshop: NIPS Workshop on Machine Learning for Intelligent Transportation Systems 2018
Li Erran Li · Anca Dragan · Juan Carlos Niebles · Silvio Savarese
2018 Workshop: Modeling the Physical World: Learning, Perception, and Control
Jiajun Wu · Kelsey Allen · Kevin Smith · Jessica Hamrick · Emmanuel Dupoux · Marc Toussaint · Josh Tenenbaum
2018 Poster: Learning to Reconstruct Shapes from Unseen Classes
Xiuming Zhang · Zhoutong Zhang · Chengkai Zhang · Josh Tenenbaum · Bill Freeman · Jiajun Wu
2018 Oral: Learning to Reconstruct Shapes from Unseen Classes
Xiuming Zhang · Zhoutong Zhang · Chengkai Zhang · Josh Tenenbaum · Bill Freeman · Jiajun Wu
2018 Poster: Visual Object Networks: Image Generation with Disentangled 3D Representations
Jun-Yan Zhu · Zhoutong Zhang · Chengkai Zhang · Jiajun Wu · Antonio Torralba · Josh Tenenbaum · Bill Freeman
2018 Poster: Learning to Decompose and Disentangle Representations for Video Prediction
Jun-Ting Hsieh · Bingbin Liu · De-An Huang · Li Fei-Fei · Juan Carlos Niebles
2018 Poster: Learning to Exploit Stability for 3D Scene Parsing
Yilun Du · Zhijian Liu · Hector Basevi · Ales Leonardis · Bill Freeman · Josh Tenenbaum · Jiajun Wu
2018 Poster: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
Kexin Yi · Jiajun Wu · Chuang Gan · Antonio Torralba · Pushmeet Kohli · Josh Tenenbaum
2018 Poster: 3D-Aware Scene Manipulation via Inverse Graphics
Shunyu Yao · Tzu Ming Hsu · Jun-Yan Zhu · Jiajun Wu · Antonio Torralba · Bill Freeman · Josh Tenenbaum
2018 Spotlight: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
Kexin Yi · Jiajun Wu · Chuang Gan · Antonio Torralba · Pushmeet Kohli · Josh Tenenbaum
2017 Workshop: 2017 NIPS Workshop on Machine Learning for Intelligent Transportation Systems
Li Erran Li · Anca Dragan · Juan Carlos Niebles · Silvio Savarese
2017 Spotlight: Shape and Material from Sound
Zhoutong Zhang · Qiujia Li · Zhengjia Huang · Jiajun Wu · Josh Tenenbaum · Bill Freeman
2017 Spotlight: Scene Physics Acquisition via Visual De-animation
Jiajun Wu · Erika Lu · Pushmeet Kohli · Bill Freeman · Josh Tenenbaum
2017 Poster: Learning to See Physics via Visual De-animation
Jiajun Wu · Erika Lu · Pushmeet Kohli · Bill Freeman · Josh Tenenbaum
2017 Poster: Shape and Material from Sound
Zhoutong Zhang · Qiujia Li · Zhengjia Huang · Jiajun Wu · Josh Tenenbaum · Bill Freeman
2017 Poster: MarrNet: 3D Shape Reconstruction via 2.5D Sketches
Jiajun Wu · Yifan Wang · Tianfan Xue · Xingyuan Sun · Bill Freeman · Josh Tenenbaum
2017 Poster: Self-Supervised Intrinsic Image Decomposition
Michael Janner · Jiajun Wu · Tejas Kulkarni · Ilker Yildirim · Josh Tenenbaum
2016: Invited Talk: Visual Understanding of Human Activities for Smart Vehicles and Interactive Environments (Juan Carlos Niebles, Stanford)
Juan Carlos Niebles
2016 Workshop: Intuitive Physics
Adam Lerer · Jiajun Wu · Josh Tenenbaum · Emmanuel Dupoux · Rob Fergus
2016 Poster: Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling
Jiajun Wu · Chengkai Zhang · Tianfan Xue · Bill Freeman · Josh Tenenbaum
2016 Poster: Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
Tianfan Xue · Jiajun Wu · Katherine Bouman · Bill Freeman
2016 Oral: Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
Tianfan Xue · Jiajun Wu · Katherine Bouman · Bill Freeman
2015 Poster: Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning
Jiajun Wu · Ilker Yildirim · Joseph Lim · Bill Freeman · Josh Tenenbaum
2009 Poster: Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis
Barry W Chai · Dirk B Walther · Diane M Beck · Fei-Fei Li
2009 Poster: Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions
Bangpeng Yao · Dirk B Walther · Diane M Beck · Fei-Fei Li