Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets for a single task. In reality, a truly useful VidL system should generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks, and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future study of advanced VidL models. VALUE is available at https://value-benchmark.github.io/.
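For quick reference, the benchmark's composition can be sketched as a small task registry. The snippet below is purely illustrative and is not the official VALUE toolkit API; the dataset grouping follows the composition reported in the VALUE paper, while the names VALUE_TASKS and iter_datasets are hypothetical.

# Illustrative only: a minimal registry of the 11 VALUE datasets grouped by
# the benchmark's three task types, following the composition reported in the
# VALUE paper. VALUE_TASKS and iter_datasets are hypothetical names, not the
# official VALUE toolkit API.
from typing import Dict, Iterator, List, Tuple

VALUE_TASKS: Dict[str, List[str]] = {
    # (i) text-to-video retrieval
    "retrieval": ["TVR", "How2R", "YC2R", "VATEX-EN-R"],
    # (ii) video question answering
    "qa": ["TVQA", "How2QA", "VIOLIN", "VLEP"],
    # (iii) video captioning
    "captioning": ["TVC", "YC2C", "VATEX-EN-C"],
}

def iter_datasets() -> Iterator[Tuple[str, str]]:
    """Yield (task, dataset) pairs covering all 11 VidL datasets."""
    for task, datasets in VALUE_TASKS.items():
        for name in datasets:
            yield task, name

if __name__ == "__main__":
    assert sum(len(d) for d in VALUE_TASKS.values()) == 11
    for task, name in iter_datasets():
        print(f"{task:>10}: {name}")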
Author Information
Linjie Li (Microsoft)
Jie Lei (Department of Computer Science, UNC, Chapel Hill)
Zhe Gan (Duke University)
Licheng Yu (University of North Carolina, Chapel Hill)
Yen-Chun Chen (Microsoft)
Rohit Pillai
Yu Cheng (Microsoft Research)
Luowei Zhou (Microsoft)
Xin Wang (University of California, Santa Cruz)
William Yang Wang (University of California, Santa Barbara)
William Wang is the Co-Director of UC Santa Barbara's Natural Language Processing group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Chair in Artificial Intelligence and Designs, and an Associate Professor in the Department of Computer Science at the University of California, Santa Barbara. He received his PhD from the School of Computer Science, Carnegie Mellon University. He has broad interests in Artificial Intelligence, including statistical relational learning, information extraction, computational social science, dialog & generation, and vision. He has published more than 100 papers at leading NLP/AI/ML conferences and journals, and received best paper awards (or nominations) at ASRU 2013, CIKM 2013, EMNLP 2015, and CVPR 2019, a DARPA Young Faculty Award (Class of 2018), an IEEE AI's 10 to Watch Award (Class of 2020), an NSF CAREER Award (2021), two Google Faculty Research Awards (2018, 2019), three IBM Faculty Awards (2017-2019), two Facebook Research Awards (2018, 2019), an Amazon AWS Machine Learning Research Award, a JP Morgan Chase Faculty Research Award, an Adobe Research Award in 2018, and the Richard King Mellon Presidential Fellowship in 2011. He frequently serves as an Area Chair or Senior Area Chair for NAACL, ACL, EMNLP, and AAAI. He is an elected member of the IEEE Speech and Language Processing Technical Committee (2021-2023) and a member of the ACM Future of Computing Academy. In addition to research, William enjoys writing scientific articles that impact the broader online community. His work and opinions appear in major tech media outlets such as Wired, VICE, Scientific American, Fortune, Fast Company, NASDAQ, The Next Web, Law.com, and Mental Floss.
Tamara L Berg (Stony Brook University)
Mohit Bansal (UNC Chapel Hill)
Jingjing Liu (Microsoft)
Lijuan Wang
Zicheng Liu (Microsoft)
More from the Same Authors
- 2021 : Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
  Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li
- 2021 : A Dataset for Answering Time-Sensitive Questions »
  Wenhu Chen · Xinyi Wang · William Yang Wang
- 2022 Poster: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks »
  Junke Wang · Dongdong Chen · Zuxuan Wu · Chong Luo · Luowei Zhou · Yucheng Zhao · Yujia Xie · Ce Liu · Yu-Gang Jiang · Lu Yuan
- 2022 : LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning »
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2022 : LAD: Language Augmented Diffusion for Reinforcement Learning »
  Edwin Zhang · Yujie Lu · William Yang Wang · Amy Zhang
- 2022 : Offline Reinforcement Learning with Closed-Form Policy Improvement Operators »
  Jiachen Li · Edwin Zhang · Ming Yin · Qinxun Bai · Yu-Xiang Wang · William Yang Wang
- 2022 : Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction »
  Jiachen Li · Shuo Cheng · Zhenyu Liao · Huayan Wang · William Yang Wang · Qinxun Bai
- 2022 : Distance-Sensitive Offline Reinforcement Learning »
  Li Jianxiong · Xianyuan Zhan · Haoran Xu · Xiangyu Zhu · Jingjing Liu · Ya-Qin Zhang
- 2023 Poster: Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation »
  Jaemin Cho · Abhay Zala · Mohit Bansal
- 2023 Poster: Resolving Interference When Merging Models »
  Prateek Yadav · Derek Tam · Leshem Choshen · Colin Raffel · Mohit Bansal
- 2023 Poster: Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser »
  Yung-Hsuan Lai · Yen-Chun Chen · Frank Wang
- 2023 Poster: Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning »
  Zih-Yun Chiu · Yi-Lin Tuan · William Yang Wang · Michael Yip
- 2023 Poster: DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening »
  Bowen Gao · Bo Qiang · Haichuan Tan · Yinjun Jia · Minsi Ren · Minsi Lu · Jingjing Liu · Wei-Ying Ma · Yanyan Lan
- 2023 Poster: PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation »
  Jialu Li · Mohit Bansal
- 2023 Poster: Self-Chained Image-Language Model for Video Localization and Question Answering »
  Shoubin Yu · Jaemin Cho · Prateek Yadav · Mohit Bansal
- 2023 Poster: Paxion: Patching Action Knowledge in Video-Language Foundation Models »
  Zhenhailong Wang · Ansel Blume · Sha Li · Genglin Liu · Jaemin Cho · Zineng Tang · Mohit Bansal · Heng Ji
- 2023 Poster: LayoutGPT: Compositional Visual Planning and Generation with Large Language Models »
  Weixi Feng · Wanrong Zhu · Tsu-Jui Fu · Varun Jampani · Arjun Akula · Xuehai He · S Basu · Xin Wang · William Yang Wang
- 2023 Poster: Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models »
  Shihao Zhao · Dongdong Chen · Yen-Chun Chen · Jianmin Bao · Shaozhe Hao · Lu Yuan · Kwan-Yee K. Wong
- 2023 Poster: Segment Everything Everywhere All at Once »
  Xueyan Zou · Jianwei Yang · Hao Zhang · Feng Li · Linjie Li · Jianfeng Wang · Lijuan Wang · Jianfeng Gao · Yong Jae Lee
- 2023 Poster: LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation »
  Yujie Lu · Xianjun Yang · Xiujun Li · Xin Wang · William Yang Wang
- 2023 Poster: ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers »
  Kexun Zhang · Danqing Wang · Jingtao Xia · William Yang Wang · Lei Li
- 2023 Poster: Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data »
  Alon Albalak · Colin Raffel · William Yang Wang
- 2023 Poster: Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind »
  Swarnadeep Saha · Peter Hase · Mohit Bansal
- 2023 Poster: PaintSeg: Painting Pixels for Training-free Segmentation »
  Xiang Li · Chung-Ching Lin · Yinpeng Chen · Zicheng Liu · Jinglu Wang · Bhiksha Raj
- 2023 Poster: Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models »
  Peter Hase · Mohit Bansal · Been Kim · Asma Ghandeharioun
- 2023 Poster: Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning »
  Xinyi Wang · Wanrong Zhu · Michael Saxon · Mark Steyvers · William Yang Wang
- 2023 Poster: Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects »
  Zhuofan Ying · Peter Hase · Mohit Bansal
- 2023 Poster: Idempotent Learned Image Compression with Right-Inverse »
  Yanghao Li · Tongda Xu · Yan Wang · Jingjing Liu · Ya-Qin Zhang
- 2023 Poster: Any-to-Any Generation via Composable Diffusion »
  Zineng Tang · Ziyi Yang · Chenguang Zhu · Michael Zeng · Mohit Bansal
- 2023 Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants »
  Mehdi Rezagholizadeh · Peyman Passban · Yue Dong · Yu Cheng · Soheila Samiee · Lili Mou · Qun Liu · Boxing Chen
- 2022 Spotlight: ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models »
  Chunyuan Li · Haotian Liu · Liunian Li · Pengchuan Zhang · Jyoti Aneja · Jianwei Yang · Ping Jin · Houdong Hu · Zicheng Liu · Yong Jae Lee · Jianfeng Gao
- 2022 Spotlight: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks »
  Junke Wang · Dongdong Chen · Zuxuan Wu · Chong Luo · Luowei Zhou · Yucheng Zhao · Yujia Xie · Ce Liu · Yu-Gang Jiang · Lu Yuan
- 2022 Poster: K-LITE: Learning Transferable Visual Models with External Knowledge »
  Sheng Shen · Chunyuan Li · Xiaowei Hu · Yujia Xie · Jianwei Yang · Pengchuan Zhang · Zhe Gan · Lijuan Wang · Lu Yuan · Ce Liu · Kurt Keutzer · Trevor Darrell · Anna Rohrbach · Jianfeng Gao
- 2022 Poster: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone »
  Zi-Yi Dou · Aishwarya Kamath · Zhe Gan · Pengchuan Zhang · Jianfeng Wang · Linjie Li · Zicheng Liu · Ce Liu · Yann LeCun · Nanyun Peng · Jianfeng Gao · Lijuan Wang
- 2022 Poster: ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models »
  Chunyuan Li · Haotian Liu · Liunian Li · Pengchuan Zhang · Jyoti Aneja · Jianwei Yang · Ping Jin · Houdong Hu · Zicheng Liu · Yong Jae Lee · Jianfeng Gao
- 2022 Poster: TVLT: Textless Vision-Language Transformer »
  Zineng Tang · Jaemin Cho · Yixin Nie · Mohit Bansal
- 2022 Poster: NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis »
  Jian Liang · Chenfei Wu · Xiaowei Hu · Zhe Gan · Jianfeng Wang · Lijuan Wang · Zicheng Liu · Yuejian Fang · Nan Duan
- 2022 Poster: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners »
  Zhenhailong Wang · Manling Li · Ruochen Xu · Luowei Zhou · Jie Lei · Xudong Lin · Shuohang Wang · Ziyi Yang · Chenguang Zhu · Derek Hoiem · Shih-Fu Chang · Mohit Bansal · Heng Ji
- 2022 Poster: Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning »
  Yujia Xie · Luowei Zhou · Xiyang Dai · Lu Yuan · Nguyen Bach · Ce Liu · Michael Zeng
- 2022 Poster: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning »
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2022 Poster: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning »
  Haokun Liu · Derek Tam · Mohammed Muqeeth · Jay Mohta · Tenghao Huang · Mohit Bansal · Colin Raffel
- 2022 Poster: GLIPv2: Unifying Localization and Vision-Language Understanding »
  Haotian Zhang · Pengchuan Zhang · Xiaowei Hu · Yen-Chun Chen · Liunian Li · Xiyang Dai · Lijuan Wang · Lu Yuan · Jenq-Neng Hwang · Jianfeng Gao
- 2022 Poster: VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives »
  Zhuofan Ying · Peter Hase · Mohit Bansal
- 2022 Poster: WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models »
  Yonatan Bitton · Nitzan Bitton Guetta · Ron Yosef · Yuval Elovici · Mohit Bansal · Gabriel Stanovsky · Roy Schwartz
- 2021 Poster: Local Explanation of Dialogue Response Generation »
  Yi-Lin Tuan · Connor Pryor · Wenhu Chen · Lise Getoor · William Yang Wang
- 2021 Poster: The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations »
  Peter Hase · Harry Xie · Mohit Bansal
- 2021 Poster: Stronger NAS with Weaker Predictors »
  Junru Wu · Xiyang Dai · Dongdong Chen · Yinpeng Chen · Mengchen Liu · Ye Yu · Zhangyang Wang · Zicheng Liu · Mei Chen · Lu Yuan
- 2021 Poster: Chasing Sparsity in Vision Transformers: An End-to-End Exploration »
  Tianlong Chen · Yu Cheng · Zhe Gan · Lu Yuan · Lei Zhang · Zhangyang Wang
- 2021 Poster: Data-Efficient GAN Training Beyond (Just) Augmentations: A Lottery Ticket Perspective »
  Tianlong Chen · Yu Cheng · Zhe Gan · Jingjing Liu · Zhangyang Wang
- 2021 Poster: The Elastic Lottery Ticket Hypothesis »
  Xiaohan Chen · Yu Cheng · Shuohang Wang · Zhe Gan · Jingjing Liu · Zhangyang Wang
- 2021 Poster: VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer »
  Zineng Tang · Jaemin Cho · Hao Tan · Mohit Bansal
- 2021 Poster: Detecting Moments and Highlights in Videos via Natural Language Queries »
  Jie Lei · Tamara L Berg · Mohit Bansal
- 2021 Poster: Counterfactual Maximum Likelihood Estimation for Training Deep Networks »
  Xinyi Wang · Wenhu Chen · Michael Saxon · William Yang Wang
- 2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies »
  Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela
- 2020 Poster: Large-Scale Adversarial Training for Vision-and-Language Representation Learning »
  Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu
- 2020 Spotlight: Large-Scale Adversarial Training for Vision-and-Language Representation Learning »
  Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu
- 2018 Poster: Dialog-based Interactive Image Retrieval »
  Xiaoxiao Guo · Hui Wu · Yu Cheng · Steven Rennie · Gerald Tesauro · Rogerio Feris
- 2017 Demonstration: Interactive-Length Multi-Task Video Captioning with Cooperative Feedback »
  Han Guo · Ramakanth Pasunuru · Mohit Bansal
- 2017 Poster: Triangle Generative Adversarial Networks »
  Zhe Gan · Liqun Chen · Weiyao Wang · Yuchen Pu · Yizhe Zhang · Hao Liu · Chunyuan Li · Lawrence Carin
- 2017 Poster: VAE Learning via Stein Variational Gradient Descent »
  Yuchen Pu · Zhe Gan · Ricardo Henao · Chunyuan Li · Shaobo Han · Lawrence Carin
- 2017 Poster: Deconvolutional Paragraph Representation Learning »
  Yizhe Zhang · Dinghan Shen · Guoyin Wang · Zhe Gan · Ricardo Henao · Lawrence Carin
- 2017 Poster: Adversarial Symmetric Variational Autoencoder »
  Yuchen Pu · Weiyao Wang · Ricardo Henao · Liqun Chen · Zhe Gan · Chunyuan Li · Lawrence Carin
- 2011 Poster: Im2Text: Describing Images Using 1 Million Captioned Photographs »
  Vicente Ordonez · Girish Kulkarni · Tamara L Berg
- 2011 Spotlight: Im2Text: Describing Images Using 1 Million Captioned Photographs »
  Vicente Ordonez · Girish Kulkarni · Tamara L Berg