VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu

Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets for a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future research on advanced VidL models. VALUE is available at https://value-benchmark.github.io/.
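Text-to-video retrieval benchmarks of this kind are typically scored with recall metrics. As a minimal illustration (this sketch is not the benchmark's released evaluation code; the matrix sim and the helper recall_at_k are hypothetical names), Recall@K can be computed from query-video similarities, assuming the gold video for query i is stored at index i:

import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim[i, j]: similarity between text query i and video j;
    # the correct video for query i is assumed to sit at index i.
    ranks = (-sim).argsort(axis=1)           # videos sorted by descending score, per query
    gold = np.arange(sim.shape[0])[:, None]  # gold video index for each query
    pos = (ranks == gold).argmax(axis=1)     # 0-based rank of the gold video
    return {k: float((pos < k).mean()) for k in ks}

# toy usage: 4 text queries against 4 videos
sim = np.random.rand(4, 4)
print(recall_at_k(sim))  # e.g. {1: 0.25, 5: 1.0, 10: 1.0}

A common single summary number for a retrieval dataset is the average of R@1, R@5, and R@10 over all queries.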

Author Information

Linjie Li (Microsoft)
Jie Lei (Department of Computer Science, UNC Chapel Hill)
Zhe Gan (Duke University)
Licheng Yu (University of North Carolina, Chapel Hill)
Yen-Chun Chen (Microsoft)
Rohit Pillai
Yu Cheng (Microsoft Research)
Luowei Zhou (Microsoft)
Xin Wang (University of California, Santa Cruz)
William Yang Wang (University of California, Santa Barbara)

William Wang is the Co-Director of UC Santa Barbara's Natural Language Processing Group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Chair in Artificial Intelligence and Designs, and an Associate Professor in the Department of Computer Science at the University of California, Santa Barbara. He received his PhD from the School of Computer Science at Carnegie Mellon University. He has broad interests in Artificial Intelligence, including statistical relational learning, information extraction, computational social science, dialog & generation, and vision. He has published more than 100 papers at leading NLP/AI/ML conferences and journals, and has received best paper awards (or nominations) at ASRU 2013, CIKM 2013, EMNLP 2015, and CVPR 2019; a DARPA Young Faculty Award (Class of 2018); an IEEE AI's 10 to Watch Award (Class of 2020); an NSF CAREER Award (2021); two Google Faculty Research Awards (2018, 2019); three IBM Faculty Awards (2017-2019); two Facebook Research Awards (2018, 2019); an Amazon AWS Machine Learning Research Award; a JP Morgan Chase Faculty Research Award; an Adobe Research Award in 2018; and the Richard King Mellon Presidential Fellowship in 2011. He frequently serves as an Area Chair or Senior Area Chair for NAACL, ACL, EMNLP, and AAAI. He is an elected member of the IEEE Speech and Language Processing Technical Committee (2021-2023) and a member of the ACM Future of Computing Academy. In addition to research, William enjoys writing scientific articles that reach the broader online community. His work and opinions appear in major tech media outlets such as Wired, VICE, Scientific American, Fortune, Fast Company, NASDAQ, The Next Web, Law.com, and Mental Floss.

Tamara L Berg (Stony Brook University)
Mohit Bansal (UNC Chapel Hill)
Jingjing Liu (Microsoft)
Lijuan Wang
Zicheng Liu (Microsoft)
