Workshop

Workshop on Responsibly Building Next Generation of Multimodal Foundation Models

Maitreya Patel ⋅ Changhoon Kim ⋅ Siwon Kim ⋅ Chaowei Xiao ⋅ Zhe Gan ⋅ 'YZ' Yezhou Yang

Project Page [OpenReview]

Abstract

The rapid evolution of multimodal foundation models, capable of processing and generating language, images, video, and audio, has transformed numerous fields, including robotics, healthcare, and AI-driven media. However, these advancements bring forth significant challenges related to reliability, security, and societal impact. Instances of model hallucinations and the inadvertent generation of harmful content by Text-to-Image (T2I) models underscore the need for responsible and sustainable development practices.

Our workshop aims to address these critical issues by establishing design principles that prioritize precautionary measures over reactive solutions. We will explore methodologies to enhance the reliability and robustness of multimodal models, focusing on fairness, security, and the mitigation of misinformation. By emphasizing preemptive strategies during dataset curation and model pre-training, we aim to reduce the extensive resource demands traditionally associated with iterative refinement processes.

Key topics of discussion will include the identification of reliability concerns stemming from data quality, model architecture, and training strategies. Additionally, we will explore novel design principles that ensure the responsible and sustainable advancement of multimodal generative models. Our goal is to foster a collaborative environment where leading researchers and practitioners can develop actionable frameworks that align with ethical standards and maximize societal benefits.

Through keynote talks, panel discussions, and interactive sessions, this workshop will provide a comprehensive platform for the AI community to converge on best practices for building the next generation of multimodal foundation models. We seek to ensure these models are not only technologically advanced but also secure, equitable, and environmentally sustainable.

Schedule

Timezone: America/Los_Angeles

8:15 AM

Opening Remarks

Video

8:30 AM

Keynote 1: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Aniruddha Kembhavi

Video

9:10 AM

Keynote 2: Understanding How Knowledge Can Be Localized, Unlearned, or Verified in Foundation Models

Soheil Feizi

Video

9:50 AM

Poster Session 1

10:44 AM

Oral Presentations

10:45 AM

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Rylan Schaeffer ⋅ Dan Valentine ⋅ Luke Bailey ⋅ James Chua ⋅ Cristobal Eyzaguirre ⋅ Zane Durante ⋅ Joe Benton ⋅ Brando Miranda ⋅ Henry Sleight ⋅ Tony Wang ⋅ John Hughes ⋅ Rajashree Agrawal ⋅ Mrinank Sharma ⋅ Scott Emmons ⋅ Sanmi Koyejo ⋅ Ethan Perez

Video

11:00 AM

PopAlign: Population-Level Alignment for Fair Text-to-Image Generation

Shufan Li ⋅ Aditya Grover ⋅ Harkanwar Singh

11:15 AM

LLAVAGUARD: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

Lukas Helff ⋅ Felix Friedrich ⋅ Manuel Brack ⋅ Kristian Kersting ⋅ Patrick Schramowski

Video

11:30 AM

Multimodal Situational Safety

Kaiwen Zhou ⋅ Chengzhi Liu ⋅ Xuandong Zhao ⋅ Anderson Compalas ⋅ Xin Eric Wang

Video

11:45 AM

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

Wenqian Ye ⋅ Guangtao Zheng ⋅ Yunsheng Ma ⋅ Xu Cao ⋅ Bolin Lai ⋅ James Rehg ⋅ Aidong Zhang

Video

12:00 PM

Consistency-diversity-realism Pareto fronts of conditional image generative models

Pietro Astolfi ⋅ Melissa Hall ⋅ Jakob Verbeek ⋅ Marlene Careil ⋅ Oscar Mañas ⋅ Matthew Muckley ⋅ Adriana Romero ⋅ Michal Drozdzal

Video

12:15 PM

Lunch Break

1:15 PM

Keynote 3: Risk assessment, safety alignment, and guardrails for multimodal foundation models

Bo Li

Video

2:00 PM

Panel Discussion

Video

2:45 PM

Coffee Break + Poster Session 2

3:30 PM

Keynote 4: TextAttack for Improving Toxicity Detectors’ Adversarial Robustness

Yanjun Qi

Video

4:10 PM

Keynote 5: Responsibility, Robustness, and Interpretability in the era of Generative AI

David Bau

Video

4:50 PM

Closing Remarks and Awards

Video

Towards Secure and Private AI: A Framework for Decentralized Inference

Hongyang Zhang ⋅ Yue Zhao ⋅ Harry Yang ⋅ Ahmad Farhan ⋅ Fielding Johnston

Position Paper: Decentralized Frontier Risk and the No-Off Problem

Alexander Long

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Wenqi Zhang ⋅ Zhenglin Cheng ⋅ Yuanyu He ⋅ Mengna Wang ⋅ Yongliang Shen ⋅ Zeqi Tan ⋅ Guiyang Hou ⋅ Mingqian He ⋅ Yanna Ma ⋅ Weiming Lu ⋅ Yueting Zhuang

MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs

Saeid Asgari ⋅ Aliasghar Khani ⋅ Amir Khasahmadi

Skipping Computations in Multimodal LLMs

Mustafa Shukor ⋅ Matthieu Cord

Aligning to What? Limits to RLHF Based Alignment

Logan Barnhart ⋅ Reza Akbarian Bafghi ⋅ Maziar Raissi ⋅ Stephen Becker

Exploring Intrinsic Fairness in Stable Diffusion

Eunji Kim ⋅ Siwon Kim ⋅ Robin Rombach ⋅ Rahim Entezari ⋅ Sungroh Yoon

MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Mohammad Shahab Sepehri ⋅ Zalan Fabian ⋅ Maryam Soltanolkotabi ⋅ Mahdi Soltanolkotabi

Building and better understanding vision-language models: insights and future directions

Hugo Laurençon ⋅ Andrés Marafioti ⋅ Victor Sanh ⋅ Leo Tronchon

Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models

Mazda Moayeri ⋅ Samyadeep Basu ⋅ Sriram Balasubramanian ⋅ Priyatham Kattakinda ⋅ Atoosa Chegini ⋅ Robert Brauneis ⋅ Soheil Feizi

How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

Saeid Asgari ⋅ Joseph G Lambourne ⋅ Alana Mongkhounsavath

BigDocs: A Permissively-Licensed Dataset for Training Vision-Language Models on Document and Code Tasks

Juan Rodriguez ⋅ Xiangru Jian ⋅ Siba Smarak Panigrahi ⋅ Tianyu Zhang ⋅ Aarash Feizi ⋅ Abhay Puri ⋅ Akshay Kalkunte Suresh ⋅ François Savard ⋅ Amirhossein Abaskohi ⋅ Ahmed Masry ⋅ Shravan Nayak ⋅ Mahsa Massoud ⋅ Rabiul Awal ⋅ Pierre-André Noël ⋅ Mats L Richter ⋅ Saverio Vadacchino ⋅ Shubham Agarwal ⋅ Sanket Biswas ⋅ Ying Zhang ⋅ Sathwik Tejaswi Madhusudhan ⋅ Joao Monteiro ⋅ Krishnamurthy Dvijotham ⋅ Torsten Scholak ⋅ Nicolas Chapados ⋅ Sean Hughes ⋅ M. Tamer Özsu ⋅ Aishwarya Agrawal ⋅ Marco Pedersoli ⋅ Chris Pal ⋅ Perouz Taslakian ⋅ David Vazquez ⋅ Issam Hadj Laradji ⋅ Spandana Gella ⋅ Sai Rajeswar Mudumba

Trust but Verify: Reliable VLM evaluation in-the-wild with program synthesis

Viraj Uday Prabhu ⋅ Senthil Purushwalkam ⋅ Jieyu Zhang ⋅ An Yan ⋅ Caiming Xiong ⋅ Ran Xu

Comparison Visual Instruction Tuning

Wei Lin ⋅ Muhammad Jehanzeb Mirza ⋅ Sivan Doveh ⋅ Rogerio Feris ⋅ Raja Giryes ⋅ Sepp Hochreiter ⋅ Leonid Karlinsky

Adversarial Robust Deep Reinforcement Learning is Neither Robust Nor Safe

Ezgi Korkmaz

Attention Shift: Steering AI Away from Unsafe Content

Shivank Garg ⋅ Manyana Tiwari

LEMoN: Label Error Detection using Multimodal Neighbors

Haoran Zhang ⋅ Aparna Balagopalan ⋅ Nassim Oufattole ⋅ Hyewon Jeong ⋅ Yan Wu ⋅ Jiacheng Zhu ⋅ Marzyeh Ghassemi

GUIDE: A Responsible Multimodal Approach for Enhanced Glaucoma Risk Modeling and Patient Trajectory Analysis

Heman Shakeri ⋅ Behnaz Moradijamei

The Multi-faceted Monosemanticity in Multimodal Representations

Hanqi Yan ⋅ Yulan He ⋅ Yifei Wang

You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models

Eric Slyman ⋅ Anirudh Kanneganti ⋅ Sanghyun Hong ⋅ Stefan Lee

Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries

Julius Broomfield ⋅ George Ingebretsen ⋅ Reihaneh Iranmanesh ⋅ Sara Pieri ⋅ Ethan Kosak-Hine ⋅ Tom Gibbs ⋅ Reihaneh Rabbany ⋅ Kellin Pelrine

Coordinated Robustness Evaluation Framework for Vision Language Models

Ashwin Ramesh Babu ⋅ Sajad Mousavi ⋅ Desik Rengarajan ⋅ Vineet Gundecha ⋅ Sahand Ghorbanpour ⋅ Avisek Naug ⋅ Antonio Guillen-Perez ⋅ Ricardo Luna Gutierrez ⋅ Soumyendu Sarkar

WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Pavan Kalyan Tankala ⋅ Piyush Pasi ⋅ Sahil Dharod ⋅ Azeem Motiwala ⋅ Preethi Jyothi ⋅ Aditi Chaudhary ⋅ Krishna Srinivasan

Probabilistic Active Few-Shot Learning in Vision-Language Models

Anton Baumann ⋅ Marcus Klasson ⋅ Rui Li ⋅ Arno Solin ⋅ Martin Trapp

Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries

Adam Yang ⋅ Chen Chen ⋅ Konstantinos Pitas

Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

Gracjan Góral ⋅ Alicja Ziarko ⋅ Michal Nauman ⋅ Maciej Wolczyk

CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models

Guangzhi Sun ⋅ Potsawee Manakul ⋅ Adian Liusie ⋅ Kunat Pipatanakul ⋅ Chao Zhang ⋅ Phil Woodland ⋅ Mark Gales

Incorporating Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

Ce Zhang ⋅ Zifu Wan ⋅ Zhehan Kan ⋅ Martin Q. Ma ⋅ Simon Stepputtis ⋅ Deva Ramanan ⋅ Ruslan Salakhutdinov ⋅ Louis-Philippe Morency ⋅ Katia Sycara ⋅ Yaqi Xie