NeurIPS Poster Pandora's Box: Towards Building Universal Attackers against Real-World Large Vision-Language Models

Poster

Pandora's Box: Towards Building Universal Attackers against Real-World Large Vision-Language Models

Daizong Liu · Mingyu Yang · Xiaoye Qu · Pan Zhou · Xiang Fang · Keke Tang · Yao Wan · Lichao Sun

West Ballroom A-D #6104

[ Abstract ]

[ Paper] [ Slides] [ Poster] [ OpenReview]

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding tasks. Nevertheless, these models are susceptible to adversarial examples. In real-world applications, existing LVLM attackers generally rely on the detailed prior knowledge of the model to generate effective perturbations. Moreover, these attacks are task-specific, leading to significant costs for designing perturbation. Motivated by the research gap and practical demands, in this paper, we make the first attempt to build a universal attacker against real-world LVLMs, focusing on two critical aspects: (i) restricting access to only the LVLM inputs and outputs. (ii) devising a universal adversarial patch, which is task-agnostic and can deceive any LVLM-driven task when applied to various inputs. Specifically, we start by initializing the location and the pattern of the adversarial patch through random sampling, guided by the semantic distance between their output and the target label. Subsequently, we maintain a consistent patch location while refining the pattern to enhance semantic resemblance to the target. In particular, our approach incorporates a diverse set of LVLM task inputs as query samples to approximate the patch gradient, capitalizing on the importance of distinct inputs. In this way, the optimized patch is universally adversarial against different tasks and prompts, leveraging solely gradient estimates queried from the model. Extensive experiments are conducted to verify the strong universal adversarial capabilities of our proposed attack with prevalent LVLMs including LLaVA, MiniGPT-4, Flamingo, and BLIP-2, spanning a spectrum of tasks, all achieved without delving into the details of the model structures.

Chat is not available.