Poster in Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)

Fooling GPT with adversarial in-context examples for text classification

Sudhanshu Ranjan · Chung-En Sun · Linbo Liu · Lily Weng


Abstract:

Deep learning-based methods solve NLP tasks more effectively than traditional approaches, and adversarial attacks against these methods have been extensively explored. However, Large Language Models (LLMs) have introduced a new paradigm of few-shot prompting, which opens the door to novel attacks. In this study, we show that LLMs can be vulnerable to adversarial prompts. We develop the first method to attack the few-shot examples in the text classification setting. By only slightly perturbing the examples through an optimization-based procedure, we significantly degrade model performance at test time. Our method degrades performance by up to 50% without distorting the semantic meaning of the examples.
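
To make the attack setting concrete, below is a minimal, hypothetical sketch of perturbing few-shot demonstrations so that an LLM-based text classifier makes more mistakes. It uses a simple greedy word-substitution search rather than the optimization procedure developed in the paper, and `query_llm` (an LLM inference call returning a predicted label) and `synonyms` (a word-to-synonyms table) are assumed stand-ins supplied by the user.

```python
# Illustrative sketch: greedily perturb few-shot demonstrations to lower
# the accuracy of an LLM text classifier. Not the paper's optimization
# method; `query_llm` and `synonyms` are hypothetical inputs.

import random
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def build_prompt(demos: List[Example], query: str) -> str:
    """Format few-shot demonstrations followed by the test input."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)


def accuracy(demos: List[Example], dev_set: List[Example],
             query_llm: Callable[[str], str]) -> float:
    """Classification accuracy of the LLM when prompted with `demos`."""
    correct = 0
    for text, label in dev_set:
        pred = query_llm(build_prompt(demos, text))
        correct += int(pred.strip().lower() == label.lower())
    return correct / len(dev_set)


def perturb_demos(demos: List[Example], dev_set: List[Example],
                  query_llm: Callable[[str], str],
                  synonyms: Dict[str, List[str]],
                  n_rounds: int = 20, seed: int = 0):
    """Greedily swap single words in the demonstration texts, keeping any
    swap that lowers dev accuracy (i.e., strengthens the attack) while the
    text stays a near-synonymous paraphrase."""
    rng = random.Random(seed)
    best = [list(d) for d in demos]
    best_acc = accuracy([tuple(d) for d in best], dev_set, query_llm)
    for _ in range(n_rounds):
        i = rng.randrange(len(best))
        words = best[i][0].split()
        j = rng.randrange(len(words))
        if words[j].lower() not in synonyms:
            continue  # only replace words with known near-synonyms
        candidate = [list(d) for d in best]
        cand_words = candidate[i][0].split()
        cand_words[j] = rng.choice(synonyms[cand_words[j].lower()])
        candidate[i][0] = " ".join(cand_words)
        acc = accuracy([tuple(d) for d in candidate], dev_set, query_llm)
        if acc < best_acc:  # keep perturbations that hurt the model
            best, best_acc = candidate, acc
    return [tuple(d) for d in best], best_acc
```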
