
Workshop on Robustness of Zero/Few-shot Learning in Foundation Models (R0-FoMo)

Fooling GPT with adversarial in-context examples for text classification

Sudhanshu Ranjan · Chung-En Sun · Linbo Liu · Lily Weng


Deep learning-based methods solve NLP tasks more effectively than traditional approaches, and adversarial attacks against these methods have been extensively explored. Large Language Models (LLMs), however, have established a new paradigm of few-shot prompting, which opens the door to novel attacks. In this study, we show that LLMs can be vulnerable to adversarial prompts. We develop the first method to attack the few-shot examples in the text classification setting: by only slightly perturbing the examples through optimization, we significantly degrade model performance at test time. Our method achieves a performance degradation of up to 50% without distorting the semantic meaning of the examples.
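The abstract describes optimizing small perturbations to few-shot demonstrations so that the prompted model's test accuracy drops. The paper's actual optimization procedure is not reproduced here; the following is a minimal sketch of the general idea, assuming a greedy search over hand-picked token-level perturbations and a `model_accuracy` callback that stands in for querying an LLM with the perturbed prompt.

```python
import random

# Hypothetical token-level perturbations that roughly preserve meaning
# for a human reader (these substitutions are illustrative, not the
# paper's perturbation set).
SWAPS = {"great": "gre at", "terrible": "terri ble", "movie": "m0vie"}

def perturb_once(example: str, rng: random.Random) -> str:
    """Apply one small perturbation to a demonstration, if possible."""
    words = example.split()
    idxs = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not idxs:
        return example  # nothing left to perturb
    i = rng.choice(idxs)
    words[i] = SWAPS[words[i].lower()]
    return " ".join(words)

def attack(demos, model_accuracy, steps=20, seed=0):
    """Greedy search: keep only perturbations that lower accuracy.

    `model_accuracy(demos)` should build a few-shot prompt from the
    demonstrations, run the target model on a held-out test set, and
    return its accuracy. Here it is a caller-supplied stand-in.
    """
    rng = random.Random(seed)
    best = list(demos)
    best_acc = model_accuracy(best)
    for _ in range(steps):
        cand = list(best)
        j = rng.randrange(len(cand))
        cand[j] = perturb_once(cand[j], rng)
        acc = model_accuracy(cand)
        if acc < best_acc:  # keep the edit only if it is harmful
            best, best_acc = cand, acc
    return best, best_acc
```

In practice the scoring callback would query the LLM, which is the expensive part; a real attack would also constrain perturbations to stay semantically faithful, as the abstract emphasizes.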
