Shh, don't say that! Domain Certification in LLMs
Abstract
Foundation language models, such as Llama, are often deployed in constrained environments. For instance, a customer support bot may use a large language model (LLM) as its backbone because of the model's broad language comprehension, which typically improves downstream performance. However, these LLMs are susceptible to adversarial attacks and may generate outputs outside the intended target domain. To formalize, assess, and mitigate this risk, we introduce \emph{domain certification}: a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose an algorithm that provides adversarial bounds as a certificate. Finally, we evaluate our method across several datasets and models, demonstrating that it yields meaningful certificates.