Shh, don't say that! Domain Certification in LLMs
Abstract
Foundation language models, such as Llama, are often deployed in constrained environments. For instance, a customer support bot may use a large language model (LLM) as its backbone because of the model's broad language comprehension, which typically improves downstream performance. However, these LLMs are susceptible to adversarial attacks and may generate outputs outside the intended target domain. To formalize, assess, and mitigate this risk, we introduce \emph{domain certification}: a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose an algorithm that provides adversarial bounds as a certificate. Finally, we evaluate our method across several datasets and models, demonstrating that it yields meaningful certificates.