Workshop: Workshop on Machine Learning Safety

Can Large Language Models Truly Follow your Instructions?

Joel Jang · Seonghyeon Ye · Minjoon Seo


In this work, to test the capabilities of large language models on truly following the given instructions, we evaluate 9 common NLP benchmarks with negated instructions on (1) pretrained LMs (OPT \& GPT-3) of varying sizes (125M - 175B), (2) LMs further pretrained to generalize to novel instructions (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated instructions; all LM types perform worse on negated instructions as they scale and show a huge performance gap between the human performance when comparing the average score on both original and negated instructions. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches to developing LMs that actually follow the given instructions in order to prevent catastrophic consequences that may occur if we prematurely endow LMs with real-world responsibilities.

Chat is not available.