Poster
Stealth edits to large language models
Oliver Sutton · Qinghua Zhou · Wei Wang · Desmond Higham · Alexander N Gorban · Alexander Bastounis · Ivan Tyukin
East Exhibit Hall A-C #4410
We present a new, computationally efficient method for selectively editing large language models without retraining. This offers a potential way forward for correcting hallucinations, but also reveals a previously unrecognised vulnerability in many state-of-the-art families of large language models. At the heart of the method is a mechanism that harnesses the inherent non-linearity and high dimensionality of the large language model to directly control the selectivity of the edit. Surprisingly, this can be done without accessing the knowledge the model acquired through training, and without modifying the original training set. We reveal a fundamental metric that determines whether a specific language model can be edited, defined by a separability-based notion of the intrinsic dimension of the model's feature space. Extensive experimental results illustrate and support the method and its theoretical underpinnings.
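As a rough illustration of the separability-based notion of intrinsic dimension mentioned above, the sketch below implements a simplified Fisher-separability-style dimension estimate in Python. The function name separability_dimension, the threshold alpha, and the inversion formula are assumptions chosen for illustration; the paper defines its own precise metric, which this does not claim to reproduce.

```python
import numpy as np


def separability_dimension(features: np.ndarray, alpha: float = 0.8) -> float:
    """Crude separability-based intrinsic dimension estimate (illustrative).

    Centres the feature cloud, projects it onto the unit sphere, and
    measures the fraction of ordered pairs (x, y) that are *not*
    separable at threshold alpha, i.e. <x, y> > alpha.  For points
    drawn uniformly from an n-sphere this fraction decays roughly like
    (1 - alpha^2)^(n/2), so inverting that relation gives an estimate
    of n.  This is a simplified variant of Fisher-separability
    estimators, not necessarily the exact metric in the paper.
    """
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                            # centre the cloud
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # map to unit sphere
    G = X @ X.T                                       # pairwise inner products
    np.fill_diagonal(G, -np.inf)                      # exclude self-pairs
    m = X.shape[0]
    p = (G > alpha).sum() / (m * (m - 1))             # inseparable-pair rate
    p = max(p, 1e-12)                                 # guard against log(0)
    return 2.0 * np.log(p) / np.log(1.0 - alpha**2)


# Example: a 5-dimensional Gaussian cloud embedded linearly in 512
# ambient dimensions should yield an estimate far below 512 and in the
# vicinity of the latent dimension.
rng = np.random.default_rng(0)
latent = rng.standard_normal((2000, 5))
embed = rng.standard_normal((5, 512))
print(separability_dimension(latent @ embed))
```

The design point this illustrates is the one the abstract relies on: selectivity of an edit is governed not by the ambient width of the feature space but by how separable individual feature vectors are from the rest, which the intrinsic dimension summarises.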