NeurIPS Poster Explanations that reveal all through the definition of encoding

Poster

Explanations that reveal all through the definition of encoding

Aahlad Manas Puli · Nhi Nguyen · Rajesh Ranganath

East Exhibit Hall A-C #3109

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Feature attributions attempt to highlight what inputs drive predictive power. Good attributions or explanations are thus those that produce inputs that retain this predictive power; accordingly, evaluations of explanations score their quality of prediction. However, evaluations produce scores better than what appears possible from the values in the explanation for a class of explanations, called encoding explanations. Probing for encoding remains a challenge because there is no general characterization of what gives the extra predictive power. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. This definition implies, in contrast to encoding explanations, that non-encoding explanations contain all the informative inputs used to produce the explanation, giving them a “what you see is what you get” property, which makes them transparent and simple to use. Next, we prove that existing scores (ROAR, FRESH, EVAL-X) do not rank non-encoding explanations above encoding ones, and develop STRIPE-X which ranks them correctly. After empirically demonstrating the theoretical insights, we use STRIPE-X to show that despite prompting an LLM to produce non-encoding explanations for a sentiment analysis task, the LLM-generated explanations encode.
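
As a rough illustration of the "extra predictive power via conditional dependence" idea, the toy sketch below builds a three-feature binary problem in which an explanation picks its feature based on an input it never shows. Everything here is an assumption made for illustration (the data-generating process, the names `label`, `explain`, and `prob_label_is_one`); it is not the paper's formal definition, its experiments, or STRIPE-X. It only shows that the label can depend on which feature was selected even after conditioning on the selected value.

```python
from fractions import Fraction
from itertools import product

# Toy setup (assumed for illustration): three binary inputs, uniformly distributed.
# The hidden "control" input x1 decides which of x2 or x3 determines the label.
inputs = list(product([0, 1], repeat=3))


def label(x):
    """Label equals x2 when x1 == 1, otherwise x3."""
    x1, x2, x3 = x
    return x2 if x1 == 1 else x3


def explain(x):
    """Hypothetical explanation: returns the index of the single selected feature.

    It selects x2 (index 1) when x1 == 1 and x3 (index 2) otherwise,
    but never shows x1 itself, so the choice of index leaks x1.
    """
    return 1 if x[0] == 1 else 2


def prob_label_is_one(condition):
    """Exact P(y = 1 | condition) under the uniform distribution over inputs."""
    matching = [x for x in inputs if condition(x)]
    return Fraction(sum(label(x) for x in matching), len(matching))


# What the value alone supports: P(y = 1 | x2 = 1).
p_value_only = prob_label_is_one(lambda x: x[1] == 1)

# What the value plus the selection supports:
# P(y = 1 | x2 = 1, the explanation selected x2).
p_value_and_selection = prob_label_is_one(
    lambda x: x[1] == 1 and explain(x) == 1
)

print(p_value_only)           # 3/4
print(p_value_and_selection)  # 1
```

In this toy, the gap between 3/4 and 1 is predictive power that comes from the selection rather than from the displayed value, which is the kind of dependence the abstract attributes to encoding explanations.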
