Skip to yearly menu bar Skip to main content

Workshop: Attributing Model Behavior at Scale (ATTRIB)

Efficient Data Valuation for Weighted Nearest Neighbor Algorithms

Jiachen (Tianhao) Wang · Ruoxi Jia

Abstract: Data Shapley is a principled way to assess the importance of individual training data sources for machine learning (ML) applications. However, it often comes with computational challenges in calculating exact Data Shapley scores. KNN-Shapley \citep{jia2019efficient}, which assigns data value leveraging the efficiently computable Data Shapley score of $K$ nearest neighbors (KNN), has gained popularity as a viable alternative due to its computationally efficient nature. However, \cite{jia2019efficient} only gives a practical algorithm for computing Data Shapley for unweighted KNN, but weighted KNN is more prevalently used in practice. This work addresses the computational challenges of calculating the exact Data Shapley for weighted KNN classifiers (WKNN-Shapley). By making small adjustments to KNN configurations, we recast the computation of WKNN-Shapley into a counting problem and introduce an $O(K^2 N^2)$ algorithm, presenting a notable improvement from the naive, impractical $O(N^K)$ algorithm. We also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. These advancements position WKNN-Shapley as a compelling alternative to KNN-Shapley. In particular, WKNN-Shapley can select high-quality data points and improve the performance of retrieval-augmented language models.

Chat is not available.