
Off-Policy Evaluation with Deficient Support Using Side Information

Nicolò Felicioni · Maurizio Ferrari Dacrema · Marcello Restelli · Paolo Cremonesi

Hall J (level 1) #736

Keywords: [ recommendation systems ] [ contextual bandits ] [ Deficient Support ] [ Inverse Propensity Score ] [ Off-Policy Evaluation ] [ Importance Sampling ]


The Off-Policy Evaluation (OPE) problem consists in evaluating the performance of new policies from the data collected by another one. OPE is crucial when evaluating a new policy online is too expensive or risky. Many of the state-of-the-art OPE estimators are based on the Inverse Propensity Scoring (IPS) technique, which provides an unbiased estimator when the full support assumption holds, i.e., when the logging policy assigns a non-zero probability to each action. However, there are several scenarios where this assumption does not hold in practice, i.e., there is deficient support, and the IPS estimator is biased in the general case. In this paper, we consider two alternative estimators for the deficient support OPE problem. We first show how to adapt an estimator that was originally proposed for a different domain to the deficient support setting. Then, we propose another estimator, which is a novel contribution of this paper. These estimators exploit additional information about the actions, which we call side information, in order to make reliable estimates on the unsupported actions. Under alternative assumptions that do not require full support, we show that the considered estimators are unbiased. We also provide a theoretical analysis of the concentration when relaxing all the assumptions. Finally, we provide an experimental evaluation showing how the considered estimators are better suited for the deficient support setting compared to the baselines.
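
To make the bias under deficient support concrete, the following is a minimal sketch of the standard IPS estimator only; it is not one of the estimators considered in the paper and uses no side information. It assumes a context-free bandit with three actions, Bernoulli rewards, and a logging policy that never plays the last action. All names (`ips_estimate`, `expected_ips`, `simulate_logs`) and all numbers are hypothetical, chosen purely for illustration.

```python
import random


def ips_estimate(logs, target_policy):
    """Standard IPS estimate from logs of (action, reward, logging_propensity)."""
    total = 0.0
    for action, reward, propensity in logs:
        # Reweight each logged reward by the target/logging probability ratio.
        total += target_policy[action] / propensity * reward
    return total / len(logs)


def true_value(target_policy, mean_reward):
    """Exact value of the target policy: sum over ALL actions."""
    return sum(p * mean_reward[a] for a, p in enumerate(target_policy))


def expected_ips(target_policy, logging_policy, mean_reward):
    """Expectation of the IPS estimate: only supported actions contribute."""
    return sum(
        target_policy[a] * mean_reward[a]
        for a in range(len(target_policy))
        if logging_policy[a] > 0
    )


def simulate_logs(logging_policy, mean_reward, n, rng):
    """Draw n logged interactions with Bernoulli rewards (illustrative setup)."""
    actions = list(range(len(logging_policy)))
    logs = []
    for _ in range(n):
        a = rng.choices(actions, weights=logging_policy)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        logs.append((a, r, logging_policy[a]))
    return logs


# Hypothetical numbers: action 2 is unsupported by the logging policy.
LOGGING_POLICY = [0.5, 0.5, 0.0]
TARGET_POLICY = [0.2, 0.3, 0.5]
MEAN_REWARD = [0.1, 0.4, 0.8]

if __name__ == "__main__":
    rng = random.Random(0)
    logs = simulate_logs(LOGGING_POLICY, MEAN_REWARD, 20000, rng)
    print("IPS estimate:", ips_estimate(logs, TARGET_POLICY))
    print("Expected IPS:", expected_ips(TARGET_POLICY, LOGGING_POLICY, MEAN_REWARD))
    print("True value:  ", true_value(TARGET_POLICY, MEAN_REWARD))
```

In this toy setup the gap between `expected_ips` and `true_value` equals the target policy's expected reward on the unsupported action, and collecting more logged data does not shrink it. This is the kind of gap that estimators for the deficient support setting aim to close.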