Poster in Workshop: Workshop on Machine Learning Safety
Bandits with Costly Reward Observations
Aaron Tucker · Caleb Biddulph · Claire Wang · Thorsten Joachims
Abstract:
Many machine learning applications rely on clever ways of constructing a large dataset from already existing sources in order to avoid the cost of labeling training examples. However, in settings such as content moderation, where distributions change rapidly and there is no automatic ground-truth feedback, this may not be possible. If an algorithm has to pay for reward information, for example by asking a person for feedback, how does this change the exploration/exploitation tradeoff? We study Bandits with Costly Reward Observations, where a cost must be paid in order to observe the reward of the bandit's action. We show the impact of the observation cost on regret by proving a lower bound, present a general non-adaptive algorithm that matches the lower bound, and present several competitive adaptive algorithms.
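The setting can be made concrete with a small simulation. The sketch below is not the paper's algorithm; it is a minimal explore-then-commit baseline on a Bernoulli bandit in which every observed reward incurs an assumed observation cost `obs_cost`, illustrating how paying for feedback enters the regret. All function names and parameters are illustrative.

```python
import numpy as np

def explore_then_commit_with_costs(arm_means, horizon, obs_cost, n_explore_per_arm, seed=None):
    """Explore-then-commit on a Bernoulli bandit where observing a reward costs `obs_cost`.

    During exploration the learner pulls each arm `n_explore_per_arm` times and pays
    `obs_cost` for every reward it observes; afterwards it commits to the empirically
    best arm and stops paying for observations. Returns total regret including
    observation costs (illustrative accounting, not the paper's definition).
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    best_mean = max(arm_means)

    regret = 0.0
    estimates = np.zeros(k)

    # Exploration phase: observe rewards and pay the observation cost for each one.
    for arm, mean in enumerate(arm_means):
        rewards = rng.binomial(1, mean, size=n_explore_per_arm)
        estimates[arm] = rewards.mean()
        regret += n_explore_per_arm * (best_mean - mean) + n_explore_per_arm * obs_cost

    # Commit phase: play the empirically best arm without observing any rewards.
    committed = int(np.argmax(estimates))
    remaining = horizon - k * n_explore_per_arm
    regret += remaining * (best_mean - arm_means[committed])
    return regret

if __name__ == "__main__":
    print(explore_then_commit_with_costs([0.4, 0.5, 0.6], horizon=10_000,
                                         obs_cost=1.0, n_explore_per_arm=200, seed=0))
```

Such a non-adaptive scheme fixes its observation budget in advance; the adaptive algorithms mentioned in the abstract would instead decide online when further observations are worth their cost.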