Multi-armed bandit algorithms like Thompson Sampling (TS) can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign more participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one, and failing to conclude there is a difference between arms when there truly is one. We present simulations for 2-arm experiments that explore two algorithms that combine the benefits of uniform randomization for statistical analysis with the benefits of reward maximization achieved by Thompson Sampling. First, Top-Two Thompson Sampling adds a fixed amount of uniform random (UR) allocation spread evenly over time. Second, we introduce a novel heuristic algorithm called TS-PostDiff (Posterior Probability of Difference). TS-PostDiff takes a Bayesian approach to mixing TS and UR: the probability a participant is assigned using UR allocation is the posterior probability that the difference between the two arms is 'small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We find that the TS-PostDiff method performs well across multiple effect sizes, and thus does not require tuning based on a guess at the true effect size.
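The abstract describes TS-PostDiff only at a high level. The following is a minimal sketch, under stated assumptions, of how such a mixing rule could be implemented for a 2-arm Bernoulli experiment: the function name ts_postdiff_assign, the Beta(1,1) priors, the Monte Carlo estimate of the posterior probability, and the threshold value 0.1 are illustrative choices, not details taken from the paper.

```python
import numpy as np

def ts_postdiff_assign(successes, failures, threshold=0.1,
                       n_draws=10_000, rng=None):
    """Choose an arm for the next participant in a 2-arm Bernoulli experiment.

    Sketch of the TS-PostDiff heuristic: with probability equal to the
    posterior probability that |p0 - p1| < threshold, assign uniformly at
    random; otherwise assign via Thompson Sampling.
    """
    rng = rng or np.random.default_rng()
    # Beta posteriors for each arm, assuming a Beta(1, 1) prior.
    theta = rng.beta(1 + successes, 1 + failures, size=(n_draws, 2))
    # Monte Carlo estimate of the posterior probability that the
    # difference between arms is 'small' (below the threshold).
    p_small_diff = np.mean(np.abs(theta[:, 0] - theta[:, 1]) < threshold)
    if rng.random() < p_small_diff:
        return int(rng.integers(2))          # uniform random allocation
    draw = rng.beta(1 + successes, 1 + failures)  # one TS posterior draw
    return int(np.argmax(draw))              # Thompson Sampling allocation

# Example usage with hypothetical counts: 40 participants per arm so far.
arm = ts_postdiff_assign(successes=np.array([12, 20]),
                         failures=np.array([28, 20]))
```

When the posterior mass concentrates away from a small difference, p_small_diff shrinks toward zero and the rule behaves like pure TS; when the arms look similar, it reverts toward uniform randomization, which matches the behavior the abstract describes.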
Author Information
Jacob Nogas (University of Toronto)
Arghavan Modiri (University of Toronto)
Sofia Villar (MRC Biostatistics Unit, University of Cambridge)
Audrey Durand (Université Laval)
Anna Rafferty (Carleton College)
Joseph Williams (National University of Singapore)