
Workshop: Intrinsically Motivated Open-ended Learning (IMOL) Workshop

Children prioritize purely exploratory actions in observe-vs.-bet tasks

Eunice Yiu · Kai Sandbrink · Alison Gopnik

Keywords: [ exploitation ] [ children ] [ Reinforcement Learning ] [ Exploration ]


In reinforcement learning, agents often need to decide between selecting actions that are familiar and have previously yielded positive results (exploitation) and seeking new information that could allow them to uncover more effective actions (exploration). Understanding how humans learn their sophisticated exploratory strategies over the course of their development remains an open question for both computer and cognitive science. Existing studies typically use classic bandit or gridworld tasks that confound the rewarding with the informative characteristics of an outcome. In this study, we adopt an observe-vs.-bet task that separates “pure exploration” from “pure exploitation” by giving participants the option either to observe an instance of an outcome and receive no reward, or to bet on one action that is eventually rewarding but offers no immediate feedback. We collected data from 33 five-to-seven-year-old children who completed the task at one of three different bias levels. We compared the children’s performance with both approximate solutions to the partially observable Markov decision process and meta-reinforcement learning models that were meta-trained on the same decision-making task across different probability levels. We found that the children observe significantly more than the two classes of algorithms and qualitatively more than adults in similar tasks. We then quantified how children’s policies differ between the different efficacy levels by fitting probabilistic programming models and by calculating the likelihood of the children’s actions under the task-driven model. 
The fitted parameters of the behavioral model, as well as the direction of the deviation from neural network policies, demonstrate that the primary way children adapt their behavior is by changing the amount of time that they bet on the most-recently-observed arm while maintaining a consistent frequency of observations across bias levels. This suggests both that children model the causal structure of the environment and that they engage in a “hedging behavior” that would be impossible to detect in standard bandit tasks. The results shed light on how children reason about reward and information, providing an important developmental benchmark that can help shape our understanding of human behavior. We hope to investigate this further using recently developed neural network reinforcement learning models of reasoning about information and reward.
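
To make the task structure concrete, the following is a minimal sketch of an observe-vs.-bet environment with two arms, together with a simple heuristic that observes with a fixed probability and otherwise bets on the most recently observed arm. It is an illustration under stated assumptions, not the authors' task code or behavioral model: the class `ObserveBetTask`, the function `run_episode`, the parameter `p_observe`, the trial count, and the bias value are all hypothetical choices made here.

```python
import random
from dataclasses import dataclass, field

OBSERVE = -1  # hypothetical action code; bets are encoded as arm indices 0 or 1


@dataclass
class ObserveBetTask:
    """Two-arm observe-vs.-bet task (illustrative assumption, not the study's code)."""
    bias: float           # probability that arm 0 is the outcome on a given trial
    n_trials: int = 50    # assumed episode length
    seed: int = 0
    hidden_reward: int = 0  # accumulated reward, never shown during the episode
    t: int = 0
    rng: random.Random = field(init=False)

    def __post_init__(self):
        self.rng = random.Random(self.seed)

    def step(self, action):
        """Advance one trial. Observing reveals the outcome but earns nothing;
        betting may earn hidden reward but returns no feedback."""
        outcome = 0 if self.rng.random() < self.bias else 1
        self.t += 1
        if action == OBSERVE:
            return outcome
        if action == outcome:
            self.hidden_reward += 1
        return None


def run_episode(task, p_observe, policy_rng):
    """Heuristic policy: observe with probability p_observe, otherwise bet on
    the most recently observed arm. The first trial is always an observation."""
    last_seen = None
    n_observe = 0
    while task.t < task.n_trials:
        if last_seen is None or policy_rng.random() < p_observe:
            last_seen = task.step(OBSERVE)
            n_observe += 1
        else:
            task.step(last_seen)
    return n_observe, task.hidden_reward


if __name__ == "__main__":
    task = ObserveBetTask(bias=0.8)
    n_obs, reward = run_episode(task, p_observe=0.3, policy_rng=random.Random(1))
    print(f"observations: {n_obs}, hidden reward: {reward}")
```

Because observing and betting are separate actions, the number of observations in an episode measures exploration directly, which is what a standard bandit task cannot provide. Varying `bias` while holding `p_observe` fixed gives a rough analogue of a policy that keeps its observation frequency constant across bias levels.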
