Sequential decisionmaking problems appear in settings as varied as healthcare, ecommerce, operations management, and policymaking, and depending on the context these can have very varied features that make each problem unique. Problems can involve online learning or offline data, known cost structures or unknown counterfactuals, continuous actions with or without constraints or finite or combinatorial actions, stationary environments or environments with dynamic agents, utilitarian considerations or fairness or equity considerations. More and more, causal inference and discovery and adjacent statistical theories have come to bear on such problems, from the early work on longitudinal causal inference from the last millenium up to recent developments in bandit algorithms and inference, dynamic treatment regimes, both online and offline reinforcement learning, interventions in general causal graphs and discovery thereof, and more. While the interaction between these theories has grown, expertise is spread across many different disciplines, including CS/ML, (bio)statistics, econometrics, ethics/law, and operations research.
The primary purpose of this workshop is to convene both experts, practitioners, and interested young researchers from a wide range of backgrounds to discuss recent developments around causal inference in sequential decision making and the avenues forward on the topic, especially ones that bring together ideas from different fields. The allvirtual nature of this year's NeurIPS workshop makes it particularly felicitous to such an assembly. The workshop will combine invited talks and panels by a diverse group of researchers and practitioners from both academia and industry together with contributed talks and townhall Q\&A that will particularly seek to draw from younger individuals new to the area.
Tue 10:50 a.m.  11:00 a.m.

Opening Remarks
SlidesLive Video » This is a placeholder added by NeurIPS to define the start time of the workshop. 
🔗 
Tue 11:00 a.m.  11:30 a.m.

TBD (Elias Bareibnboim)
(
Live Talk and Q&A
)
SlidesLive Video » 
Elias Bareinboim 🔗 
Tue 11:30 a.m.  12:00 p.m.

Sequential Adaptive Designs for Learning Optimal Individualized Treatment Rules with Formal Inference (Mark van der Laan)
(
Live Talk and Q&A
)
SlidesLive Video » In this talk we review various of our contributions on sequential adaptive designs. Firstly, we review sequential targeted adaptive designs in which one adapts to complete data records of previously enrolled subjects, thereby relying on shortterm clinical outcomes. In particular, we show how TMLE and online superlearning can be used to preserve unbiased formal inference in such adaptive designs. We demonstrate the power of this type of design for optimizing treatment for sepsis patients. Secondly, we discuss the natural extension of these sequential adaptive designs in continuous time in which case one adapts to all previously collected data, including many partially observed data structures of previously enrolled subjects. In this setting, we emphasize the utilization of timespecific surrogate outcomes as a way to still have power to learn powerful optimal rules while adapting over time the choice of surrogate so that we best approximate the optimal treatment rule w.r.t. the final clinical outcome. We show some simulation results demonstrating the performance of such adaptive designs in continuous time. Finally, we present sequential adaptive designs for a single time series in which one learns the optimal rule for maximizing the next short term outcome, again using TMLE and online super learning to allow for formal statistical inference. It is shown how all formal results rely on theory of TMLE and martingale processes. This is a join work with: Antoine Chambaz, Wenjing Zheng, Ivana Malenica, Aurelien Bibaut, Aaron Hudson, Wenxin Zhang. 
Mark van der Laan 🔗 
Tue 12:00 p.m.  12:30 p.m.

Confident OffPolicy Evaluation and Selection through SelfNormalized Importance Weighting (Claire Vernade)
(
Live Talk and Q&A
)
SlidesLive Video » We consider offpolicy evaluation in the contextual bandit setting for the purpose of obtaining a robust offpolicy selection strategy, where the selection strategy is evaluated based on the value of the chosen policy in a set of proposal (target) policies. We propose a new method to compute a lower bound on the value of an arbitrary target policy given some logged data in contextual bandits for a desired coverage. The lower bound is built around the socalled Selfnormalized Importance Weighting (SN) estimator. It combines the use of a semiempirical EfronStein tail inequality to control the concentration and Harris' inequality to control the bias. The new approach is evaluated on a number of synthetic and real datasets and is found to be superior to its main competitors, both in terms of tightness of the confidence intervals and the quality of the policies chosen. 
Claire Vernade 🔗 
Tue 12:30 p.m.  1:10 p.m.

Panel Discussion
(
Discussion Panel
)
SlidesLive Video » 
Elias Bareinboim · Mark van der Laan · Claire Vernade 🔗 
Tue 1:20 p.m.  2:20 p.m.

Poster Presentation link »  🔗 
Tue 2:20 p.m.  2:40 p.m.

Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning (Guy Tennenholtz)
(
Oral Presentation
)
SlidesLive Video » 
Guy Tennenholtz 🔗 
Tue 2:40 p.m.  3:00 p.m.

MAGNET: MultiAgent Graph Cooperative Bandits (Hengrui Cai)
(
Oral Presentation
)
SlidesLive Video » 
Hengrui Cai 🔗 
Tue 3:00 p.m.  3:30 p.m.

(un)fairness in sequential decision making as a challenge (Razieh Nabi)
(
Live Talk and Q&A
)
SlidesLive Video » In this talk, we focus on automated sequential decision making which are increasingly used in sociallyimpactful settings like social welfare,hiring, criminal justice, etc. A particular challenge in the context of automated sequential decision making is to avoid the “perpetuation of injustice,” i.e., when maximizing utility maintains, reinforces, or even introduces unfair dependencies between sensitive features, decisions, and outcomes. In this talk, we show how to use methods from causal inference and constrained optimization to make optimal but fair decisions that would “break the cycle of injustice” by correcting for the unfair dependence of both decisions and outcomes on sensitive features. 
Razieh Nabi 🔗 
Tue 3:30 p.m.  4:00 p.m.

OffPolicy Confidence Interval Estimation with Confounded Markov Decision Process (Rui Song)
(
Live Talk and Q&A
)
SlidesLive Video » In this talk, we consider constructing a confidence interval for a target policy’s value offline based on precollected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. We show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy’s value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient offpolicy value estimator that is robust to potential model misspecification and provides rigorous uncertainty quantification. 
Rui Song 🔗 
Tue 4:00 p.m.  4:30 p.m.

TALK (Susan Athey)
(
Live Talk and Q&A
)
SlidesLive Video » 
Susan Athey 🔗 
Tue 4:30 p.m.  5:10 p.m.

Panel Discussion
SlidesLive Video » 
Susan Athey · Rui Song · Razieh Nabi 🔗 
Tue 5:30 p.m.  5:50 p.m.

What Would the Expert $do(\cdot)$?: Causal Imitation Learning (Gokul Swamy)
(
Oral Presentation
)
SlidesLive Video » 
Gokul Swamy 🔗 
Tue 5:50 p.m.  6:10 p.m.

The Limits to Learning a Diffusion Model (Andy Zheng)
(
Oral Presentation
)
SlidesLive Video » 
Andrew Zheng 🔗 
Tue 6:10 p.m.  6:30 p.m.

DeviationBased Learning (Komiyama Junpei)
(
Oral Presentation
)
SlidesLive Video » 
Junpei Komiyama 🔗 
Tue 6:30 p.m.  6:40 p.m.

Closing Remarks
(
Remarks
)
This is a placeholder event added by NeurIPS to define the end of the workshop. 
🔗 
Tue 6:40 p.m.  7:30 p.m.

Poster Presentation link »  🔗 


Bandits with Partially Observable Confounded Data
(
Poster
)
We study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds. Our results demonstrate the ability to take advantage of confounded offline data. Particularly, we prove regret bounds that improve current bounds by a factor related to the visible dimensionality of the contexts in the data. Our results indicate that confounded offline data can significantly improve online learning algorithms. Finally, we demonstrate various characteristics of our approach through synthetic simulations. 
Guy Tennenholtz · Uri Shalit · Shie Mannor · Yonathan Efroni 🔗 


DeviationBased Learning
(
Poster
)
We propose deviationbased learning, a new approach to training recommender systems. In the beginning, the recommender and rational users have different pieces of knowledge, and the recommender needs to learn the users' knowledge to make better recommendations. The recommender learns users' knowledge by observing whether each user followed or deviated from her recommendations. We show that learning frequently stalls if the recommender always recommends a choice: users tend to follow the recommendation blindly, and their choices do not reflect their knowledge. Social welfare and the learning rate are improved drastically if the recommender abstains from recommending a choice when she predicts that multiple arms will produce a similar payoff. 
Junpei Komiyama · Shunya Noda 🔗 


MAGNET: MultiAgent Graph Cooperative Bandits
(
Poster
)
We consider the online optimization for multiple interactive agents who cooperate to achieve a global best reward. By interaction, one agent's action and reward will influence other agents' rewards. Yet, the existing cooperative bandits primarily assume that agents are independent of each other. The multiagent reinforcement learning methods, on the other hand, mainly rely on deep learning, and thus hard to interpret or achieve a theoretical guarantee. In this work, we leverage the idea from the structural equation model to characterize different interactions among agents simultaneously, and propose the multiagent graph cooperative bandits (MAGNET) that integrates techniques from online decision making, graph structure modeling, and multiagent system. We allow heterogeneous individual rewards in the cooperative agents, and the interaction information partially known or completely unknown. Depending on the scenario if the global objective is known or not, the framework consists of a global optimizer for a central controller, and a local optimizer for decentralized agents, respectively. We derive the regret bound for the global optimizer as $\mathcal{O}( \sqrt{KT})$ and for the local optimizer as $\mathcal{O}(K \sqrt{T})$. Extensive simulation studies show the empirical validity of the proposed methods.

Hengrui Cai · Rui Song 🔗 


On Adaptivity and Confounding in Contextual Bandit Experiments
(
Poster
)
Multiarmed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when dayofweek effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action  suggesting optimal adaptivity  but also satisfies strong performance guarantees in the presence of dayofweek effects and delayed observations  suggesting unusual robustness. 
Chao Qin · Daniel Russo 🔗 


Doubly robust confidence sequences
(
Poster
)
This paper derives timeuniform confidence sequences (CS) for causal effects in experimental and observational settings. A confidence sequence for a target parameter $\psi$ is a sequence of confidence intervals $(C_t)_{t=1}^\infty$ such that every one of these intervals simultaneously captures $\psi$ with high probability. Such CSs provide valid statistical inference for $\psi$ at arbitrary stopping times, unlike classical fixedtime confidence intervals which require the sample size to be fixed in advance. Existing methods for constructing CSs focus on the nonasymptotic regime where certain assumptions (such as known bounds on the random variables) are imposed, while doubly robust estimators of causal effects rely on (asymptotic) semiparametric theory. We use sequential versions of central limit theorem arguments to construct largesample CSs for causal estimands, with a particular focus on the average treatment effect (ATE) under nonparametric conditions. These CSs allow analysts to update inferences about the ATE in lieu of new data, and experiments can be continuously monitored, stopped, or continued for any datadependent reason, all while controlling the typeI error. Finally, we describe how these CSs readily extend to other causal estimands and estimators, providing a new framework for sequential causal inference in a wide array of problems.

Ian WaudbySmith · David Arbour · Ritwik Sinha · Edward Kennedy · Aaditya Ramdas 🔗 


A Causalitybased Graphical Test to obtain an Optimal Blocking Set for Randomized Experiments
(
Poster
)
Randomized experiments are often performed to study the causal effects of interest. Blocking is a technique to precisely estimate the causal effects when the experimental material is not homogeneous. We formalize the problem of obtaining a statistically optimal set of covariates to be used to create blocks while performing a randomized experiment. We provide a graphical test to obtain such a set for a general semiMarkovian causal model. We also propose and provide ideas towards solving a more general problem of obtaining an optimal blocking set that considers both the statistical and economic costs of blocking. 
Abhishek Kumar Umrawal 🔗 


Kernel Methods for Multistage Causal Inference: Mediation Analysis and Dynamic Treatment Effects
(
Poster
)
We propose kernel ridge regression estimators for causal inference in multistage settings. Specifically, we focus on (i) the decomposition of a total effect into a direct effect and an indirect effect (mediated by a particular mechanism); and (ii) effects of sequences of treatments. We allow treatment, covariates, and mediators to be discrete or continuous, and low, high, or infinite dimensional. We propose estimators of means, increments, and distributions of counterfactual outcomes. Important examples are (i) direct and indirect dose response curves; and (ii) dynamic dose response curves. Each estimator has a closed form solution and is easily computed by kernel matrix operations. For the nonparametric case, we prove uniform consistency and provide finite sample rates of convergence. For the semiparametric case, we prove rootn consistency, Gaussian approximation, and semiparametric efficiency. We evaluate our estimators in simulations then estimate mediated and dynamic treatment effects of the US Job Corps training program for disadvantaged youth. 
Rahul Singh · Ritsugen Jo · Arthur Gretton 🔗 


Reinforcement Learning in RewardMixing MDPs
(
Poster
)
Learning a near optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a rewardmixing Markov decision process (MDP). There, a reward function is drawn from one of $M$ possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near optimal policy for two rewardmixing MDPs. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching rewardmodels, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomialtime algorithm that finds an $\epsilon$optimal policy after exploring $\tilde{O}(poly(H,\epsilon^{1}) \cdot S^2 A^2)$ episodes, where $H$ is timehorizon and $S, A$ are the number of states and actions respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.

Jeongyeol Kwon · Yonathan Efroni · Constantine Caramanis · Shie Mannor 🔗 


Chronological Causal Bandit
(
Poster
)
This paper studies an instance of the multiarmed bandit (\mab) problem, specifically where several causal \textsc{mab}s operate chronologically in the same dynamical system. Practically the reward distribution of each bandit is governed by the same nontrivial dependence structure, which is a dynamic causal model. Dynamic because we allow for each causal \mab to depend on the preceding \mab and in doing so are able to transfer information between both agents. Our contribution, the Chronological Causal Bandit (\ccb), is useful in discrete decisionmaking settings where the causal effects are changing across time and can be informed by earlier interventions in the same system. In this paper we present some early findings of the \ccb as demonstrated on a toy problem. 
Neil Dhir 🔗 


Causal MultiAgent Reinforcement Learning: Review and Open Problems
(
Poster
)
This paper serves to introduce the reader to the field of multiagent reinforcement learning (MARL) and its intersection with methods from the study of causality. We highlight key challenges in MARL and discuss these in the context of how causal methods may assist in tackling them. We promote moving toward a 'causality first' perspective on MARL. Specifically, we argue that causality can offer improved safety, interpretability, and robustness, while also providing strong theoretical guarantees for emergent behaviour. We discuss potential causality related solutions for common challenges, and use this context to motivate future research directions. Finally, we provide some perspectives on open problems. 
St John Grimbly · Jonathan Shock · Arnu Pretorius 🔗 


Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning
(
Poster
)
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks. 
Guy Tennenholtz · Assaf Hallak · Gal Dalal · Shie Mannor · Gal Chechik · Uri Shalit 🔗 


Multiple imputation via state space model for missing data in nonstationary multivariate time series
(
Poster
)
Missing data is a ubiquitous problem in biomedical and social science research and almost all areas involving data collection. Data imputation is commonly recommended to make the best use of valuable data at hand and gain more efficient estimates of quantities of interest. Mobile technology (e.g., mobile phones and wearable devices) allows an unprecedented way to closely monitor individuals' behavior, social interaction, symptoms, and other health conditions in realtime, holding great potential for scientific discoveries in psychiatric research and advancements in the treatment of severe mental illness. Continuous data collection using mobile technology gives rise to a new type of data  entangled multivariate time series of outcome, exposure, and covariates, and poses new challenges in missing data imputation for valid causal inference. Most existing imputation methods are either designed for longitudinal data with limited followup times or for stationary time series, which may not be suitable in the field of psychiatry when mental health symptoms display dramatic changes over time or patients experience shifts in treatment regime over their course of recovery. We propose a novel multiple imputation method based on the state space model (SSMmp) to address missing data in multivariate time series that are potentially nonstationary. We evaluate its theoretical properties and performance in extensive simulations of both stationary and nonstationary time series under different missing mechanisms, showing its advantages over other commonly used strategies for missing data. We apply the SSMmp method in the analysis of a multiyear observational smartphone study of bipolar patients to evaluate the association between social network size and psychiatric symptoms adjusting for confounding. 
Xiaoxuan Cai · Linda Valeri 🔗 


Practical Policy Optimization with PersonalizedExperimentation
(
Poster
)
Many organizations measure treatment effects via an experimentation platform to evaluate the casual effect of product variations prior to fullscale deployment [3,25]. However, standard experimentation platforms do not perform optimally for end user populations that exhibit heterogeneous treatment effects (HTEs) [3,5]. Here we present a personalized experimentation framework, Personalized Experiments (PEX), which optimizes treatment group assignment at the user level via HTE modeling and sequential decision policy optimization to optimize multiple short term and long term outcomes simultaneously. We describe an endtoend workflow that has proven to be successful in practice and can be readily implemented using opensource software 
Mia Garrard · Hanson Wang · Ben Letham · Zehui Wang · Yin Huang · Yichun Hu · Chad Zhou · Norm Zhou · Eytan Bakshy 🔗 


Dynamic Causal Discovery in Imitation Learning
(
Poster
)
Using deep reinforcement learning (DRL) to recover expert policies via imitation has been found to be promising in a wide range of applications. However, it remains a difficult task to interpret the control policy learned by the agent. Difficulties mainly come from two aspects: 1) agents in DRL are usually implemented as deep neural networks (DNNs), which are blackbox models and lack in interpretability; 2) the latent causal mechanism behind agents' decisions may vary along the trajectory, rather than staying static throughout time steps. To address these difficulties, in this paper, we propose a selfexplaining imitation framework, which can expose causal relations among states and action variables behind its decisions. Specifically, a dynamic causal discovery module is designed to extract the causal graph basing on historical trajectory and current states at each time step, and a causality encoding module is designed to model the interactions among variables with discovered causal edges. After encoding causality into variable embeddings, a prediction model conducts the imitation learning on top of obtained representations. These three components are trained endtoend, and discovered causal edges can provide interpretations on rules captured by the agent. Comprehensive experiments are conducted on the simulation dataset to analyze its causal discovery capacity, and we further test it on a realworld medical dataset MIMICIV. Experimental results demonstrate its potential of providing explanations behind decisions. 
Wenchao Yu 🔗 


A Validation Tool for Designing Reinforcement Learning Environments
(
Poster
)
Reinforcement learning (RL) has gained increasing attraction in the academia and tech industry with launches to a variety of impactful applications and products. Although research is being actively conducted on many fronts (e.g., offline RL, performance, etc.), many RL practitioners face a challenge that has been largely ignored: determine whether a designed Markov Decision Process (MDP) is valid and meaningful. This study proposes a heuristicbased feature analysis method to validate whether an MDP is well formulated. We believe an MDP suitable for applying RL should contain a set of state features that are both sensitive to actions and predictive in rewards. We tested our method in constructed environments showing that our approach can identify certain invalid environment formulations. As far as we know, performing validity analysis for RL problem formulation is a novel direction. We envision that our tool will serve as a motivational example to help practitioners apply RL in realworld problems more easily. 
RUIYANG XU · Zhengxing Chen 🔗 


What Would the Expert $do(\cdot)$?: Causal Imitation Learning
(
Poster
)
We develop algorithms for imitation learning from data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch onto, leading to poor policy performance. By utilizing the effect of past states on current states, we are able to break up these spurious correlations, an application of the econometric technique of instrumental variable regression. This insight leads to two novel algorithms, one of a generativemodeling flavor (DoubIL) that can utilize access to a simulator and one of a gametheoretic flavor (ResiduIL) that can be run offline. Both approaches are able to find policies that match the result of a query to an unconfounded expert. We find both algorithms compare favorably to noncausal approaches on simulated control problems. 
Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu 🔗 


Learning Treatment Effects in Panels with General Intervention Patterns
(
Poster
)
The problem of causal inference with panel data is a central econometric question. The following is a fundamental version of this problem: Let $M^*$ be a lowrank matrix and $E$ be a zeromean noise matrix. For a `treatment' matrix $Z$ with entries in $\{0,1\}$ we observe the matrix $O$ with entries $O_{ij} := M^*_{ij} + E_{ij} + \mathcal{T}_{ij} Z_{ij}$ where $\mathcal{T}_{ij} $ are unknown, heterogeneous treatment effects. The problem requires we estimate the average treatment effect $\tau^* := \sum_{ij} \mathcal{T}_{ij} Z_{ij} / \sum_{ij} Z_{ij}$. The synthetic control paradigm provides an approach to estimating $\tau^*$ when $Z$ places support on a single row. This paper extends that framework to allow rateoptimal recovery of $\tau^*$ for general $Z$, thus broadly expanding its applicability. Our guarantees are the first of their type in this general setting. Computational experiments on synthetic and realworld data show a substantial advantage over competing estimators.

Vivek Farias · Andrew Li · Tianyi Peng 🔗 


Partitionbased Local Independence Discovery
(
Poster
)
A system often exhibits more finegrained causal relationships between a variable and its parents, which we call local independence relationships including ContextSpecific Independence (CSI). It plays a crucial role in probabilistic inference and causal effect identification. However, for the continuous variables, directly searching for the local independence is infeasible and it is still unclear how to effectively discover them. To this end, we propose a novel framework to discover the local independences of continuous variables by partitioning the joint outcome space of the parent variables and imposing each partition to induce the local independence via modeling of a conditional distribution. In experiments with both static and sequential settings, we empirically demonstrated that the proposed method successfully discovers the ground truth local independencies. 
Inwoo Hwang · ByoungTak Zhang · Sanghack Lee 🔗 


Understanding User Podcast Consumption Using Sequential Treatment Effect Estimation
(
Poster
)
Podcast recommendation systems are increasingly becoming prevalent with the growing popularity of podcasts. Accounting for the sequential nature of user interaction with such systems is cardinal to ensure that the recommendations are meaningful to the users. However, while there is growing work in effectively designing the recommendation systems, there is limited understanding of how the recommendations affect user behavior, i.e., what is the causal effect of a recommendation on longterm user behavior. In this work, we seek to explain the sequential causal effect of consuming a podcast category on longterm user behavior. We estimate the direct and indirect effect of user consumption, here, podcast category, on consecutive outcomes, i.e, aggregate weekly user podcast consumption observed across four weeks. We observe nonzero, nonhomogeneous direct and indirect effects on outcomes across all weeks for multiple treatments, i.e. podcast categories. Moreover, consuming certain categories such as true crime have overall higher causal effects than others, highlighting that some podcast categories can alter user behavior resulting in longterm increased consumption. 
Vishwali Mhasawade · Praveen Chandar · Ghazal Fazelnia · Benjamin Carterette 🔗 


A Variational Information Bottleneck Principle for Recurrent Neural Networks
(
Poster
)
We propose an information bottleneck principle for causal timeseries prediction. We develop variational bounds on the information bottleneck objective function that can be optimized using recurrent neural networks. We implement our algorithm on simulated data as well as realworld weatherprediction and stock marketprediction datasets and show that these problems can be successfully solved using the new information bottleneck principle. 
ADRIAN TOVAR · Varun Jog 🔗 


OffPolicy Evaluation with Embedded Spaces
(
Poster
)
Slate recommendation systems are commonly evaluated prior to deployment using offpolicy evaluation methods, whereby data collected under the old logging policy is used to predict the performance of a new target policy. However, in practice most recommendation systems are not observed to recommend the vast majority of items, which is an issue since existing methods require that the probability of the target policy recommending an item can only be nonzero when the probability of the logging policy is nonzero. To circumvent this issue, we explore the use of item embeddings. By representing queries and slates in an embedding space, we are able to share information to extrapolate behaviors for queries and items that have not been seen yet. 
Jaron Jia Rong Lee · David Arbour · Georgios Theocharous 🔗 


Algorithms for Adaptive Experiments that Tradeoff Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization
(
Poster
)
Multiarmed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign more participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one, and failing to conclude there is a difference in arms when there truly is one . We present simulations for 2arm experiments that explore two algorithms that combine the benefits of uniform randomization for statistical analysis, with the benefits of reward maximization achieved by Thompson sampling (TS). First, TopTwo Thompson Sampling adds a fixed amount of uniform random allocation (UR) spread evenly over time. Second, a novel heuristic algorithm, called TSPostDiff (Posterior Probability of Difference). TSPostDiff takes a Bayesian approach to mixing TS and UR: the probability a participant is assigned using UR allocation is the posterior probability that the difference between two arms is `small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We find that TSPostDiff method performs well across multiple effect sizes, and thus does not require tuning based on a guess for the true effect size. 
Jacob Nogas · Arghavan Modiri · · Sofia Villar · Audrey Durand · Anna Rafferty · Joseph Williams 🔗 


The Limits to Learning a Diffusion Model
(
Poster
)
This paper provides the first sample complexity lower bounds for the estimation of simple diffusion models, including the Bass model (used in modeling consumer adoption) and the SIR model (used in modeling epidemics). We show that one cannot hope to learn such models until quite late in the diffusion. Specifically, we show that the time required to collect a number of observations that exceeds our sample complexity lower bounds is large. For Bass models with low innovation rates, our results imply that one cannot hope to predict the eventual number of adopting customers until one is at least twothirds of the way to the time at which the rate of new adopters is at its peak. In a similar vein, our results imply that in the case of an SIR model, one cannot hope to predict the eventual number of infections until one is approximately twothirds of the way to the time at which the infection rate has peaked. These limits are borne out in both product adoption data (Amazon), as well as epidemic data (COVID19). 
Jackie Baek · Vivek Farias · ANDREEA GEORGESCU · Retsef Levi · Tianyi Peng · Joshua Wilde · Andrew Zheng 🔗 


Beyond Ads: Sequential DecisionMaking Algorithmsin Public Policy
(
Poster
)
We explore the promises and challenges of employing sequential decisionmaking algorithms  such as bandits, reinforcement learning, and active learning  in the public sector. While such algorithms have been heavily studied in settings that are suitable for the private sector (e.g., online advertising), the public sector could greatly benefit from these approaches, but poses unique methodological challenges for machine learning. We highlight several applications of sequential decisionmaking algorithms in regulation and governance, and discuss areas for further research which would enable them to be more widely applicable, fair, and effective. In particular, ensuring that these systems learn rational, causal decisionmaking policies can be difficult and requires great care. We also note the potential risks of such deployments and urge caution when conducting work in this area. We hope our work inspires more investigation of publicsector sequential decision making applications, which provide unique challenges for machine learning researchers and can be socially beneficial. 
Peter Henderson · Brandon Anderson · Daniel Ho 🔗 


Double/Debiased Machine Learning for Dynamic Treatment Effects via $g$Estimation
(
Poster
)
We consider the estimation of treatment effects in settings when multiple treatments are assigned over time and treatments can have a causal effect on future outcomes or the state of the treated unit. We propose an extension of the double/debiased machine learning framework to estimate the dynamic effects of treatments, which can be viewed as a Neyman orthogonal (locally robust) crossfitted version of $g$estimation in the dynamic treatment regime. Our method applies to a general class of nonlinear dynamic treatment models known as Structural Nested Mean Models and allows the use of machine learning methods to control for potentially high dimensional state variables, subject to a mean square error guarantee, while still allowing parametric estimation and construction of confidence intervals for the structural parameters of interest. These structural parameters can be used for offpolicy evaluation of any target dynamic policy at parametric rates, subject to semiparametric restrictions on the data generating process. Our work is based on a recursive peeling process, typical in $g$estimation, and formulates a strongly convex objective at each stage, which allows us to extend the $g$estimation framework in multiple directions: i) to provide finite sample guarantees, ii) to estimate nonlinear effect heterogeneity with respect to fixed unit characteristics, within arbitrary function spaces, enabling a dynamic analogue of the RLearner algorithm for heterogeneous effects, iii) to allow for highdimensional sparse parameterizations of the target structural functions, enabling automated model selection via a recursive lasso algorithm. We also provide guarantees for data stemming from a single treated unit over a long horizon and under stationarity conditions.

Greg Lewis · Vasilis Syrgkanis 🔗 


Estimating the LongTerm Effects of Novel Treatments
(
Poster
)
Policy makers often need to estimate the longterm effects of novel treatments, while only having historical data of older treatment options. We propose a surrogatebased approach using a longterm dataset where only past treatments were administered and a shortterm dataset where novel treatments have been administered. Our approach generalizes previous surrogatestyle methods, allowing for continuous treatments and seriallycorrelated treatment policies while maintaining consistency and rootn asymptotically normal estimates under a Markovian assumption on the data and the observational policy. Using a semisynthetic dataset on customer incentives from a major corporation, we evaluate the performance of our method and discuss solutions to practical challenges when deploying our methodology. 
Keith Battocchi · Maggie Hei · Greg Lewis · Miruna Oprescu · Vasilis Syrgkanis 🔗 


DeviationBased Learning
(
Oral
)
We propose deviationbased learning, a new approach to training recommender systems. In the beginning, the recommender and rational users have different pieces of knowledge, and the recommender needs to learn the users' knowledge to make better recommendations. The recommender learns users' knowledge by observing whether each user followed or deviated from her recommendations. We show that learning frequently stalls if the recommender always recommends a choice: users tend to follow the recommendation blindly, and their choices do not reflect their knowledge. Social welfare and the learning rate are improved drastically if the recommender abstains from recommending a choice when she predicts that multiple arms will produce a similar payoff. 
Junpei Komiyama · Shunya Noda 🔗 


MAGNET: MultiAgent Graph Cooperative Bandits
(
Oral
)
We consider the online optimization for multiple interactive agents who cooperate to achieve a global best reward. By interaction, one agent's action and reward will influence other agents' rewards. Yet, the existing cooperative bandits primarily assume that agents are independent of each other. The multiagent reinforcement learning methods, on the other hand, mainly rely on deep learning, and thus hard to interpret or achieve a theoretical guarantee. In this work, we leverage the idea from the structural equation model to characterize different interactions among agents simultaneously, and propose the multiagent graph cooperative bandits (MAGNET) that integrates techniques from online decision making, graph structure modeling, and multiagent system. We allow heterogeneous individual rewards in the cooperative agents, and the interaction information partially known or completely unknown. Depending on the scenario if the global objective is known or not, the framework consists of a global optimizer for a central controller, and a local optimizer for decentralized agents, respectively. We derive the regret bound for the global optimizer as $\mathcal{O}( \sqrt{KT})$ and for the local optimizer as $\mathcal{O}(K \sqrt{T})$. Extensive simulation studies show the empirical validity of the proposed methods.

Hengrui Cai · Rui Song 🔗 


Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning
(
Oral
)
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks. 
Guy Tennenholtz · Assaf Hallak · Gal Dalal · Shie Mannor · Gal Chechik · Uri Shalit 🔗 


What Would the Expert $do(\cdot)$?: Causal Imitation Learning
(
Oral
)
We develop algorithms for imitation learning from data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch onto, leading to poor policy performance. By utilizing the effect of past states on current states, we are able to break up these spurious correlations, an application of the econometric technique of instrumental variable regression. This insight leads to two novel algorithms, one of a generativemodeling flavor (DoubIL) that can utilize access to a simulator and one of a gametheoretic flavor (ResiduIL) that can be run offline. Both approaches are able to find policies that match the result of a query to an unconfounded expert. We find both algorithms compare favorably to noncausal approaches on simulated control problems. 
Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu 🔗 


The Limits to Learning a Diffusion Model
(
Oral
)
This paper provides the first sample complexity lower bounds for the estimation of simple diffusion models, including the Bass model (used in modeling consumer adoption) and the SIR model (used in modeling epidemics). We show that one cannot hope to learn such models until quite late in the diffusion. Specifically, we show that the time required to collect a number of observations that exceeds our sample complexity lower bounds is large. For Bass models with low innovation rates, our results imply that one cannot hope to predict the eventual number of adopting customers until one is at least twothirds of the way to the time at which the rate of new adopters is at its peak. In a similar vein, our results imply that in the case of an SIR model, one cannot hope to predict the eventual number of infections until one is approximately twothirds of the way to the time at which the infection rate has peaked. These limits are borne out in both product adoption data (Amazon), as well as epidemic data (COVID19). 
Jackie Baek · Vivek Farias · ANDREEA GEORGESCU · Retsef Levi · Tianyi Peng · Joshua Wilde · Andrew Zheng 🔗 