Workshop
Causal Inference Challenges in Sequential Decision Making: Bridging Theory and Practice
Aurelien Bibaut · Maria Dimakopoulou · Nathan Kallus · Xinkun Nie · Masatoshi Uehara · Kelly Zhang

Tue Dec 14 10:50 AM -- 06:40 PM (PST)
Event URL: https://sites.google.com/view/causal-sequential-decisions/home

Sequential decision-making problems appear in settings as varied as healthcare, e-commerce, operations management, and policymaking, and depending on the context each problem has distinctive features. Problems can involve online learning or offline data; known cost structures or unknown counterfactuals; continuous actions, with or without constraints, or finite or combinatorial actions; stationary environments or environments with dynamic agents; and utilitarian, fairness, or equity considerations. Increasingly, causal inference and discovery and adjacent statistical theories have come to bear on such problems, from the early work on longitudinal causal inference in the last millennium up to recent developments in bandit algorithms and inference, dynamic treatment regimes, online and offline reinforcement learning, interventions in general causal graphs and their discovery, and more. While the interaction between these theories has grown, expertise remains spread across many disciplines, including CS/ML, (bio)statistics, econometrics, ethics/law, and operations research.

The primary purpose of this workshop is to convene experts, practitioners, and interested young researchers from a wide range of backgrounds to discuss recent developments around causal inference in sequential decision making and the avenues forward on the topic, especially ones that bring together ideas from different fields. The all-virtual nature of this year's NeurIPS workshop makes it particularly well suited to such an assembly. The workshop will combine invited talks and panels by a diverse group of researchers and practitioners from both academia and industry with contributed talks and town-hall Q&A that will particularly seek to draw in younger individuals new to the area.


-

We study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds. Our results demonstrate the ability to take advantage of confounded offline data. In particular, we prove regret bounds that improve upon current bounds by a factor related to the visible dimensionality of the contexts in the data. Our results indicate that confounded offline data can significantly improve online learning algorithms. Finally, we demonstrate various characteristics of our approach through synthetic simulations.

Guy Tennenholtz · Uri Shalit · Shie Mannor · Yonathan Efroni
-

We propose deviation-based learning, a new approach to training recommender systems. In the beginning, the recommender and rational users have different pieces of knowledge, and the recommender needs to learn the users' knowledge to make better recommendations. The recommender learns users' knowledge by observing whether each user followed or deviated from her recommendations. We show that learning frequently stalls if the recommender always recommends a choice: users tend to follow the recommendation blindly, and their choices do not reflect their knowledge. Social welfare and the learning rate are improved drastically if the recommender abstains from recommending a choice when she predicts that multiple arms will produce a similar payoff.

Junpei Komiyama · Shunya Noda
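
To make the abstention rule concrete, here is a minimal sketch of a deviation-based recommender for two Bernoulli arms; it illustrates the idea rather than the authors' implementation, and the Beta priors and the margin EPS are assumptions.

```python
import numpy as np

# Beta posteriors over two Bernoulli arm payoffs -- uniform priors assumed.
alpha, beta = np.ones(2), np.ones(2)
EPS = 0.05  # abstain when estimated payoffs are within this margin (assumed)

def recommend():
    """Return an arm to recommend, or None to abstain."""
    means = alpha / (alpha + beta)
    if abs(means[0] - means[1]) < EPS:
        return None  # abstain: the user's own choice then reveals her knowledge
    return int(np.argmax(means))

def update(arm, followed):
    """Learn from whether the user followed (1) or deviated from (0) the recommendation."""
    alpha[arm] += followed
    beta[arm] += 1 - followed
```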
-
We consider online optimization for multiple interacting agents that cooperate to achieve the best global reward. Through interaction, one agent's action and reward influence other agents' rewards. Yet existing cooperative bandits primarily assume that agents are independent of each other, while multi-agent reinforcement learning methods mainly rely on deep learning and are thus hard to interpret or equip with theoretical guarantees. In this work, we leverage ideas from structural equation models to characterize different interactions among agents simultaneously, and propose multi-agent graph cooperative bandits (MAGNET), which integrates techniques from online decision making, graph structure modeling, and multi-agent systems. We allow heterogeneous individual rewards among the cooperative agents, and allow the interaction information to be partially known or completely unknown. Depending on whether the global objective is known, the framework provides a global optimizer for a central controller or a local optimizer for decentralized agents, respectively. We derive regret bounds of $\mathcal{O}(\sqrt{KT})$ for the global optimizer and $\mathcal{O}(K \sqrt{T})$ for the local optimizer. Extensive simulation studies show the empirical validity of the proposed methods.
Hengrui Cai · Rui Song
-

Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action --- suggesting optimal adaptivity --- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations --- suggesting unusual robustness.

Chao Qin · Daniel Russo
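
The day-of-week issue and the spirit of the fix can be illustrated with Thompson sampling on a Bayesian linear model that includes the day effect explicitly. This is a simplified stand-in for deconfounded Thompson sampling, with assumed dimensions, priors, and a known noise variance:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 7              # number of arms and day-of-week levels (assumed)
d = K + D                # feature dimension: arm one-hot plus day one-hot
SIGMA2 = 1.0             # known noise variance -- a simplifying assumption
A = np.eye(d)            # posterior precision (standard-normal prior)
b = np.zeros(d)          # accumulated X^T y / sigma^2

def features(arm, day):
    x = np.zeros(d)
    x[arm] = 1.0
    x[K + day] = 1.0     # model the day effect instead of ignoring it
    return x

def choose(day):
    # Sample coefficients from the Gaussian posterior and compare arms on
    # *today's* context, so a shared day effect cancels across arms.
    theta = rng.multivariate_normal(np.linalg.solve(A, b), np.linalg.inv(A))
    return int(np.argmax([theta @ features(a, day) for a in range(K)]))

def update(arm, day, reward):
    global A, b
    x = features(arm, day)
    A = A + np.outer(x, x) / SIGMA2
    b = b + x * reward / SIGMA2
```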
-
This paper derives time-uniform confidence sequences (CS) for causal effects in experimental and observational settings. A confidence sequence for a target parameter $\psi$ is a sequence of confidence intervals $(C_t)_{t=1}^\infty$ such that every one of these intervals simultaneously captures $\psi$ with high probability. Such CSs provide valid statistical inference for $\psi$ at arbitrary stopping times, unlike classical fixed-time confidence intervals, which require the sample size to be fixed in advance. Existing methods for constructing CSs focus on the nonasymptotic regime where certain assumptions (such as known bounds on the random variables) are imposed, while doubly robust estimators of causal effects rely on (asymptotic) semiparametric theory. We use sequential versions of central limit theorem arguments to construct large-sample CSs for causal estimands, with a particular focus on the average treatment effect (ATE) under nonparametric conditions. These CSs allow analysts to update inferences about the ATE in light of new data, and experiments can be continuously monitored, stopped, or continued for any data-dependent reason, all while controlling the type-I error. Finally, we describe how these CSs readily extend to other causal estimands and estimators, providing a new framework for sequential causal inference in a wide array of problems.
Ian Waudby-Smith · David Arbour · Ritwik Sinha · Edward Kennedy · Aaditya Ramdas
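
For intuition, the sketch below tracks a difference in means in a streaming Bernoulli experiment and wraps it in a time-uniform interval. The boundary is a generic asymptotic-CS-style width with illustrative tuning parameters, standing in for the paper's construction:

```python
import numpy as np

def cs_half_width(t, var_hat, rho=1.0, alpha=0.05):
    """Time-uniform half-width (an asymptotic-CS-style boundary).

    Unlike the fixed-t width z * sqrt(var_hat / t), this boundary is meant
    to hold simultaneously over all t; rho tunes where it is tightest.
    """
    v = var_hat * rho**2 * t + 1.0
    return np.sqrt(2.0 * v * np.log(np.sqrt(v) / alpha)) / (t * rho)

# Streaming difference-in-means in a synthetic Bernoulli experiment.
rng = np.random.default_rng(0)
n, p1, p0 = 5000, 0.55, 0.50
z = rng.integers(0, 2, n)                      # randomized assignment
y = (rng.random(n) < np.where(z == 1, p1, p0)).astype(float)
for t in (100, 1000, 5000):
    zt, yt = z[:t], y[:t]
    ate = yt[zt == 1].mean() - yt[zt == 0].mean()
    # variance of the difference in means, per unit sample
    var_hat = yt[zt == 1].var() / (zt == 1).mean() + yt[zt == 0].var() / (zt == 0).mean()
    print(t, round(ate, 3), "+/-", round(cs_half_width(t, var_hat), 3))
```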
-

Randomized experiments are often performed to study causal effects of interest. Blocking is a technique for estimating those effects more precisely when the experimental material is not homogeneous. We formalize the problem of obtaining a statistically optimal set of covariates with which to create blocks in a randomized experiment. We provide a graphical test to obtain such a set for a general semi-Markovian causal model. We also pose, and provide ideas toward solving, the more general problem of obtaining an optimal blocking set that accounts for both the statistical and economic costs of blocking.

Abhishek Kumar Umrawal
-

We propose kernel ridge regression estimators for causal inference in multistage settings. Specifically, we focus on (i) the decomposition of a total effect into a direct effect and an indirect effect (mediated by a particular mechanism); and (ii) effects of sequences of treatments. We allow treatment, covariates, and mediators to be discrete or continuous, and low, high, or infinite dimensional. We propose estimators of means, increments, and distributions of counterfactual outcomes. Important examples are (i) direct and indirect dose-response curves; and (ii) dynamic dose-response curves. Each estimator has a closed-form solution and is easily computed by kernel matrix operations. For the nonparametric case, we prove uniform consistency and provide finite-sample rates of convergence. For the semiparametric case, we prove root-n consistency, Gaussian approximation, and semiparametric efficiency. We evaluate our estimators in simulations, then estimate mediated and dynamic treatment effects of the US Job Corps training program for disadvantaged youth.

Rahul Singh · Liyuan Xu · Arthur Gretton
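
The closed form "by kernel matrix operations" is easy to see in the simplest case, a dose-response curve: fit a kernel ridge regression of the outcome on (dose, covariates), then average the fitted surface over the empirical covariates. The RBF kernel, length-scale, and ridge penalty below are illustrative assumptions, not the authors' tuned choices:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """RBF (Gaussian) kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls**2))

def dose_response(D, X, Y, d_grid, lam=1e-1):
    """theta(d) = (1/n) sum_i mu_hat(d, X_i), where mu_hat is a kernel ridge
    regression of Y on (D, X). Closed form: alpha = (K + n*lam*I)^{-1} Y."""
    Z = np.column_stack([D, X])
    n = len(Y)
    alpha = np.linalg.solve(rbf(Z, Z) + n * lam * np.eye(n), Y)
    out = []
    for d in d_grid:
        Zd = np.column_stack([np.full(n, d), X])   # set dose to d for everyone
        out.append(rbf(Zd, Z) @ alpha)             # mu_hat(d, X_i) for all i
    return np.array(out).mean(axis=1)              # average over covariates

# Synthetic check: quadratic dose response, confounded by X.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1))
D = X[:, 0] + rng.normal(size=400)
Y = 2 * D - D**2 + X[:, 0] + rng.normal(size=400)
print(dose_response(D, X, Y, d_grid=[-1.0, 0.0, 1.0]))
```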
-
Learning a near-optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of $M$ possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near-optimal policy in the two-model case ($M = 2$). Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching reward models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomial-time algorithm that finds an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions, respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.
Jeongyeol Kwon · Yonathan Efroni · Constantine Caramanis · Shie Mannor
-

This paper studies an instance of the multi-armed bandit (MAB) problem, specifically one where several causal MABs operate chronologically in the same dynamical system. In practice, the reward distribution of each bandit is governed by the same non-trivial dependence structure, namely a dynamic causal model: dynamic because we allow each causal MAB to depend on the preceding one, and in doing so we are able to transfer information between agents. Our contribution, the chronological causal bandit (CCB), is useful in discrete decision-making settings where the causal effects change across time and can be informed by earlier interventions in the same system. In this paper we present some early findings on the CCB, as demonstrated on a toy problem.

Neil Dhir
-

This paper serves to introduce the reader to the field of multi-agent reinforcement learning (MARL) and its intersection with methods from the study of causality. We highlight key challenges in MARL and discuss these in the context of how causal methods may assist in tackling them. We promote moving toward a 'causality first' perspective on MARL. Specifically, we argue that causality can offer improved safety, interpretability, and robustness, while also providing strong theoretical guarantees for emergent behaviour. We discuss potential causality-related solutions for common challenges, and use this context to motivate future research directions. Finally, we provide some perspectives on open problems.

St John Grimbly · Jonathan Shock · Arnu Pretorius
-

We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.

Guy Tennenholtz · Assaf Hallak · Gal Dalal · Shie Mannor · Gal Chechik · Uri Shalit
-

Missing data is a ubiquitous problem in biomedical and social science research and almost all areas involving data collection. Data imputation is commonly recommended to make the best use of valuable data at hand and gain more efficient estimates of quantities of interest. Mobile technology (e.g., mobile phones and wearable devices) allows an unprecedented way to closely monitor individuals' behavior, social interaction, symptoms, and other health conditions in real-time, holding great potential for scientific discoveries in psychiatric research and advancements in the treatment of severe mental illness. Continuous data collection using mobile technology gives rise to a new type of data -- entangled multivariate time series of outcome, exposure, and covariates, and poses new challenges in missing data imputation for valid causal inference. Most existing imputation methods are either designed for longitudinal data with limited follow-up times or for stationary time series, which may not be suitable in the field of psychiatry when mental health symptoms display dramatic changes over time or patients experience shifts in treatment regime over their course of recovery. We propose a novel multiple imputation method based on the state space model (SSMmp) to address missing data in multivariate time series that are potentially non-stationary. We evaluate its theoretical properties and performance in extensive simulations of both stationary and non-stationary time series under different missing mechanisms, showing its advantages over other commonly used strategies for missing data. We apply the SSMmp method in the analysis of a multi-year observational smartphone study of bipolar patients to evaluate the association between social network size and psychiatric symptoms adjusting for confounding.

Xiaoxuan Cai · Linda Valeri
-

Many organizations measure treatment effects via an experimentation platform to evaluate the causal effect of product variations prior to full-scale deployment [3,25]. However, standard experimentation platforms do not perform optimally for end-user populations that exhibit heterogeneous treatment effects (HTEs) [3,5]. Here we present a personalized experimentation framework, Personalized Experiments (PEX), which assigns treatment groups at the user level via HTE modeling and sequential decision policy optimization so as to optimize multiple short-term and long-term outcomes simultaneously. We describe an end-to-end workflow that has proven successful in practice and can be readily implemented using open-source software.

Mia Garrard · Hanson Wang · Ben Letham · Zehui Wang · Yin Huang · Yichun Hu · Chad Zhou · Norm Zhou · Eytan Bakshy
-

Using deep reinforcement learning (DRL) to recover expert policies via imitation has been found promising in a wide range of applications. However, it remains difficult to interpret the control policy learned by the agent. The difficulties mainly come from two aspects: 1) agents in DRL are usually implemented as deep neural networks (DNNs), which are black-box models lacking interpretability; 2) the latent causal mechanism behind agents' decisions may vary along the trajectory, rather than staying static across time steps. To address these difficulties, in this paper we propose a self-explaining imitation framework, which can expose the causal relations among state and action variables behind its decisions. Specifically, a dynamic causal discovery module is designed to extract the causal graph based on the historical trajectory and current states at each time step, and a causality encoding module is designed to model the interactions among variables with discovered causal edges. After encoding causality into variable embeddings, a prediction model conducts imitation learning on top of the obtained representations. These three components are trained end-to-end, and discovered causal edges can provide interpretations of the rules captured by the agent. Comprehensive experiments are conducted on a simulation dataset to analyze its causal discovery capacity, and we further test it on the real-world medical dataset MIMIC-IV. Experimental results demonstrate its potential for providing explanations behind decisions.

Wenchao Yu
-

Reinforcement learning (RL) has gained increasing attention in academia and the tech industry, with launches of a variety of impactful applications and products. Although research is being actively conducted on many fronts (e.g., offline RL, performance, etc.), many RL practitioners face a challenge that has been largely ignored: determining whether a designed Markov decision process (MDP) is valid and meaningful. This study proposes a heuristic-based feature analysis method to validate whether an MDP is well formulated. We believe an MDP suitable for applying RL should contain a set of state features that are both sensitive to actions and predictive of rewards. We tested our method in constructed environments, showing that our approach can identify certain invalid environment formulations. As far as we know, performing validity analysis for RL problem formulation is a novel direction. We envision that our tool will serve as a motivational example to help practitioners apply RL in real-world problems more easily.

Ruiyang Xu · Zhengxing Chen
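
A hedged sketch of the two checks described above, applied to logged transitions: score each state feature by how much its expected change varies with the action, and by how strongly it correlates with reward. Treating correlation as the "predictive" score and the toy data layout are assumptions for illustration:

```python
import numpy as np

def feature_checks(S, A, R, S_next):
    """Heuristic validity checks on logged (s, a, r, s') transitions.

    For each state feature j:
      - action sensitivity: spread of E[s'_j - s_j | a] across actions
      - reward signal: |corr(s_j, r)|
    Low scores on both suggest the feature is inert in this MDP design.
    """
    n, d = S.shape
    actions = np.unique(A)
    delta = S_next - S
    sens = np.array([
        np.ptp([delta[A == a, j].mean() for a in actions])  # max - min over actions
        for j in range(d)
    ])
    pred = np.array([abs(np.corrcoef(S[:, j], R)[0, 1]) for j in range(d)])
    return sens, pred

# Toy run: feature 0 reacts to the action and drives reward; feature 1 is noise.
rng = np.random.default_rng(0)
n = 2000
S = rng.normal(size=(n, 2))
A = rng.integers(0, 2, n)
S_next = S.copy()
S_next[:, 0] += np.where(A == 1, 0.5, -0.5)
R = S[:, 0] + 0.1 * rng.normal(size=n)
print(feature_checks(S, A, R, S_next))
```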
-

We develop algorithms for imitation learning from data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch onto, leading to poor policy performance. By utilizing the effect of past states on current states, we are able to break up these spurious correlations, an application of the econometric technique of instrumental variable regression. This insight leads to two novel algorithms, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator and one of a game-theoretic flavor (ResiduIL) that can be run offline. Both approaches are able to find policies that match the result of a query to an unconfounded expert. We find both algorithms compare favorably to non-causal approaches on simulated control problems.

Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu
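
The econometric skeleton referenced here, instrumenting the current state with the past state, can be sketched as plain two-stage least squares; this is not DoubIL or ResiduIL themselves, and the linear toy model is an assumption:

```python
import numpy as np

def iv_policy_fit(S_prev, S, A):
    """Two-stage least squares: use s_{t-1} as an instrument for s_t.

    Stage 1: predict s_t from s_{t-1}, keeping only variation that flows
    through the dynamics and discarding the confounder's contribution.
    Stage 2: regress the expert action on the predicted state.
    """
    W1, *_ = np.linalg.lstsq(S_prev, S, rcond=None)      # stage 1
    S_hat = S_prev @ W1
    W2, *_ = np.linalg.lstsq(S_hat, A, rcond=None)       # stage 2
    return W2  # linear policy: a = s @ W2

# Toy confounded expert: u perturbs both the state and the action.
rng = np.random.default_rng(0)
n = 5000
s_prev = rng.normal(size=(n, 1))
u = rng.normal(size=(n, 1))                               # unobserved confounder
s = 0.9 * s_prev + u
a = 2.0 * s + u                                           # true structural coefficient: 2
print("OLS: ", np.linalg.lstsq(s, a, rcond=None)[0].ravel())   # biased away from 2
print("2SLS:", iv_policy_fit(s_prev, s, a).ravel())            # close to 2
```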
-
The problem of causal inference with panel data is a central econometric question. The following is a fundamental version of this problem: Let $M^*$ be a low-rank matrix and $E$ be a zero-mean noise matrix. For a 'treatment' matrix $Z$ with entries in $\{0,1\}$ we observe the matrix $O$ with entries $O_{ij} := M^*_{ij} + E_{ij} + \mathcal{T}_{ij} Z_{ij}$, where $\mathcal{T}_{ij}$ are unknown, heterogeneous treatment effects. The problem requires that we estimate the average treatment effect $\tau^* := \sum_{ij} \mathcal{T}_{ij} Z_{ij} / \sum_{ij} Z_{ij}$. The synthetic control paradigm provides an approach to estimating $\tau^*$ when $Z$ places support on a single row. This paper extends that framework to allow rate-optimal recovery of $\tau^*$ for general $Z$, thus broadly expanding its applicability. Our guarantees are the first of their type in this general setting. Computational experiments on synthetic and real-world data show a substantial advantage over competing estimators.
Vivek Farias · Andrew Li · Tianyi Peng
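
A natural baseline the setting suggests (not the paper's rate-optimal estimator) is hard-impute: fit a low-rank matrix to the control entries, then average $O - \hat{M}$ over the treated entries. The rank and iteration count below are illustrative:

```python
import numpy as np

def ate_low_rank(O, Z, rank=2, iters=100):
    """Estimate tau* = sum_ij T_ij Z_ij / sum_ij Z_ij via hard-impute.

    Iteratively fill treated cells with a rank-r SVD fit to the control
    entries, then read off the average treated-minus-fit gap.
    """
    M = np.where(Z == 1, O[Z == 0].mean(), O)   # initialize treated cells
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        M_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        M = np.where(Z == 1, M_low, O)          # keep observed control entries
    return (O - M_low)[Z == 1].mean()

# Synthetic panel: rank-1 baseline + noise + treatment effect 1.5.
rng = np.random.default_rng(0)
n = 60
M_star = np.outer(rng.normal(size=n), rng.normal(size=n))
Z = (rng.random((n, n)) < 0.15).astype(int)
O = M_star + 0.1 * rng.normal(size=(n, n)) + 1.5 * Z
print(ate_low_rank(O, Z, rank=1))   # ~1.5
```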
-

A system often exhibits fine-grained causal relationships between a variable and its parents, which we call local independence relationships, including context-specific independence (CSI). These play a crucial role in probabilistic inference and causal effect identification. However, for continuous variables, directly searching for local independences is infeasible, and it remains unclear how to discover them effectively. To this end, we propose a novel framework that discovers the local independences of continuous variables by partitioning the joint outcome space of the parent variables and requiring each partition to induce a local independence via the modeling of a conditional distribution. In experiments in both static and sequential settings, we empirically demonstrate that the proposed method successfully discovers the ground-truth local independences.

Inwoo Hwang · Byoung-Tak Zhang · Sanghack Lee
-

Podcast recommendation systems are increasingly prevalent with the growing popularity of podcasts. Accounting for the sequential nature of user interaction with such systems is cardinal to ensuring that recommendations are meaningful to users. However, while there is growing work on effectively designing recommendation systems, there is limited understanding of how recommendations affect user behavior, i.e., what the causal effect of a recommendation is on long-term user behavior. In this work, we seek to explain the sequential causal effect of consuming a podcast category on long-term user behavior. We estimate the direct and indirect effects of user consumption, here, podcast category, on consecutive outcomes, i.e., aggregate weekly user podcast consumption observed across four weeks. We observe nonzero, non-homogeneous direct and indirect effects on outcomes across all weeks for multiple treatments, i.e., podcast categories. Moreover, consuming certain categories, such as true crime, has an overall higher causal effect than others, highlighting that some podcast categories can alter user behavior, resulting in long-term increased consumption.

Vishwali Mhasawade · Praveen Chandar · Ghazal Fazelnia · Benjamin Carterette
-

We propose an information bottleneck principle for causal time-series prediction. We develop variational bounds on the information bottleneck objective function that can be optimized using recurrent neural networks. We implement our algorithm on simulated data as well as real-world weather-prediction and stock market-prediction datasets and show that these problems can be successfully solved using the new information bottleneck principle.

Adrian Tovar · Varun Jog
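
For readers new to the principle, the standard information bottleneck objective, written here for time-series prediction in generic notation (ours, not necessarily the paper's exact formulation), trades off compressing the history against retaining predictive information:

$$\min_{p(z_t \mid x_{1:t})} \; I(Z_t ; X_{1:t}) \;-\; \beta \, I(Z_t ; Y_{t+1}),$$

where $Z_t$ is the learned representation of the history $X_{1:t}$, $Y_{t+1}$ is the prediction target, and $\beta > 0$ controls the compression-prediction trade-off; the paper's contribution is variational bounds on such an objective that recurrent networks can optimize.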
-

Slate recommendation systems are commonly evaluated prior to deployment using off-policy evaluation methods, whereby data collected under the old logging policy is used to predict the performance of a new target policy. However, in practice most recommendation systems are never observed to recommend the vast majority of items, which is an issue because existing methods require that the target policy's probability of recommending an item be non-zero only where the logging policy's probability is non-zero. To circumvent this issue, we explore the use of item embeddings. By representing queries and slates in an embedding space, we are able to share information to extrapolate behaviors for queries and items that have not yet been seen.

Jaron Jia Rong Lee · David Arbour · Georgios Theocharous
-

Multi-armed bandit algorithms like Thompson sampling can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign more participants to more effective arms. Such assignment strategies increase the risk that statistical hypothesis tests identify a difference between arms when there is not one, and fail to conclude there is a difference when there truly is one (Rafferty et al., 2019). We present simulations for two-arm experiments that explore two algorithms combining the benefits of uniform randomization for statistical analysis with the benefits of reward maximization achieved by Thompson sampling (TS). First, Top-Two Thompson Sampling (Russo, 2016) adds a fixed amount of uniform random (UR) allocation spread evenly over time. Second, we introduce a novel heuristic algorithm called TS-PostDiff (Posterior Probability of Difference). TS-PostDiff takes a Bayesian approach to mixing TS and UR: the probability that a participant is assigned using UR allocation is the posterior probability that the difference between the two arms is 'small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We find that TS-PostDiff performs well across multiple effect sizes, and thus does not require tuning based on a guess of the true effect size.

Jacob Nogas · Arghavan Modiri · Sofia Villar · Audrey Durand · Anna Rafferty · Joseph Williams
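
A minimal sketch of the TS-PostDiff mixing rule for two Bernoulli arms; the uniform priors, the threshold c, and the Monte Carlo posterior approximation are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = np.ones(2), np.ones(2)   # Beta posteriors for the two arms
c = 0.05                               # 'small difference' threshold (assumed)

def assign():
    """TS-PostDiff: UR with probability P(|p1 - p2| < c | data), else TS."""
    draws = rng.beta(alpha[:, None], beta[:, None], size=(2, 2000))
    p_small = np.mean(np.abs(draws[0] - draws[1]) < c)  # posterior prob. of a small gap
    if rng.random() < p_small:
        return int(rng.integers(0, 2))                  # uniform random allocation
    return int(np.argmax(rng.beta(alpha, beta)))        # Thompson sampling

def update(arm, reward):
    alpha[arm] += reward
    beta[arm] += 1 - reward
```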
-

This paper provides the first sample complexity lower bounds for the estimation of simple diffusion models, including the Bass model (used in modeling consumer adoption) and the SIR model (used in modeling epidemics). We show that one cannot hope to learn such models until quite late in the diffusion. Specifically, we show that the time required to collect a number of observations that exceeds our sample complexity lower bounds is large. For Bass models with low innovation rates, our results imply that one cannot hope to predict the eventual number of adopting customers until one is at least two-thirds of the way to the time at which the rate of new adopters is at its peak. In a similar vein, our results imply that in the case of an SIR model, one cannot hope to predict the eventual number of infections until one is approximately two-thirds of the way to the time at which the infection rate has peaked. These limits are borne out in both product adoption data (Amazon), as well as epidemic data (COVID-19).

Jackie Baek · Vivek Farias · Andreea Georgescu · Retsef Levi · Tianyi Peng · Joshua Wilde · Andrew Zheng
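
For concreteness, the Bass model has closed forms for both cumulative adoption and the time at which the adoption rate peaks, so the two-thirds-of-the-way-to-the-peak threshold is easy to place on a timeline; the parameter values below are illustrative only:

```python
import numpy as np

def bass_cumulative(t, p, q, m):
    """Bass model cumulative adopters m * F(t) (standard closed form)."""
    e = np.exp(-(p + q) * t)
    return m * (1 - e) / (1 + (q / p) * e)

p, q, m = 0.01, 0.4, 1e6           # innovation, imitation, market size (assumed)
t_peak = np.log(q / p) / (p + q)   # time at which the adoption rate peaks
print("peak of adoption rate at t =", round(t_peak, 1))
print("two-thirds-of-peak threshold at t =", round(2 / 3 * t_peak, 1))
print("adopters observed by then:", int(bass_cumulative(2 / 3 * t_peak, p, q, m)))
```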
-

We explore the promises and challenges of employing sequential decision-making algorithms -- such as bandits, reinforcement learning, and active learning -- in the public sector. While such algorithms have been heavily studied in settings that are suitable for the private sector (e.g., online advertising), the public sector could greatly benefit from these approaches, but poses unique methodological challenges for machine learning. We highlight several applications of sequential decision-making algorithms in regulation and governance, and discuss areas for further research which would enable them to be more widely applicable, fair, and effective. In particular, ensuring that these systems learn rational, causal decision-making policies can be difficult and requires great care. We also note the potential risks of such deployments and urge caution when conducting work in this area. We hope our work inspires more investigation of public-sector sequential decision making applications, which provide unique challenges for machine learning researchers and can be socially beneficial.

Peter Henderson · Brandon Anderson · Daniel Ho
-
We consider the estimation of treatment effects in settings when multiple treatments are assigned over time and treatments can have a causal effect on future outcomes or the state of the treated unit. We propose an extension of the double/debiased machine learning framework to estimate the dynamic effects of treatments, which can be viewed as a Neyman orthogonal (locally robust) cross-fitted version of $g$-estimation in the dynamic treatment regime. Our method applies to a general class of non-linear dynamic treatment models known as Structural Nested Mean Models and allows the use of machine learning methods to control for potentially high dimensional state variables, subject to a mean square error guarantee, while still allowing parametric estimation and construction of confidence intervals for the structural parameters of interest. These structural parameters can be used for off-policy evaluation of any target dynamic policy at parametric rates, subject to semi-parametric restrictions on the data generating process. Our work is based on a recursive peeling process, typical in $g$-estimation, and formulates a strongly convex objective at each stage, which allows us to extend the $g$-estimation framework in multiple directions: i) to provide finite sample guarantees, ii) to estimate non-linear effect heterogeneity with respect to fixed unit characteristics, within arbitrary function spaces, enabling a dynamic analogue of the RLearner algorithm for heterogeneous effects, iii) to allow for high-dimensional sparse parameterizations of the target structural functions, enabling automated model selection via a recursive lasso algorithm. We also provide guarantees for data stemming from a single treated unit over a long horizon and under stationarity conditions.
Greg Lewis · Vasilis Syrgkanis
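
The Neyman-orthogonal building block behind the recursive peeling is the familiar cross-fitted partialling-out regression. Here is a single-period sketch with gradient-boosting nuisances as an arbitrary ML choice; the paper's full dynamic, multi-period estimator is substantially richer:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def dml_effect(X, T, Y):
    """Double ML partialling-out: residualize T and Y on X with cross-fitted
    ML, then regress residual on residual (orthogonal to nuisance errors)."""
    t_hat = cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)
    y_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
    t_res, y_res = T - t_hat, Y - y_hat
    return (t_res @ y_res) / (t_res @ t_res)

# High-dimensional confounding, true structural effect 2.0.
rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))
T = X[:, 0] + rng.normal(size=n)
Y = 2.0 * T + X[:, 0] ** 2 + rng.normal(size=n)
print(dml_effect(X, T, Y))   # ~2.0
```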
-

Policy makers often need to estimate the long-term effects of novel treatments, while only having historical data of older treatment options. We propose a surrogate-based approach using a long-term dataset where only past treatments were administered and a short-term dataset where novel treatments have been administered. Our approach generalizes previous surrogate-style methods, allowing for continuous treatments and serially-correlated treatment policies while maintaining consistency and root-n asymptotically normal estimates under a Markovian assumption on the data and the observational policy. Using a semi-synthetic dataset on customer incentives from a major corporation, we evaluate the performance of our method and discuss solutions to practical challenges when deploying our methodology.

Keith Battocchi · Maggie Hei · Greg Lewis · Miruna Oprescu · Vasilis Syrgkanis
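
The surrogate-index idea that this approach generalizes can be sketched in a few lines: learn E[long-term outcome | surrogates] on the historical dataset, apply it to the new experiment's surrogates, and contrast treatment groups. The linear surrogate index and the synthetic data are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def surrogate_ate(S_long, Y_long, S_short, T_short):
    """Surrogate-index estimate of a long-term ATE for a novel treatment.

    S_long, Y_long: surrogates and long-term outcome (historical data).
    S_short, T_short: surrogates and treatment flags (new experiment,
    long-term outcome not yet observed).
    """
    idx = LinearRegression().fit(S_long, Y_long)     # E[Y_long | surrogates]
    y_proj = idx.predict(S_short)
    return y_proj[T_short == 1].mean() - y_proj[T_short == 0].mean()

# Synthetic example where the surrogacy assumption holds by construction.
rng = np.random.default_rng(0)
S_long = rng.normal(size=(3000, 2))
Y_long = S_long @ np.array([1.0, 0.5]) + rng.normal(size=3000)
T_short = rng.integers(0, 2, 1000)
S_short = rng.normal(size=(1000, 2)) + np.outer(T_short, [0.8, 0.4])
print(surrogate_ate(S_long, Y_long, S_short, T_short))   # ~1.0
```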
-
Talk (Elias Bareinboim)
-
Talk (Susan Athey)
-
Talk (Razieh Nabi)
-
Talk (Rui Song)
-
Talk (Mark van der Laan)
-
Talk (Claire Vernade)

Author Information

Aurelien Bibaut (UC Berkeley)
Maria Dimakopoulou (Stanford University)
Nathan Kallus (Cornell University)
Xinkun Nie (Stanford University)
Masatoshi Uehara (Cornell University)
Kelly Zhang (Harvard University)
