`

Timezone: »

 
Workshop
Learning and Decision-Making with Strategic Feedback
Yahav Bechavod · Hoda Heidari · Eric Mazumdar · Celestine Mendler-Dünner · Tijana Zrnic

Tue Dec 14 07:00 AM -- 02:30 PM (PST) @ None
Event URL: https://www.strategic-ml.com/ »

Classical treatments of machine learning rely on the assumption that the data, after deployment, resembles the data the model was trained on. However, as machine learning models are increasingly used to make consequential decisions about people, individuals often react strategically to the deployed model. These strategic behaviors---which effectively invalidate the predictive models---have opened up new avenues of research and added new challenges to the deployment of machine learning algorithms in the real world.

Different aspects of strategic behavior have been studied by several communities both within and outside of machine learning. For example, the growing literature on strategic classification studies algorithms for finding strategy-robust decision rules, as well as the properties of such rules. Behavioral economics aims to understand and model people’s strategic responses. Recent works on learning in games study optimization algorithms for finding meaningful equilibria and solution concepts in competitive environments.

This workshop aims to create a dialogue between these different communities, all studying aspects of decision-making and learning with strategic feedback. The goal is to identify common points of interest and open problems in the different subareas, as well as to encourage cross-disciplinary collaboration.

Tue 7:00 a.m. - 7:10 a.m.
Opening remarks (Introduction)
Tue 7:10 a.m. - 7:37 a.m.
(Invited Talk)   

Many of today’s most promising technological systems involve very large numbers of autonomous agents that influence each other and make strategic decisions within a network structure. Examples include opinion dynamics, targeted marketing in social networks, economic exchange and international trade, product adoption and social contagion.

While traditional tools for the analysis of these systems assumed that a social planner has full knowledge of the network of interactions, when we turn to very large networks two issues emerge. First, collecting data about the exact network of interactions becomes very expensive or not at all possible because of privacy concerns. Second, methods for designing optimal interventions that rely on the exact network structure typically do not scale well with the population size.

To obviate these issues, in this talk I will present a framework in which the social planner designs interventions based on probabilistic instead of exact information about agent’s interactions. I will introduce the tool of “graphon games” as a way to formally describe strategic interactions in this setting and I will illustrate how this tool can be exploited to design interventions. I will cover two main applications: targeted budget allocation and optimal seeding in contagion processes. I will illustrate how the graphon approach leads to interventions that are asymptotically optimal in terms of the population size and can be computed without requiring exact network data.

Francesca Parise
Tue 7:38 a.m. - 8:05 a.m.
(Invited Talk)   

tba

Lillian Ratliff
Tue 8:05 a.m. - 8:30 a.m.
(Invited Talk)   

Across a multitude of domains and applications, machine learning has become widespread as a tool for informing decisions about humans, and for humans. But most tools used in practice focus exclusively on mapping inputs to relevant outputs - and take no account of how humans respond to these outputs. This begs the question: how should we design learning systems when we know they will be used in social settings? The goal of this talk is to initiate discussion regarding this question and the paths we can take towards possible answers. Building on strategic classification as an appropriate first step, I will describe some of our work, both recent and current, that aims to extend strategic classification towards more realistic strategic settings that include more elaborate forms of economic modeling. Finally, I will argue for a broader view of how we can approach learning problems that lie just outside the scope of classic supervised learning.

Nir Rosenfeld
Tue 8:30 a.m. - 9:00 a.m.
Panel 1 (Discussion Panel)
Tue 9:00 a.m. - 9:10 a.m.
(Contributed talk)   

On-line firms deploy suites of software platforms, where each platform is designed to interact with users during a certain activity, such as browsing, chatting, socializing, emailing, driving, etc. The economic and incentive structure of this exchange, as well as its algorithmic nature, have not been explored to our knowledge. We model this interaction as a Stackelberg game between a Designer and one or more Agents. We model an Agent as a Markov chain whose states are activities; we assume that the Agent’s utility is a linear function of the steady-state distribution of this chain. The Designer may design a platform for each of these activities/states; if a platform is adopted by the Agent, the transition probabilities of the Markov chain are affected, and so is the objective of the Agent. The Designer’s utility is a linear function of the steady state probabilities of the accessible states (that is, the ones for which the platform has been adopted), minus the development cost of the platforms. The underlying optimization problem of the Agent — that is, how to choose the states for which to adopt the platform — is an MDP. If this MDP has a simple yet plausible structure (the transition probabilities from one state to another only depend on the target state and the recurrent probability of the current state) the Agent’s problem can be solved by a greedy algorithm. The Designer’s optimization problem (designing a custom suite for the Agent so as to optimize, through the Agent’s optimum reaction, the Designer’s revenue), is in general NP-hard to approximate within any finite ratio; however, in the special case, while still NP-hard, has an FPTAS. These results generalize, under mild additional assumptions, from a single Agent to a distribution of Agents with finite support, as well as to the setting where other Designers have already created platforms, and the Designer must find the best response to the strategies of the other Designers. We discuss other implications of our results and directions of future research.

Kiran Vodrahalli
Tue 9:10 a.m. - 9:20 a.m.
(Contributed talk)   

Strategic classification, i.e. classification under possible strategic manipulations of features, has received a lot of attention from both the machine learning and the game theory community. Most works focus on analysing the optimal decision rule under such manipulations. In our work we take a learning theoretic perspective, focusing on the sample complexity needed to learn a good decision rule which is robust to strategic manipulation. We perform this analysis under a strategic manipulation loss that takes into account both the accuracy of the final decision and the vulnerability to manipulation. We analyse the sample complexity for a known graph of possible manipulations in terms of the complexity of the function class and the manipulation graph. Additionally, we address the problem of unknown manipulation capabilities of the involved agents. Using techniques from transfer learning theory, we define a similarity measure for manipulation graphs and show that learning outcomes are robust with respect to small changes in the manipulation graph. Lastly we analyse the (sample complexity of) learning of the manipulation capability of agents with respect to this similarity measures, providing a way to learn strategic classification with respect to an unknown manipulation graph.

Tosca Lechner
Tue 9:20 a.m. - 9:30 a.m.
(Contributed talk)   

The use of algorithmic decision making systems in domains which impact the financial, social, and political well-being of people has created a demand for these decision making systems to be ``fair'' under some accepted notion of equity. This demand has in turn inspired a large body of work focused on the development of fair learning algorithms which are then used in lieu of their conventional counterparts. Most analysis of such fair algorithms proceeds from the assumption that the people affected by the algorithmic decisions are represented as immutable feature vectors. However, strategic agents may possess both the ability and the incentive to manipulate this observed feature vector in order to attain a more favorable outcome. We explore the impact that strategic agent behavior could have on fair classifiers and derive conditions under which this behavior leads to fair classifiers becoming less fair than their conventional counterparts under the same measure of fairness that the fair classifier takes into account. These conditions are related to the the way in which the fair classifier remedies unfairness on the original unmanipulated data: fair classifiers which remedy unfairness by becoming more selective than their conventional counterparts are the ones that become less fair than their counterparts when agents are strategic. We further demonstrate that both the increased selectiveness of the fair classifier, and consequently the loss of fairness, arises when performing fair learning on domains in which the advantaged group is overrepresented in the region near (and on the beneficial side of) the decision boundary of conventional classifiers. Finally, we observe experimentally, using several datasets and learning methods, that this \emph{fairness reversal} is common, and that our theoretical characterization of the fairness reversal conditions indeed holds in most such cases.

Andrew Estornell
Tue 9:30 a.m. - 9:40 a.m.
(Contributed talk)   

Due to the Covid-19 pandemic, more than 500 US-based colleges and universities went ``test-optional'' for admissions and promised that they would not penalize applicants for not submitting test scores, part of a longer trend to rethink the role of testing in college admissions. However, it remains unclear how (and whether) a college can simultaneously use test scores for those who submit them, while not penalizing those who do not--and what that promise even means. We formalize these questions, and study how a college can overcome two challenges with optional testing: strategic applicants (when those with low test scores can pretend to not have taken the test), and informational gaps (it has more information on those who submit a test score than those who do not). We find that colleges can indeed do so, if and only if they are able to use information on who has test access and are willing to randomize admissions.

Zhi Liu
Tue 9:40 a.m. - 10:40 a.m.
Poster Session (Poster Session, social & break)  link »
Tue 10:45 a.m. - 11:12 a.m.
(Invited Talk)   

When theorizing the causal effects that algorithmic decisions have on a population, an important modeling choice arises. We can model the change to a population in the aggregate, or we can model the response to a decision rule at the individual level. Standard economic microfoundations, for instance, ground the response in the utility-maximizing behavior of individuals.

Providing context from sociological and economic theory, I will argue why this methodological problem is of significant importance to machine learning. I will focus on the relationships and differences between two recent lines of work, called strategic classification and performative prediction. While performative prediction takes a macro-level perspective on distribution shifts induced by algorithmic predictions, strategic classification builds on standard economic microfoundations. Based on work with Meena Jagadeesan and Celestine Mendler-Dünner, I will discuss the serious shortcomings of standard microfoundations in the context of machine learning and speculate about the alternatives that we have.

Moritz Hardt
Tue 11:12 a.m. - 11:40 a.m.
(Invited Talk)   

Data-based decision-making must account for the manipulation of data by agents who are aware of how decisions are being made and want to affect their allocations. We study a framework in which, due to such manipulation, data becomes less informative when decisions depend more strongly on data. We formalize why and how a decisionmaker should commit to underutilizing data. Doing so attenuates information loss and thereby improves allocation accuracy.

Navin Kartik
Tue 11:40 a.m. - 12:10 p.m.
(Invited Talk)   

Strategic classification concerns the problem of training a classifier that will ultimately observe data generated according to strategic agents’ responses. The commonly adopted setting is that the agents are fully rational and can best respond to a classifier, and the classifier is aiming to maximize its robustness to the strategic “manipulations”. This talk revisits a couple of dynamics concepts in the above formulation. The first question we try to revisit is: are all changes considered undesirable? We observe that in many application settings, changes in agents’ profile X can lead to true improvement in their target variable Y [1,2]. This observation requires us to revisit the objective function of the learner, and study the possibility of inducing an improved population from the agents. The second question we revisit is: do agents respond rationally? Inspired by evolutionary game theory, we introduce a dynamical agent response model using replicator dynamics to model agents’ potentially non-fully rational responses to a sequence of classifiers [3]. We characterize the dynamics of this model and offer observations of its fairness implication in such a long-term dynamical environment.

References:

[1] Linear Classifiers that Encourage Constructive Adaptation, Yatong Chen, Jialu Wang and Yang Liu, 2021.

[2] Induced Domain Adaptation, Yang Liu, Yatong Chen, Jiaheng Wei, 2021.

[3] Unintended Selection: Persistent Qualification Rate Disparities and Interventions, Reilly Raab and Yang Liu, Neural Information Processing Systems (NeurIPS), 2021

Yang Liu
Tue 12:10 p.m. - 12:40 p.m.
Panel 2 (Discussion Panel)
Tue 12:40 p.m. - 1:08 p.m.
(Invited Talk)   

As algorithms are increasingly applied to screen applicants for high-stakes decisions in employment, education, lending, and other domains, concerns have been raised about the effects of "algorithmic monoculture", in which many decision-makers all rely on the same algorithm. This concern invokes analogies to agriculture, where a monocultural system runs the risk of severe harm from unexpected shocks. We present a set of basic models characterizing the potential risks from algorithmic monoculture, showing that monocultural convergence on a single algorithm by a group of decision-making agents, even when the algorithm is more accurate for any one agent in isolation, can reduce the overall quality of the decisions being made by the full collection of agents. Our results rely on minimal assumptions, and involve a combination of game-theoretic arguments about competing decision-makers with the development of a probabilistic framework for analyzing systems that use multiple noisy estimates of a set of alternatives. The talk is based on joint work with Manish Raghavan.

Jon Kleinberg
Tue 1:08 p.m. - 1:31 p.m.
(Invited Talk)   

Machine Learning algorithms often prompt individuals to strategically modify their observable attributes to receive more favorable predictions. As a result, the distribution the predictive model is trained on may differ from the one it operates on in deployment. While such distribution shifts, in general, hinder accurate predictions, our work identifies a unique opportunity associated with shifts due to strategic responses. We show that we can use strategic responses effectively to recover causal relationships between the observable features and outcomes we wish to predict. More specifically, we study a game-theoretic model in which a principal deploys a sequence of models to predict an outcome of interest (e.g., college GPA) for a sequence of strategic agents (e.g., college applicants). In response, strategic agents invest efforts and modify their features for better predictions. In such settings, unobserved confounding variables can influence both an agent's observable features (e.g., high school records) and outcomes. Therefore, standard regression methods generally produce biased estimators. To address this issue, our work establishes a novel connection between strategic responses to machine learning models and instrumental variable (IV) regression, by observing that the sequence of deployed models can be viewed as an instrument that affects agents' observable features but does not directly influence their outcomes. Therefore, two-stage least squares (2SLS) regression can recover the causal relationships between observable features and outcomes. Beyond causal recovery, we can build on our 2SLS method to address two additional relevant optimization objectives: agent outcome maximization and predictive risk minimization.

This work is joint with Keegan Harris, Daniel Ngo, Logan Stapleton, and Hoda Heidari.

Steven Wu
Tue 1:31 p.m. - 2:00 p.m.
(Invited Talk)   

Across multiple industries, new online platforms are interjecting themselves as digital intermediaries in previously direct business-to-consumer transactions. A reasonable concern is that these platforms, once they become dominant, can leverage their unique position to extract surplus from both sides participating in the transaction and lead to different welfare outcomes than platform participants expected (and experienced) when they first joined the platform. We study the effects that OpenTable (an online restaurant reservation platform) had on restaurants’ prices and their likelihood of survival in NYC, during a period the platform expanded to cover most restaurants in the city. We develop an analytical model to understand restaurants’ adoption decision, and the effect of adoption on prices and consumer surplus. The model shows how the platform can induce a prisoner’s dilemma where restaurants have incentives to join the platform to poach customers from competitors or to protect its clientele from competitors. However, once all restaurants join, none of them will attract additional customers, and the costs of the platform will be passed down to diners through price increases. As the popularity of the platform grows, the platform can charge a higher fee to restaurants until extracting all the benefits it creates. To test the predictions of the model, we create a dataset containing prices, survival, and OpenTable participation for over 5,000 restaurants in NYC between 2005 and 2016. Our analysis suggests that as the platform became prevalent, the costs of the platform were passed down to consumers through price and restaurants saw no benefits in terms of survival.

Cristobal Cheyre
Tue 2:00 p.m. - 2:30 p.m.
Panel 3 (Discussion Panel)
-
(Poster) [ Visit Poster at Spot A0 in Virtual World ]
Winner-take-all competitions in forecasting and machine-learning suffer from distorted incentives.Witkowski et al. identified this problem and proposed ELF, a truthful mechanism to select a winner.We show that, from a pool of $n$ forecasters, ELF requires $\Theta(n\log n)$ events or test data points to select a near-optimal forecaster with high probability.We then show that standard online learning algorithms select an $\epsilon$-optimal forecaster using only $O(\log(n) / \epsilon^2)$ events, by way of a strong approximate-truthfulness guarantee.This bound matches the best possible even in the nonstrategic setting.We then apply these mechanisms to obtain the first no-regret guarantee for non-myopic strategic experts.
Anish Thilagar · Rafael Frongillo · Bo Waggoner · Robert Gomez
-
(Poster) [ Visit Poster at Spot D0 in Virtual World ]

When reasoning about strategic behavior in a machine learning context it is tempting to combine standard microfoundations of rational agents with the statistical decision theory underlying classification. In this work, we argue that a direct combination of these standard ingredients leads to brittle solution concepts of limited descriptive and prescriptive value. First, we show that rational agents with perfect information produce discontinuities in the aggregate response to a decision rule that we often do not observe empirically. Second, when any positive fraction of agents is not perfectly strategic, desirable stable points---where the classifier is optimal for the data it entails---cease to exist. Third, optimal decision rules under standard microfoundations maximize a measure of negative externality known as social burden within a broad class of possible assumptions about agent behavior.Recognizing these limitations we explore alternatives to standard microfoundations for binary classification. To navigate the space of possible assumptions about how agents respond to a decision rule we specify a set of desiderata our model should satisfy and that help mitigate the limitations of the standard model. We propose noisy response as a promising candidate model. Inspired by smoothed analysis and empirical observations, noisy response incorporates natural imperfection in the agent responses. This model retains analytical tractability, leads to more robust insights about stable points, and imposes a lower social burden at optimality.

Meena Jagadeesan · Celestine Mendler-Dünner · Moritz Hardt
-
(Poster) [ Visit Poster at Spot A1 in Virtual World ]

We provide efficient methods for learning strategic agents' underlying bid and value distributions by observing only the outcomes of their repeated interaction in a variety of standard auction models. In particular, given a finite set of observations---each only comprising the identity of the winner and the price they paid---in a sequence of auctions involving the same set of ex ante asymmetric bidders with independent private values, we provide algorithms for non-parametrically estimating the bid distribution of each bidder to within Wasserstein, Kolmogorov, or total variation distance on the effective support of these distributions. We provide convergence bounds for the attained distance in terms of the number of observations, number of bidders, and other relevant parameters of the problem, which are uniform in that they do not depend on the bid distributions being estimated. For first-price auctions (where bid distributions and equilibrium value distributions do not coincide) we also show provide finite-sample estimation results for agents' value distributions at Bayes-Nash equilibrium. Our estimation guarantees advance a body of work at the intersection of machine learning and econometrics with partial sample observability wherein only identification results have been previously obtained in our setting.

Yeshwanth Cherapanamjeri · Constantinos Daskalakis · Andrew Ilyas · Emmanouil Zampetakis
-
(Poster) [ Visit Poster at Spot A2 in Virtual World ]
We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Peng, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac 1T\right)$.
Constantinos Daskalakis · Maxwell Fishelson · Noah Golowich
-
(Poster) [ Visit Poster at Spot A3 in Virtual World ]

When a robot shares the same workspace with other intelligent agents (e.g., other robots or humans), the robot must be able to reason the behaviors of its neighboring agents while accomplishing its designated task. We assume other agents do not explicitly share their decisions or reasoning processes with the robot. Thus the robot needs to reason itself for the consequences of all other agents, the process of which demands prohibitive computational resources. We observe that in practice, oftentimes predicting the absolutely optimal agent behaviors is not necessary. Instead, it is more desirable that the solution can be computed with a close-to-optimal (sub-optimal) but affordable (tractable) computation time. We propose to incorporate the concept of bounded rationality (from an information-theoretic viewpoint) into the general-sum stochastic game setting. Specifically, the robot computes a policy under the information processing constraint, represented as KL-divergence between the default and optimized stochastic policies. The solution to the bounded-optimal policy can be obtained by an importance sampling approach. We show pilot results that this framework allows the robot to (1) effectively trade-off between solution optimality and limited computational time; (2) efficiently learn the sub-optimal behaviors of other agents.

Junhong Xu · Kai Yin · lantao
-
(Poster) [ Visit Poster at Spot A4 in Virtual World ]

We study the effects of information discrepancy across sub-populations on their ability to simultaneously improve their features in strategic learning settings. Specifically, we consider a game where a principal deploys a decision rule in an attempt to optimize the whole population's welfare, and agents strategically adapt to it to receive better scores. Inspired by real-life settings, such as loan approvals and college admissions, we remove the typical assumption made in the strategic learning literature that the decision rule is fully known to the agents, and focus on settings where it is inaccessible. In their lack of knowledge, individuals try to infer this rule by learning from their peers (e.g., friends and acquaintances who previously applied for a loan), naturally forming groups in the population, each with possibly different type and level of information about the decision rule. In our equilibrium analysis, we show that the principal's decision rule optimizing the welfare across subgroups} may cause a surprising negative externality; the true quality of some of the subgroups can actually deteriorate. On the positive side, we show that in many natural cases, optimal improvement is guaranteed simultaneously for all subgroups in equilibrium. We also characterize the disparity in improvements across subgroups via a measure of their informational overlap. Finally, we complement our theoretical analysis with experiments on real-world datasets.

Yahav Bechavod · Chara Podimata · Steven Wu · Juba Ziani
-
(Poster) [ Visit Poster at Spot A5 in Virtual World ]
In cooperative bandits, a framework that captures essential features of collective sequential decision making, agents can minimize group regret, and thereby improve performance, by leveraging shared information. However, sharing information can be costly, which motivates developing policies that minimize group regret while also reducing the number of messages communicated by agents. Existing cooperative bandit algorithms obtain optimal performance when agents share information with their neighbors at \textit{every time step}, i.e., full communication. This requires $\Theta(T)$ number of messages, where $T$ is the time horizon of the decision making process. We propose \textit{ComEx}, a novel cost-effective communication protocol in which the group achieves the same order of performance as full communication while communicating only $O(\log T)$ number of messages. Our key step is developing a method to identify and only communicate the information crucial to achieving optimal performance. Further we propose novel algorithms for several benchmark cooperative bandit frameworks and show that our algorithms obtain \textit{state-of-the-art} performance while consistently incurring a significantly smaller communication cost than existing algorithms.
Udari Madhushani · Naomi Leonard
-
(Poster) [ Visit Poster at Spot A6 in Virtual World ]

We consider an online regression setting in which individuals adapt to the regression model: arriving individuals are aware of the current model, and invest strategically in modifying their own features so as to improve the predicted score that the current model assigns to them. Such feature manipulation has been observed in various scenarios---from credit assessment to school admissions---posing a challenge for the learner. Surprisingly, we find that such strategic manipulations may in fact help the learner recover the meaningful variables---that is, the features that, when changed, affect the true label (as opposed to non-meaningful features that have no effect). We show that even simple behavior on the learner's part allows her to simultaneously i) accurately recover the meaningful features, and ii) incentivize agents to invest in these meaningful features, providing incentives for improvement.

Yahav Bechavod · Katrina Ligett · Steven Wu · Juba Ziani
-
(Poster) [ Visit Poster at Spot B0 in Virtual World ]

As machine learning is applied more to real-world problems like robotics, control of autonomous vehicles, drones, and recommendation systems, it becomes essential to consider the notion of agency where multiple agents with local observations start impacting each other and interact to achieve their goals. Multi-agent reinforcement learning (MARL) is concerned with developing learning algorithms that can discover effective policies in multi-agent environments. In this work, we develop algorithms for addressing two critical challenges in MARL - non-stationarity and robustness. We show that naive independent reinforcement learning does not preserve the strategic game-theoretic interaction between the agents, and we present a way to realize the classical infinite order recursion reasoning in a reinforcement learning setting. We refer to this framework as Interactive Policy Optimization (IPO) and derive four MARL algorithms using centralized-training-decentralized-execution that generalize the widely used single-agent policy gradient methods to multi-agent settings. Finally, we provide a method to estimate opponent's parameters in adversarial settings using maximum likelihood and integrate IPO with an adversarial learning framework to train agents robust to destabilizing disturbances from the environment/adversaries and for better sim2real transfer from simulated multi-agent environments to the real world.

Videh Nema · Balaraman Ravindran
-
(Poster) [ Visit Poster at Spot B1 in Virtual World ]

We present a novel framework to learn functions that estimate decisions of sellers and buyers simultaneously in a oligopoly market for a price-sensitive product. In this setting, the aim of the seller network is to come up with a price for a given context such that the expected revenue is maximised by considering the buyer's satisfaction as well. On the other hand, the aim of the buyer network is to assign probability of purchase to the offered price to mimic the real world buyers' responses while also showing price sensitivity through its action. In other words, rejecting the unnecessarily high priced products. Similar to generative adversarial networks, this framework corresponds to a minimax two-player game. In our experiments with simulated and real-world transaction data, we compared our framework with the baseline model and demonstrated its potential through proposed evaluation metrics.

naman shukla
-
(Poster) [ Visit Poster at Spot B2 in Virtual World ]

This paper studies the problem of clustering through the lens of utility and welfare. In particular, we ask, how much does the quality of the clustering --- typically measured by the conductance, or by the number of edges cut, or the average distance to the centers --- deteriorate if the nodes are strategic and can change clusters? And among reasonable utilities for the nodes, which one hurts quality the least? We investigate these questions both theoretically, by studying the equilibria of hedonic games (simplified clustering games with unconstrained number of clusters), and experimentally, by measuring the quality of pure Nash equilibria of more realistic clustering games. We introduce a new utility function for the nodes which we call closeness, and which we believe is an attractive alternative to previously studied node utilities. We study the properties of the closeness utility theoretically and demonstrate experimentally its advantages over other established utilities such as the modified fractional utility. Finally, we present a polynomial-time algorithm which, given a clustering with optimal quality, finds another clustering with better average utility, and in fact the one that maximizes the ratio of the gain in average utility over the loss in quality. This submission and the topics covered are particularly related to the theme of the Strategic ML workshop, especially to welfare-aware machine learning and modeling strategic behavior. In the case of clustering, this work aims to cover gaps between the classical ML algorithms for clustering that optimize certain connectivity metrics (such as conductance) and incentives that people have in creating or changing groups. With an increasing demand for algorithmic tools in identifying and facilitating groups, understanding the incentives people may have in group formation become important in order to ensure retention, stability, and success of groups. This work aims to open avenues of research for investigating trade-offs between robustness and welfare in designing algorithms for clustering.

Ana-Andreea Stoica · Christos Papadimitriou
-
(Poster) [ Visit Poster at Spot B3 in Virtual World ]
Federated learning is highly studied as an approach to collaboration across large populations, however, little has been explored in how to coordinate resources to maintain such data collaborations over time. Inspired by game theoretic notions, this paper introduces a framework for incentive-aware learning and data sharing in federated learning. Our notions of stable and envy-free equilibria formalize notions of fairness and stability for self-interested agents. For example, in an envy-free equilibrium, no agent would wish to swap their sampling burden with any other agent and in a stable equilibrium, no agent would wish to unilaterally reduce their sampling burden. In addition to formalizing this framework, our contributions include characterizing the structural properties of such equilibria, proving when they exist, and showing how they can be computed. Furthermore, we show that, in the worst case, a $\Omega(\sqrt{k})$ gap exists between the sample-minimizing and equilibrium solutions across realistic learning models.
Richard Phillips · Han Shao · Avrim Blum · Nika Haghtalab
-
(Poster) [ Visit Poster at Spot B4 in Virtual World ]

We consider the two-sided matching market with bandit learners. In a matching market, there are two sets of agents---users and providers---that seek to be matched one-to-one. Contrary to a core assumption in the literature, users and providers often do not know their true matching preferences a priori and must learn them. To address this assumption, recent works propose to blend the matching and multi-armed bandit problems. One recent work finds that it is possible to assign matchings that are stable (i.e., no pair of agents is incentivized to defect) at every time step while also allowing agents to learn enough so that the system converges to matchings that are stable under the agents' true preferences. However, the authors also uncover an impossibility result: under their setup, one cannot simultaneously guarantee stability and low optimal regret. In this work, we study the two-sided matching market with bandit learners under costs and transfers in order to faithfully model competition between agents. We show that, in contrast to previous work, it is possible to simultaneously guarantee four desiderata: (1) stability, (2) low optimal regret, (3) fairness in the distribution of regret, and (4) high social welfare.

Sarah Cen · Devavrat Shah
-
(Poster) [ Visit Poster at Spot B5 in Virtual World ]

In this work, we consider classification of agents who can both game and improve. For example, people wishing to get a loan may be able to take some actions that increase their perceived credit-worthiness and others that also increase their true credit-worthiness. A decision-maker would like to define a classification rule with few false-positives (does not give out many bad loans) while yielding many true positives (giving out many good loans), which includes encouraging agents to improve to become true positives if possible. We consider two models for this problem, a general discrete model and a linear model, and prove algorithmic, learning, and hardness results for each.For the general discrete model, we give an efficient algorithm for the problem of maximizing the number of true positives subject to no false positives, and show how to extend this to a partial-information learning setting. We also show hardness for the problem of maximizing the number of true positives subject to a nonzero bound on the number of false positives, and that this hardness holds even for a finite-point version of our linear model. We also show that maximizing the number of true positives subject to no false positive is NP-hard in our full linear model. We additionally provide an algorithm that determines whether there exists a linear classifier that classifies all agents accurately and causes all improvable agents to become qualified, and give additional results for low-dimensional data.

Saba Ahmadi · Hedyeh Beyhaghi · Avrim Blum · Keziah Naggita
-
(Poster) [ Visit Poster at Spot B6 in Virtual World ]
The classical Perceptron algorithm provides a simple and elegant procedure for learning a linear classifier. In each step, the algorithm observes the sample’s position and label and updates the current predictor accordingly if it makes a mistake. However, in presence of strategic agents that desire to be classified as positive and that are able to modify their position by a limited amount, the classifier may not be able to observe the true position of agents but rather a position where the agent pretends to be. Unlike the original setting with perfect knowledge of positions, in this situation the Perceptron algorithm fails to achieve its guarantees, and we illustrate examples with the predictor oscillating between two solutions forever, making an unbounded number of mistakes even though a perfect large-margin linear classifier exists. Our main contribution is providing a modified Perceptron-style algorithm which makes a bounded number of mistakes in presence of strategic agents with both $\ell_2$ and weighted $\ell_1$ manipulation costs. In our baseline model, knowledge of the manipulation costs (i.e., the extent to which an agent may manipulate) is assumed. In our most general model, we relax this assumption and provide an algorithm which learns and refines both the classifier and its cost estimates to achieve good mistake bounds even when manipulation costs are unknown.
Saba Ahmadi · Hedyeh Beyhaghi · Avrim Blum · Keziah Naggita
-
(Poster) [ Visit Poster at Spot C0 in Virtual World ]

Waiting lists allocate items by offering agents a choice among items with associated waiting times. These waiting times serve as prices that are formed by the stochastic arrivals and departures of agents. We study the allocative efficiency under such dynamically adjusting prices by drawing a connection between this price adjustment process and the stochastic gradient descent optimization algorithm. We show that the loss due to price fluctuations is bounded by the granularity of price changes. Additional conditions allow us to identify markets where the loss is close to the bound or exponentially small. Our results show that a simple price adjustment heuristic can perform well, but may be slow to adjust to changes in demand because of a trade-off between the speed of adaptation and loss from price fluctuations.

Itai Ashlagi · Jacob Leshno · Amin Saberi
-
(Poster) [ Visit Poster at Spot C1 in Virtual World ]

The use of algorithmic decision making systems in domains which impact the financial, social, and political well-being of people has created a demand for these decision making systems to be ``fair'' under some accepted notion of equity. This demand has in turn inspired a large body of work focused on the development of fair learning algorithms which are then used in lieu of their conventional counterparts. Most analysis of such fair algorithms proceeds from the assumption that the people affected by the algorithmic decisions are represented as immutable feature vectors. However, strategic agents may possess both the ability and the incentive to manipulate this observed feature vector in order to attain a more favorable outcome. We explore the impact that strategic agent behavior could have on fair classifiers and derive conditions under which this behavior leads to fair classifiers becoming less fair than their conventional counterparts under the same measure of fairness that the fair classifier takes into account. These conditions are related to the the way in which the fair classifier remedies unfairness on the original unmanipulated data: fair classifiers which remedy unfairness by becoming more selective than their conventional counterparts are the ones that become less fair than their counterparts when agents are strategic. We further demonstrate that both the increased selectiveness of the fair classifier, and consequently the loss of fairness, arises when performing fair learning on domains in which the advantaged group is overrepresented in the region near (and on the beneficial side of) the decision boundary of conventional classifiers. Finally, we observe experimentally, using several datasets and learning methods, that this \emph{fairness reversal} is common, and thatour theoretical characterization of the fairness reversal conditions indeed holds in most such cases.

Andrew Estornell · Yang Liu · Yevgeniy Vorobeychik
-
(Poster) [ Visit Poster at Spot C2 in Virtual World ]

Strategic classification, i.e. classification under possible strategic manipulations of features, has received a lot of attention from both the machine learning and the game theory community. Most works focus on analysing the optimal decision rule under such manipulations. In our work we take a learning theoretic perspective, focusing on the sample complexity needed to learn a good decision rule which is robust to strategic manipulation. We perform this analysis under a strategic manipulation loss that takes into account both the accuracy of the final decision and the vulnerability to manipulation. We analyse the sample complexity for a known graph of possible manipulations in terms of the complexity of the function class and the manipulation graph. Additionally, we address the problem of unknown manipulation capabilities of the involved agents. Using techniques from transfer learning theory, we define a similarity measure for manipulation graphs and show that learning outcomes are robust with respect to small changes in the manipulation graph. Lastly we analyse the (sample complexity of) learning of the manipulation capability of agents with respect to this similarity measures, providing a way to learn strategic classification with respect to an unknown manipulation graph.

Tosca Lechner · Ruth Urner
-
(Poster) [ Visit Poster at Spot C3 in Virtual World ]

Cooperation in settings where agents have both common and conflicting interests (mixed-motive environments) has recently received considerable attention in multi-agent learning. However, the mixed-motive environments typically studied have a single cooperative outcome on which all agents can agree. Many real-world multi-agent environments are instead bargaining problems (BPs): they have several Pareto-optimal payoff profiles over which agents have conflicting preferences. We argue that typical cooperation-inducing learning algorithms fail to cooperate in BPs when there is room for \textit{normative disagreement} resulting in the existence of multiple competing cooperative equilibria, and illustrate this problem empirically. To remedy the issue, we introduce the notion of \textit{norm-adaptive} policies. Norm-adaptive policies are capable of behaving according to different norms in different circumstances, creating opportunities for resolving normative disagreement. We develop a class of norm-adaptive policies and show in experiments that these significantly increase cooperation. However, norm-adaptiveness cannot address residual bargaining failure arising from a fundamental tradeoff between exploitability and cooperative robustness.

Julian Stastny · Maxime Riché · Alexander Lyzhov · Johannes Treutlein · Allan Dafoe · Jesse Clifton
-
(Poster) [ Visit Poster at Spot C4 in Virtual World ]

Machine learning models are often deployed in dynamic systems that involve data collection and model updates. In consumer finance, for example, lenders deploy a model to automate lending decisions, collect data from applicants, and then use it to update the model for future applicants. In such systems, data collection suffers from \emph{selective labeling}, as we only observe outcomes for applicants who are approved. When these systems are initialized with models deny loans to a specific group, they exhibit \emph{censoring} whereby they deny loans to a group of consumers in perpetuity.In this work, we identify conditions when machine learning systems exhibit censoring. We study the ability to resolve censoring by providing \emph{recourse} -- i.e., by providing applicants who are denied with actions that guarantee approval. We develop a method to learn linear classifiers with recourse guarantees, which allows model owners to update their models while providing prior applicants with a guarantee of approval. We benchmark our method and other strategies for exploration in their ability to resolve censoring. Our results illustrate the costs of each strategy on key stakeholders, provide insight into failure modes that lead to censoring, and highlight the importance of the feasibility of recourse.

Jennifer Chien · Berk Ustun · Margaret Roberts
-
(Poster) [ Visit Poster at Spot C5 in Virtual World ]

We study offline multi-agent reinforcement learning (RL), which aims to learn an optimal policy based on historical data for multiple agents. Unlike online RL, accidental overestimation errors arising from function approximation can cumulatively arise and affect future iterations in the offline RL. In this paper, we extend the pessimistic value iteration algorithm to the multi-agent setting: after obtaining the lower bound of the value function of each agent, we compute the optimistic policy by solving a general-sum matrix game with these lower bounds as the payoff matrices. Instead of finding the Nash Equilibrium of such a general-sum game, our algorithm solves for a Coarse Correlated Equilibrium (CCE) for the sake of computational efficiency. In the end, we prove an information-theoretic lower bound of the regret for any algorithm on an offline dataset in the two-agent zero-sum game. To the best of our knowledge, such a CCE scheme for pessimism-based algorithm has not appeared in the literature and can be of interest in its own right. We hope that our work can shed light on the future analysis of the equilibrium point of a multiagent Markov decision process given an offline dataset.

Yihang Chen
-
(Poster) [ Visit Poster at Spot C6 in Virtual World ]

Agents operating in real-world settings are often faced with the need to adapt to unexpected changes in their environment. Recent advances in multi-agent reinforcement learning (MARL) provide a variety of tools to support the ability of RL agents to deal with the dynamic nature of their environment, which may often be increased by the presence of other agents. In this work, we measure the resilience of a group of agents as the group’s ability to adapt to unexpected perturbations in the environment. To promote resilience, we suggest facilitating collaboration within the group, and offer a novel confusion-based communication protocol that requires an agent to broadcast its local observations that are least aligned with its previous experience. We present empirical evaluation of our approach on a set of simulated multi-taxi settings.

Ofir Abu · Sarah Keren · Matthias Gerstgrasser · Jeffrey S Rosenschein
-
(Poster) [ Visit Poster at Spot C0 in Virtual World ]

We study the game redesign problem in which an external designer has the ability to change the payoff function in each round, but incurs a design cost for deviating from the original game. The players apply no-regret learning algorithms to repeatedly play the changed games with limited feedback. The goals of the designer are to (i) incentivize all players to take a specific target action profile frequently; and (ii) incur small cumulative design cost. We present game redesign algorithms with the guarantee that the target action profile is played in T−o(T) rounds while incurring only o(T) cumulative design cost. Game redesign describes both positive and negative applications: a benevolent designer who incentivizes players to take a target action profile with better social welfare compared to the solution of the original game, or a malicious attacker whose target action profile benefits themselves but not the players. Simulations on four classic games illustrate our proposed redesign algorithms.

Yuzhe Ma · Jerry Zhu
-
(Poster) [ Visit Poster at Spot D1 in Virtual World ]

We study a new game of price competition amongst firms selling identical goods, whose defining property is that the revenue of a firm at a price is independent of any competing prices that are strictly lower. This game is motivated by a variety of customer choice models that induce this property, prominently those stemming from the well-known satisficing heuristic for decision-making. Under a mild condition, we show that every pure-strategy local Nash equilibrium in this game corresponds to a set of prices generated by the firms sequentially setting local best-response prices in a fixed order. In other words, despite being simultaneous-move games, equilibria have a sequential-move equilibrium structure. We thus call these games pseudo-competitive games. We find that these games are often plagued by strictly-local Nash equilibria, in which the price of a firm is only a local best-response to the competitor's price, when a globally optimal response with a potentially unboundedly higher payoff is available. We show that price dynamics resulting from the firms utilizing gradient-based pricing algorithms may often converge to such undesirable outcomes. We propose a new online learning approach to address this concern under certain regularity assumptions on the revenue curves.

Chamsi Hssaine · Vijay Kamble · Sid Banerjee
-
(Poster) [ Visit Poster at Spot D0 in Virtual World ]

A growing number of machine learning architectures, such as Generative Adversarial Networks, rely on the design of games which implement a desired functionality via a Nash equilibrium. In practice these games have an implicit complexity (e.g.~from underlying datasets and the deep networks used) that makes directly computing a Nash equilibrium impractical or impossible. For this reason, numerous learning algorithms have been developed with the goal of iteratively converging to a Nash equilibrium. Unfortunately, the dynamics generated by the learning process can be very intricate and instances of training failure hard to interpret. In this paper we show that, in a strong sense, this dynamic complexity is inherent to games. Specifically, we prove that replicator dynamics, the continuous-time analogue of Multiplicative Weights Update, even when applied in a very restricted class of games---known as finite matrix games---is rich enough to be able to approximate arbitrary dynamical systems. Our results are positive in the sense that they show the nearly boundless dynamic modelling capabilities of current machine learning practices, but also negative in implying that these capabilities may come at the cost of interpretability. As a concrete example, we show how replicator dynamics can effectively reproduce the well-known strange attractor of Lonrenz dynamics (the "butterfly effect") while achieving no regret.

Gabriel P Andrade · Rafael Frongillo · Georgios Piliouras
-
(Poster) [ Visit Poster at Spot C6 in Virtual World ]

Due to the Covid-19 pandemic, more than 500 US-based colleges and universities went ``test-optional'' for admissions and promised that they would not penalize applicants for not submitting test scores, part of a longer trend to rethink the role of testing in college admissions. However, it remains unclear how (and whether) a college can simultaneously use test scores for those who submit them, while not penalizing those who do not--and what that promise even means. We formalize these questions, and study how a college can overcome two challenges with optional testing: \textit{strategic applicants} (when those with low test scores can pretend to not have taken the test), and \textit{informational gaps} (it has more information on those who submit a test score than those who do not). We find that colleges can indeed do so, if and only if they are able to use information on who has test access and are willing to randomize admissions.

Zhi Liu · Nikhil Garg
-
(Poster) [ Visit Poster at Spot C5 in Virtual World ]

In many real-world situations, data is distributed across multiple locations and can't be combined for training. For example, multiple hospitals might have comparable data related to patient outcomes. Federated learning is a novel distributed learning approach that allows multiple federating agents to jointly learn a model. While this approach might reduce the error each agent experiences, it also raises questions of fairness: to what extent can the error experienced by one agent be significantly lower than the error experienced by another agent? In this work, we consider two notions of fairness that each may be appropriate in different circumstances: \emph{egalitarian fairness} (which aims to bound how dissimilar error rates can be) and \emph{proportional fairness} (which aims to reward players for contributing more data). For egalitarian fairness, we obtain a tight multiplicative bound on how widely error rates can diverge between agents federating together. For proportional fairness, we show that sub-proportional error (relative to the number of data points contributed) is guaranteed for any individually rational federating coalition.

Kate Donahue · Jon Kleinberg
-
(Poster) [ Visit Poster at Spot C4 in Virtual World ]

In malware behavioral analysis, the list of accessed and created files very often indicates whether the examined file is malicious or benign. However, malware authors are trying to avoid detection by generating random filenames and/or modifying used filenames with new versions of the malware. These changes represent real-world adversarial examples. The goal of this work is to generate realistic adversarial examples and improve the classifier's robustness against these attacks. Our approach learns latent representations of input strings in an unsupervised fashion and uses gradient-based adversarial attack methods in the latent domain to generate adversarial examples in the input domain. We use these examples to improve the classifier's robustness by training on the generated adversarial set of strings. Compared to classifiers trained only on perturbed latent vectors, our approach produces classifiers that are significantly more robust without a large trade-off in standard accuracy.

Marek Galovic · Branislav Bosansky · Viliam Lisy
-
(Poster) [ Visit Poster at Spot C3 in Virtual World ]

In an effort to improve the accuracy of credit lending decisions, many financial intuitions are now using predictions from machine learning models. While such predictions enjoy many advantages, recent research has shown that the predictions have the potential to be biased and unfair towards certain subgroups of the population. To combat this, several techniques have been introduced to help remove the bias and improve the overall fairness of the predictions. We introduce a new fairness technique, called Subgroup Threshold Optimizer (STO), that does not require any alternations to the input training data nor does it require any changes to the underlying machine learning algorithm, and thus can be used with any existing machine learning pipeline. STO works by optimizing the classification thresholds for individual subgroups in order to minimize the overall discrimination score between them. Our experiments on a real-world credit lending dataset show that STO can reduce gender discrimination by over 90%.

Cecilia Ying
-
(Poster) [ Visit Poster at Spot C2 in Virtual World ]

On-line firms deploy suites of software platforms, where each platform is designed to interact with users during a certain activity, such as browsing, chatting, socializing, emailing, driving, etc. The economic and incentive structure of this exchange, as well as its algorithmic nature, have not been explored to our knowledge. We model this interaction as a Stackelberg game between a Designer and one or more Agents. We model an Agent as a Markov chain whose states are activities; we assume that the Agent’s utility is a linear function of the steady-state distribution of this chain. The Designer may design a platform for each of these activities/states; if a platform is adopted by the Agent, the transition probabilities of the Markov chain are affected, and so is the objective of the Agent. The Designer’s utility is a linear function of the steady state probabilities of the accessible states (that is, the ones for which the platform has been adopted), minus the development cost of the platforms. The underlying optimization problem of the Agent — that is, how to choose the states for which to adopt the platform — is an MDP. If this MDP has a simple yet plausible structure (the transition probabilities from one state to another only depend on the target state and the recurrent probability of the current state) the Agent’s problem can be solved by a greedy algorithm. The Designer’s optimization problem (designing a custom suite for the Agent so as to optimize, through the Agent’s optimum reaction, the Designer’s revenue), is in general NP-hard to approximate within any finite ratio; however, in the special case, while still NP-hard, has an FPTAS. These results generalize, under mild additional assumptions, from a single Agent to a distribution of Agents with finite support, as well as to the setting where other Designers have already created platforms, and the Designer must find the best response to the strategies of the other Designers. We discuss other implications of our results and directions of future research.

Christos Papadimitriou · Kiran Vodrahalli · Mihalis Yannakakis
-
(Poster) [ Visit Poster at Spot C1 in Virtual World ]

Strategic classification studies the interaction between a classification rule and the strategic agents it governs. Under the assumption that the classifier is known, rational agents respond to it by manipulating their features. However, in many real-life scenarios of high-stake classification (e.g., credit scoring), the classifier is not revealed to the agents, which leads agents to attempt to learn the classifier and game it too. In this paper we generalize the strategic classification model to such scenarios. We define the ``price of opacity'' as the difference in prediction error between opaque and transparent strategy-robust classifiers, characterize it, and give a sufficient condition for this price to be strictly positive, in which case transparency is the recommended policy. Our experiments show how Hardt et al.’s robust classifier is affected by keeping agents in the dark.

Ganesh Ghalme · Vineet Nair · Itay Eilat · Inbal Talgam-Cohen · Nir Rosenfeld
-
(Poster) [ Visit Poster at Spot B6 in Virtual World ]

We construct a model of expert prediction where predictions can influence the state of the world. Under this model, we show through theoretical and numerical results that proper scoring rules can incentivize experts to manipulate the world with their predictions. We also construct a simple class of scoring rules that avoids this problem.

Alan Chan
-
(Poster) [ Visit Poster at Spot A1 in Virtual World ]

Auctions are modeled as Bayesian games with continuous type and action spaces.Computing equilibria in games is computationally hard in general and no exactsolution theory is known for auction games. Recent research aimed at learningpure strategy Bayes Nash equilibria in symmetric auction games. In contrast, weintroduce algorithms computing distributional strategies on a discretized versionof the game via convex online optimization. First, existence results for purestrategy Bayes Nash equilibria are more restricted than those for equilibria indistributional strategies. Second, the expected utility of agents is linear indistributional strategies. We show that if regularized convex online optimizationalgorithms converge in such games, then they converge to a distributional ε-equilibrium of the discretized game. Importantly, we prove that a distributional ε-equilibrium of the discretized game approximates an equilibrium in the continuousgame. In a large number of experiments, we show that the method approximatesthe analytical (pure) Bayes Nash equilibrium arbitrarily closely in a wide varietyof auction games. The method allows for interdependent valuations and differenttypes of utility functions and provides a foundation for broadly applicableequilibrium solvers. For small environments with only a few players and items,the techniques is much faster, and can solve equilibria in symmetric auction gamesin seconds.

Maximilian Fichtl · Matthias Oberlechner · Martin Bichler
-
(Poster) [ Visit Poster at Spot A2 in Virtual World ]

Strategic classification regards the problem of learning in settings where users can strategically modify their features to improve outcomes. This setting applies broadly and has received much recent attention. But despite its practical significance, work in this space has so far been predominantly theoretical. In this paper we present a learning framework for strategic classification that is practical. Our approach directly minimizes the “strategic” empirical risk, achieved by differentiating through the strategic response of users. This provides flexibility that allows us to extend beyond the original problem formulation and towards more realistic learning scenarios. A series of experiments demonstrates the effectiveness of our approach on various learning settings.

Nir Rosenfeld · Sagi Levanon
-
(Poster) [ Visit Poster at Spot A3 in Virtual World ]

We consider repeated first price auctions where each bidder, having a deterministic type, learns to bid using a mean-based learning algorithm. We completely characterize the Nash convergence property of the bidding dynamics in two senses: (1) time-average: the fraction of rounds where bidders play a Nash equilibrium approaches to 1 in the limit; (2) last-iterate: the mixed strategy profile of bidders approaches to a Nash equilibrium in the limit. Specifically, the results depend on the number of bidders with the highest value:-if the number is at least three, the bidding dynamics almost surely converges to a Nash equilibrium of the auction, both in time-average and in last-iterate. -if the number is two, the bidding dynamics almost surely converges to a Nash equilibrium in time-average but not necessarily in last-iterate. -if the number is one, the bidding dynamics may not converge to a Nash equilibrium in time-average nor in last-iterate. Our discovery opens up new possibilities in the study of convergence dynamics of learning algorithms.

Xiaotie Deng · Xinyan Hu · Tao Lin · Weiqiang Zheng
-
(Poster) [ Visit Poster at Spot A4 in Virtual World ]

When faced with (automated) assessment rules, individuals can modify their observable features strategically to obtain better decisions. In many situations, decision-makers deliberately keep the underlying assessment rule secret to avoid gaming. This forces the decision subjects to rely on incomplete information when making strategic feature modifications. We capture such settings as a game of Bayesian persuasion, in which the decision-maker sends a signal, i.e., an action recommendation, to the decision subject to incentivize them to take desirable actions. We formulate the principal's problem of finding the optimal Bayesian incentive-compatible signaling policy as an optimization problem and characterize it via a linear program. Through this characterization, we observe that while finding a BIC strategy can be simplified dramatically, the computational complexity of solving this linear program is closely tied to (1) the relative size of the agent's action space, and (2) the number of features utilized by the underlying decision rule.

Keegan Harris · Valerie Chen · Joon Kim · Ameet S Talwalkar · Hoda Heidari · Steven Wu
-
(Poster) [ Visit Poster at Spot A5 in Virtual World ]

We investigate how effective an attacker can be when it only learns from its victim's actions, without access to the victim's reward. In this work, we are motivated by the scenario where the attacker wants to disrupt real-world RL applications, such as autonomous vehicles or delivery drones, without knowing the victim's precise goals or reward function. We argue that one heuristic approach an attacker can use is to strategically maximize the entropy of the victim's policy. The policy is generally not obfuscated, which implies it may be extracted simply by passively observing the victim. We provide such a strategy in the form of a reward-free exploration algorithm that maximizes the attacker's entropy during the exploration phase, and it then maximizes the victim's empirical entropy during the planning phase. In our experiments, the victim agents are subverted through policy entropy maximization, implying an attacker might not need access to the victim’s reward to succeed. Hence, even if the victim's reward information is protected, reward-free attacks, based only on observing behavior, underscore the need to better understand policy obfuscation when preparing to deploy reinforcement learning in real world applications.

Ted Fujimoto · Timothy Doster · Adam Attarian · Jill Brandenberger · Nathan Hodas
-
(Poster) [ Visit Poster at Spot A6 in Virtual World ]

We study black-box reward poisoning attacks against reinforcement learning (RL), in which an adversary aims to manipulate the rewards to mislead a sequence of RL agents with unknown algorithms to learn a nefarious policy in an environment unknown to the adversary a priori. That is, our attack makes minimum assumptions on the prior knowledge of the adversary: it has no initial knowledge of the environment or the learner, and neither does it observe the learner's internal mechanism except for its performed actions. We design a novel black-box attack, U2, that can provably achieve a near-matching performance to the state-of-the-art white-box attack, demonstrating the feasibility of reward poisoning even in the most challenging black-box setting.

Amin Rakhsha · Xuezhou Zhang · Jerry Zhu · Adish Singla
-
(Poster) [ Visit Poster at Spot B0 in Virtual World ]

Potential games are arguably one of the most important and widely studied classes of normal form games. They define the archetypal setting of multi-agent coordination as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted in the setting of Markov Games? What are the similarities and differences between multi-agent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPG) that generalizes prior attempts at capturing complex stateful multi-agent coordination. Counter-intuitively, insights from normal-form potential games do not carry over as MPGs can consist of settings where state-games can be zero-sum games. In the opposite direction, Markov games where every state-game is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove fast convergence of independent policy gradient to Nash policies by adapting recent gradient dominance property arguments developed for single agent MDPs to multi-agent learning settings.

Stefanos Leonardos · Will Overman · Ioannis Panageas · Georgios Piliouras
-
(Poster) [ Visit Poster at Spot B1 in Virtual World ]

The interplay between exploration and exploitation in competitive multi-agent learning is still far from being well understood. Motivated by this, we study smooth Q-learning, a prototypical learning model that explicitly captures the balance between game rewards and exploration costs. We show that Q-learning always converges to the unique quantal-response equilibrium (QRE), the standard solution concept for games under bounded rationality, in weighted zero-sum polymatrix games with heterogeneous learning agents using positive exploration rates. Complementing recent results about convergence in weighted potential games [15,34], we show that fast convergence of Q-learning in competitive settings obtains regardless of the number of agents and without any need for parameter fine-tuning. As showcased by our experiments in network zero-sum games, these theoretical results provide the necessary guarantees for an algorithmic approach to the currently open problem of equilibrium selection in competitive multi-agent settings.

Stefanos Leonardos · Kelly Spendlove · Georgios Piliouras
-
(Poster) [ Visit Poster at Spot B2 in Virtual World ]
How do you incentivize self-interested agents to $\text{\emph{explore}}$ when they prefer to $\text{\emph{exploit}}$? We consider complex exploration problems, where each agent faces the same (but unknown) MDP. In contrast with traditional formulations of reinforcement learning, agents control the choice of policies, whereas an algorithm can only issue recommendations. However, the algorithm controls the flow of information, and can incentivize the agents to explore via information asymmetry. We design an algorithm which explores all reachable states in the MDP. We achieve provable guarantees similar to those for incentivizing exploration in static, stateless exploration problems studied previously.
Max Simchowitz · Alex Slivkins Slivkins
-
(Poster) [ Visit Poster at Spot B3 in Virtual World ]

Prediction markets are incentive-based mechanisms for eliciting and combining the diffused, private beliefs of traders about a future uncertain event such as a political election. Typically prediction markets maintain point estimates of forecast variables; however, exponential-family prediction markets define a class of cost function-based market-making algorithms that maintain a complete, collective belief distribution over the underlying generative process of the event of interest (e.g. the probability density of the winner's vote-share). In this paper, we focus on concretizing a special case of this abstract framework we call the beta-Bernoulli market. We set up a multi-agent simulation of the market ecosystem to experimentally investigate the interaction of this market-making algorithm with a heterogeneous trading population. We design a Bayesian trader model with explicit characterization of this heterogeneity with respect to two independent attributes: how rich a trader's private information is and how much wealth they initially have at their disposal. We are interested in gauging the interplay of the above attributes with the arrival order of traders, particularly in terms of the net profit accrued by different trader types. Our results strongly suggest that early arrival can dominate both wealth and informativeness as a factor in determining trader compensation under a variety of experimental conditions.

Mithun Chakraborty · Sindhu Kutty
-
(Poster) [ Visit Poster at Spot B4 in Virtual World ]

We consider \emph{incentivized exploration}: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the information asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors.We focus on the \emph{price of incentives}: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected.The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the ``strength of beliefs".

Mark Sellke · Alex Slivkins Slivkins
-
(Poster) [ Visit Poster at Spot B5 in Virtual World ]

This paper studies cooperative data-sharing between competitors vying to predict a consumer's tastes. We design optimal data-sharing schemes both for when firms compete only with each other, and for when they additionally compete with an Amazon---a company with more, better data. We derive conditions under which competitors benefit from coopetition, and under which full data-sharing is optimal.

Ronen Gradwohl · Moshe Tennenholtz

Author Information

Yahav Bechavod (Hebrew University)

Yahav Bechavod is a PhD student at the School of Computer Science and Engineering at the Hebrew University, advised by Prof. Amit Daniely. His research explores foundational questions in the fields of machine learning, algorithmic fairness, and learning in the presence of strategic behavior. He is an Apple Scholar in AI\ML, and a recipient of the Charles Clore Foundation PhD Fellowship and the KLA Award for Research Excellence in PhD. He also holds an MS (Computer Science, Summa Cum Laude) and BS (Mathematics and Computer Science) from the Hebrew University.

Hoda Heidari (Carnegie Mellon University)
Eric Mazumdar (UC Berkeley EECS)
Celestine Mendler-Dünner (Max Planck Institute for Intelligent Systems)
Tijana Zrnic (UC Berkeley)

More from the Same Authors