
 
Workshop
Safe and Robust Control of Uncertain Systems
Ashwin Balakrishna · Brijen Thananjeyan · Daniel Brown · Marek Petrik · Melanie Zeilinger · Sylvia Herbert

Mon Dec 13 08:00 AM -- 04:00 PM (PST)
Event URL: https://sites.google.com/view/safe-robust-control/home

Control and decision systems are becoming a ubiquitous part of our daily lives, ranging from serving advertisements or recommendations on the internet to controlling autonomous physical systems such as industrial equipment or robots. While these systems have shown the potential to significantly improve quality of life and industrial efficiency, the decisions they make can also cause significant damage. For example, an online retailer recommending dangerous products to children, a social media platform serving content that polarizes society, or a household robot or autonomous car that collides with surrounding humans can all cause significant direct harm to society. These undesirable behaviors are not only dangerous but also lead to significant inefficiencies when deploying learning-based agents in the real world. This motivates developing algorithms for learning-based control that can reason about uncertainty and constraints in the environment to explicitly avoid undesirable behaviors. We believe hosting a discussion on safety in learning-based control at NeurIPS 2021 will have far-reaching societal impact by connecting researchers from a variety of disciplines, including machine learning, control theory, AI safety, operations research, robotics, and formal methods.

Mon 8:00 a.m. - 8:15 a.m.
Introduction (Short Talk)
Ashwin Balakrishna
Mon 8:00 a.m. - 8:25 a.m.
Ye Pu (Invited Talk)
Ye Pu
Mon 8:25 a.m. - 8:30 a.m.
Ye Pu (Talk Q/A Session)
Ye Pu
Mon 8:45 a.m. - 9:10 a.m.
Aleksandra Faust (Invited Talk)
Aleksandra Faust
Mon 9:10 a.m. - 9:15 a.m.
Aleksandra Faust (Talk Q/A Session)
Aleksandra Faust
Mon 9:15 a.m. - 9:40 a.m.
Shie Mannor (Invited Talk)
Shie Mannor
Mon 9:40 a.m. - 9:45 a.m.
Shie Mannor (Talk Q/A Session)
Shie Mannor
Mon 9:45 a.m. - 9:50 a.m.

We propose a data-driven control policy design from an offline data set. Contraction theory enables constructing a policy-learning framework that makes the closed-loop system trajectories inherently convergent towards a unique trajectory. At the technical level, identifying the contraction metric, the distance metric with respect to which a robot's trajectories exhibit contraction, is often non-trivial. We propose to jointly learn the control policy and its corresponding contraction metric from offline data. To achieve this, we learn the robot dynamical model from an offline data set consisting of the robot's state and input trajectories. Using this learned dynamical model, we propose a data augmentation algorithm for learning contraction policies. We evaluate the performance of our proposed framework on simulated robotic goal-reaching tasks and demonstrate that enforcing contraction results in faster convergence.

Navid Rezazadeh, Negar Mehr
Mon 9:50 a.m. - 9:55 a.m.

We propose a safety-guaranteed planning and control framework for unmanned surface vessels (USVs), using Gaussian processes (GPs) to learn uncertainties. The uncertainties encountered by USVs, including external disturbances and model mismatches, are potentially state-dependent, time-varying, and hard to capture with constant models. GPs are powerful learning-based tools that can be integrated with a model-based planning and control framework, which employs a Hamilton-Jacobi differential game formulation. Such a combination yields less conservative trajectories and safety-guaranteeing control strategies. We demonstrate the proposed framework in simulations and experiments on a CLEARPATH Heron USV.

Shuhao Zhang, Seth Siriya
Mon 9:55 a.m. - 10:00 a.m.

It is well known that current deep reinforcement learning (RL) agents are particularly vulnerable under adversarial perturbations. Therefore, it is important to develop a vulnerability-aware algorithm that could improve the performance of the RL agent under any attack with bounded budgets. Existing robust training approaches in deep RL either directly use adversarial training, whose attacks are heuristically generated and might be non-optimal, or they need to learn an RL-based strong adversary, which doubles the computational and sample complexity of the training process. In this work, we formalize the notion of the lower bound of the policy value under bounded attacks by a proposed worst-case Bellman operator. By directly estimating and improving the worst-case value of an agent under attack, we develop a robust training method that efficiently improves the robustness of RL policies without learning an adversary. Empirical evaluations show that our algorithm universally achieves state-of-the-art performance under strong adversaries with significantly higher efficiency, compared with other robust training methods.

Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Furong Huang
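To make the worst-case Bellman operator described above concrete, here is a hedged toy sketch (our own illustration, not the authors' code): on a small chain MDP, the adversary may swap the observed state for any neighbouring state before a fixed policy acts, and the backup takes the minimum over those bounded perturbations. Iterating the operator converges to the lower bound on the policy's value under attack.

```python
import numpy as np

# Toy chain MDP: action 1 = move right, action 0 = stay; goal reward at the end.
N_STATES, GAMMA = 5, 0.9
policy = np.array([1, 1, 1, 1, 0])
reward = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
# Bounded attack budget: the adversary may show a neighbouring state instead.
PERTURB = [[max(i - 1, 0), i, min(i + 1, N_STATES - 1)] for i in range(N_STATES)]

def step(s, a):
    return min(s + 1, N_STATES - 1) if a == 1 else s

def worst_case_backup(W):
    """One application of the worst-case Bellman operator: the adversary
    picks the perturbed observation that minimises the agent's return."""
    W_new = np.empty_like(W)
    for s in range(N_STATES):
        W_new[s] = min(reward[s] + GAMMA * W[step(s, policy[sp])]
                       for sp in PERTURB[s])
    return W_new

W = np.zeros(N_STATES)
for _ in range(200):  # fixed-point iteration (the operator is a contraction)
    W = worst_case_backup(W)
```

Here the adversary can trap the agent one step short of the goal (by showing it the goal state, where the policy stays put), so the worst-case value of every non-goal state collapses to zero even though the clean policy reaches the goal.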
Mon 10:00 a.m. - 11:00 a.m.
Safe RL Panel Discussion (Discussion Panel)
Animesh Garg, Marek Petrik, Melanie Zeilinger, Shie Mannor, Claire Tomlin, Ugo Rosolia, Dylan Hadfield-Menell
Mon 11:00 a.m. - 12:00 p.m.
Poster Session I
Mon 12:00 p.m. - 12:25 p.m.
Rohin Shah (Invited Talk)
Rohin Shah
Mon 12:25 p.m. - 12:30 p.m.
Rohin Shah (Talk Q/A Session)
Rohin Shah
Mon 12:30 p.m. - 12:55 p.m.
Angelique Taylor (Invited Talk)
Angelique Taylor
Mon 12:55 p.m. - 1:00 p.m.
Angelique Taylor (Talk Q/A Session)
Angelique Taylor
Mon 1:00 p.m. - 1:25 p.m.
Ugo Rosolia (Invited Talk)
Ugo Rosolia
Mon 1:25 p.m. - 1:30 p.m.
Ugo Rosolia (Talk Q/A Session)
Ugo Rosolia
Mon 1:30 p.m. - 2:30 p.m.
Safe RL Debate (Debate)
Sylvia Herbert, Animesh Garg, Emma Brunskill, Aleksandra Faust, Dylan Hadfield-Menell
Mon 2:30 p.m. - 2:35 p.m.

One promising approach to improving robustness and exploration in reinforcement learning is to take human feedback and in that way incorporate prior knowledge of the target environment. It is, however, often too expensive to obtain enough feedback of good quality. To mitigate this issue, we aim to rely on a group of multiple experts (and non-experts) with different skill levels to generate enough feedback. Such feedback can therefore be inconsistent and infrequent. In this work, we build upon prior work -- Advise, a Bayesian approach attempting to maximise the information gained from human feedback -- extending the algorithm to accept feedback from this larger group of humans, the trainers, while also estimating each trainer's reliability. We show how aggregating feedback from multiple trainers improves the total feedback's accuracy and makes the collection process easier in two ways. Firstly, this approach addresses the case of some of the trainers being adversarial. Secondly, having access to information about each trainer's reliability provides a second layer of robustness and offers valuable information to the people managing the whole system, improving overall trust in the system. It offers tools for improving the feedback collection process or modifying the reward function design if needed. We empirically show that our approach can accurately learn the reliability of each trainer and use it to maximise the information gained from the multiple trainers' feedback, even if some of the sources are adversarial.

Taku Yamagata, Ryan McConville, Raul Santos-Rodriguez
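A minimal sketch of the trainer-reliability idea above (our own toy, not the Advise extension itself): each trainer's reliability is tracked with a Beta posterior, updated according to agreement with the majority vote, so adversarial or noisy trainers are gradually down-weighted.

```python
class TrainerPool:
    """Per-trainer reliability as a Beta(alpha, beta) posterior over the
    probability that the trainer's feedback agrees with the consensus."""

    def __init__(self, n):
        self.alpha = [1.0] * n  # agreements + 1 (uniform prior)
        self.beta = [1.0] * n   # disagreements + 1

    def reliability(self, i):
        # Posterior mean of trainer i's agreement probability.
        return self.alpha[i] / (self.alpha[i] + self.beta[i])

    def update(self, votes):
        """Update every trainer from one round of binary feedback votes."""
        majority = 1 if sum(votes) * 2 >= len(votes) else 0
        for i, v in enumerate(votes):
            if v == majority:
                self.alpha[i] += 1
            else:
                self.beta[i] += 1
        return majority
```

A trainer who persistently disagrees with the consensus (e.g. an adversarial source) ends up with a low posterior reliability, which can then weight how much information is extracted from their feedback.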
Mon 2:35 p.m. - 2:40 p.m.

We develop algorithms for imitation learning from data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch onto, leading to poor policy performance. By utilizing the effect of past states on current states, we are able to break up these spurious correlations, an application of the econometric technique of instrumental variable regression. This insight leads to two novel algorithms, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator and one of a game-theoretic flavor (ResiduIL) that can be run offline. Both approaches are able to find policies that match the result of a query to an unconfounded expert. We find both algorithms compare favorably to non-causal approaches on simulated control problems.

Gokul Swamy, Sanjiban Choudhury, James Bagnell, Steven Wu
Mon 2:40 p.m. - 2:45 p.m.

Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we can find the optimal attack and how efficiently we can find it. Existing works on adversarial RL either use heuristics-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as a part of the environment, which can find the optimal adversary but may become intractable in a large state space. In this paper, we propose a novel attacking algorithm which has an RL-based "director" searching for the optimal policy perturbation, and an "actor" crafting state perturbations following the directions from the director (i.e. the actor executes targeted attacks). Our proposed algorithm, PA-AD, is theoretically optimal against an RL agent and significantly improves the efficiency compared with prior RL-based works in environments with large or pixel state spaces. Empirical results show that our proposed PA-AD universally outperforms state-of-the-art attacking methods in a wide range of environments. Our method can be easily applied to any RL algorithm to evaluate and improve its robustness.

Yanchao Sun, Ruijie Zheng, Yongyuan Liang, Furong Huang
Mon 2:45 p.m. - 3:45 p.m.
Poster Session II
Mon 3:45 p.m. - 4:00 p.m.
Closing Remarks (Short Talk)
Brijen Thananjeyan
-
[ Visit Poster at Spot A1 in Virtual World ]

Many real-life scenarios require humans to make difficult trade-offs: do we always follow all the traffic rules or do we violate the speed limit in an emergency? These scenarios force us to evaluate the trade-off between collective norms and our own personal objectives. To this end, we propose a novel inverse reinforcement learning (IRL) method for learning implicit hard and soft constraints from demonstrations, enabling agents to quickly adapt to new settings. In addition, learning soft constraints over states, actions, and state features allows agents to transfer this knowledge to new domains that share similar aspects.

Arie Glazier, Andrea Loreggia, Nicholas Mattei, Taher Rahgooy, Francesca Rossi, Brent Venable
-
[ Visit Poster at Spot D6 in Virtual World ]

We propose CAP, a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives. First, CAP inflates predicted costs using an uncertainty-based penalty. Theoretically, we show that policies that satisfy this conservative cost constraint are guaranteed to also be feasible in the true environment. We further show that this guarantees the safety of all intermediate solutions during RL training. Further, CAP adaptively tunes this penalty during training using true cost feedback from the environment. We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms.

Jason Ma, Andrew Shen, Osbert Bastani, Dinesh Jayaraman
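The conservative-and-adaptive penalty described above can be sketched in a few lines (a hedged illustration with names of our own choosing, not the paper's code): predicted costs are inflated by an uncertainty-proportional penalty, and the penalty coefficient is adapted using true cost feedback from the environment.

```python
import numpy as np

def conservative_cost(pred_cost, uncertainty, kappa):
    # Inflate the model's cost prediction by an uncertainty-based penalty,
    # so policies satisfying the inflated constraint remain feasible under
    # the true (unknown) dynamics with high probability.
    return pred_cost + kappa * uncertainty

def adapt_kappa(kappa, true_episode_cost, budget, lr=0.1):
    # Adaptive tuning from real cost feedback: grow the penalty after a
    # violation of the cost budget, relax it (down to zero) otherwise.
    return max(0.0, kappa + lr * (true_episode_cost - budget))

kappa, budget = 1.0, 5.0
kappa = adapt_kappa(kappa, true_episode_cost=7.0, budget=budget)  # violation
print(conservative_cost(pred_cost=3.0, uncertainty=0.5, kappa=kappa))
```

The model uncertainty here would come from, e.g., disagreement in an ensemble of learned dynamics models; the point of the sketch is only the penalty-and-adaptation loop.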
-
[ Visit Poster at Spot D5 in Virtual World ]

Data poisoning for reinforcement learning has historically focused on general performance degradation, and targeted attacks have been successful via perturbations that involve control of the victim's policy and rewards. We introduce an insidious poisoning attack for reinforcement learning which causes agent misbehavior only at specific target states - all while minimally modifying a small fraction of training observations without assuming any control over policy or reward. We accomplish this by adapting a recent technique, gradient alignment, to reinforcement learning. We test our method and demonstrate success in two Atari games of varying difficulty.

Harrison Foley, Liam Fowl, Tom Goldstein, Gavin Taylor
-
[ Visit Poster at Spot D4 in Virtual World ]

How can we synthesize a safe and near-optimal control policy for a partially-observed system, if all we are given is one historical input/output trajectory that has been corrupted by noise? To address this challenge, we suggest a novel data-driven controller synthesis method that exploits recent results in controller parametrizations for partially-observed systems and analysis tools from robust control. We provide safety certificates for the learned control policy. Furthermore, the suboptimality of the proposed method shrinks to zero, and linearly so, in terms of the model mismatch incurred during a preliminary system identification phase.

Luca Furieri
-
[ Visit Poster at Spot D3 in Virtual World ]

Constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds. In this class of problems, we show a simple example in which the desired optimal policy cannot be induced by any linear combination of rewards. Hence, there exist constrained reinforcement learning problems for which neither regularized nor classical primal-dual methods yield optimal policies. This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods as the portion of the dynamics that drives the multipliers' evolution. This approach provides a systematic state augmentation procedure that is guaranteed to solve reinforcement learning problems with constraints. Thus, while primal-dual methods can fail at finding optimal policies, running the dual dynamics while executing the augmented policy yields an algorithm that provably samples actions from the optimal policy.

Miguel Calvo-Fullana, Santiago Paternain, Alejandro Ribeiro
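A hedged toy sketch of the state-augmentation idea above (our own construction, not the paper's algorithm): the policy observes (state, lambda), and the dual dynamics update lambda from the constraint reward accumulated while the augmented policy executes. The multiplier oscillates around the value at which the policy mixes the behaviors needed to satisfy the constraint.

```python
def augmented_policy(state, lam):
    # A lambda-dependent policy: switch to the cautious action (0) once
    # the multiplier signals that the constraint is binding.
    return 0 if lam > 1.0 else 1

def dual_step(lam, constraint_return, threshold, eta=0.05):
    # Dual dynamics: grow lambda while the constraint ("accumulate at
    # least `threshold` constraint reward") is violated, shrink otherwise,
    # projected onto lambda >= 0.
    return max(0.0, lam + eta * (threshold - constraint_return))

lam, threshold = 0.0, 0.8
for _ in range(100):
    action = augmented_policy(state=0, lam=lam)
    constraint_return = 1.0 if action == 0 else 0.0  # cautious action satisfies
    lam = dual_step(lam, constraint_return, threshold)
```

Running the dual dynamics while executing the lambda-conditioned policy makes the action distribution alternate between the two behaviors, which is the mechanism by which the augmented approach can realize optimal policies that no single fixed linear reward combination induces.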
-
[ Visit Poster at Spot D2 in Virtual World ]

We consider safety in simultaneous learning and control of discrete-time linear time-invariant systems. We provide rigorous confidence bounds on the learned model of the system based on the number of utilized state measurements. These bounds are used to modify control inputs to the system via an optimization problem with potentially time-varying safety constraints. We prove that the state can only exit the safe set with small probability, provided a feasible solution to the safety-constrained optimization exists. This optimization problem is then reformulated in a more computationally-friendly format by tightening the safety constraints to account for model uncertainty during learning. The tightening decreases as the confidence in the learned model improves. We finally prove that, under persistence of excitation, the tightening becomes negligible as more measurements are gathered.

Farhad Farokhi, Alex Leong, Mohammad Zamani, Iman Shames
-
[ Visit Poster at Spot D1 in Virtual World ]

Safe exploration is critical for using reinforcement learning in risk-sensitive environments. Recent work learns risk measures which measure the probability of violating constraints, which can then be used to enable safety. However, learning such risk measures requires significant interaction with the environment, resulting in excessive constraint violations during learning. Furthermore, these measures are not easily transferable to new environments in which the agent may be deployed. We cast safe exploration as an offline meta-reinforcement learning problem, where the objective is to leverage examples of safe and unsafe behavior across a range of environments to quickly adapt learned risk measures to a new environment with previously unseen dynamics. We then propose MEta-learning for Safe Adaptation (MESA), an approach for meta-learning a risk measure for safe reinforcement learning. Simulation experiments across 3 continuous control domains suggest that MESA can leverage offline data from a range of different environments to reduce constraint violations in unseen environments by up to a factor of 2 while maintaining task performance.

Michael Luo, Ashwin Balakrishna, Brijen Thananjeyan, Suraj Nair, Julian Ibarz, Jie Tan, Chelsea Finn, Ion Stoica, Ken Goldberg
-
[ Visit Poster at Spot D0 in Virtual World ]

Identifying uncertainty and taking mitigating actions is crucial for safe and trustworthy reinforcement learning agents, especially when deployed in high-risk environments. In this paper, risk sensitivity is promoted in a model-based RL algorithm by exploiting the ability of a bootstrap ensemble of dynamics models to estimate epistemic uncertainty in the environment. We propose uncertainty guided cross-entropy method planning, which penalises proposed actions that result in high variance model rollouts, guiding the agent to known areas of the state space with low uncertainty. Experiments display the agent's ability to identify uncertain regions of the state space during planning and to take actions that keep it within high-confidence areas, without the requirement of explicit constraints. The result is a reduction in reward attained, displaying a trade-off between risk and return.

Stefan Radic Webster
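A hedged sketch of uncertainty-guided CEM planning (toy dynamics and names are ours, not the paper's implementation): candidate action sequences are scored by ensemble-mean reward minus a penalty on the variance of rollouts across a small bootstrap ensemble of dynamics models, and CEM iteratively refits its sampling distribution to the best-scoring elites.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three "bootstrap" models that disagree more in poorly-known regions
# (here: disagreement grows with |state|, as a stand-in for epistemic
# uncertainty far from the training data).
ensemble = [lambda s, a, b=b: s + a + 0.1 * b * abs(s) for b in (-1, 0, 1)]

def score(s0, actions, risk_coef=5.0):
    """Ensemble-mean reward minus a rollout-disagreement penalty."""
    total, disagreement = 0.0, 0.0
    states = [s0] * len(ensemble)
    for a in actions:
        states = [f(s, a) for f, s in zip(ensemble, states)]
        total += -abs(np.mean(states) - 1.0)  # reward: stay near s = 1
        disagreement += np.var(states)        # epistemic-uncertainty proxy
    return total - risk_coef * disagreement

def cem_plan(s0, horizon=3, iters=20, pop=64, elites=8):
    """Cross-entropy method: refit a Gaussian over action sequences."""
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, horizon))
        best = cand[np.argsort([-score(s0, c) for c in cand])[:elites]]
        mu, sigma = best.mean(0), best.std(0) + 1e-3
    return mu
```

The variance penalty steers the planner toward action sequences whose rollouts the ensemble agrees on, implementing risk sensitivity without an explicit constraint.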
-
[ Visit Poster at Spot C6 in Virtual World ]

Deep neural networks have made it possible for reinforcement learning algorithms to learn from raw high dimensional inputs. This jump in progress has led deep reinforcement learning algorithms to be deployed in many different fields, from financial markets to biomedical applications. The vulnerability of deep neural networks to imperceptible, specifically crafted perturbations has been inherited by deep reinforcement learning agents, and several adversarial training methods have been proposed to overcome this vulnerability. In this paper we focus on state-of-the-art adversarial training algorithms and investigate their robustness to semantically meaningful natural perturbations ranging from changes in brightness to rotation. We conduct several experiments in the OpenAI Atari environments, and find that state-of-the-art adversarially trained neural policies are more sensitive to natural perturbations than vanilla trained agents. We believe our investigation lays out intriguing properties of adversarial training and our observations can help build robust and generalizable neural policies.

Ezgi Korkmaz
-
[ Visit Poster at Spot C5 in Virtual World ]

Using neural networks to learn dynamical models from data is a powerful technique but can also suffer from problems like brittleness, overfitting and lack of safety guarantees. These problems are particularly acute when a distributional shift is observed in the dynamics of the same underlying system, caused by different values of its physical parameters. Casting the learned models in the framework of linear systems enhances our abilities to analyse and control them. However, it does not stop them from failing when having to extrapolate outside of their training distribution. By globally linearising the system's dynamics, using ideas from Deep Koopman Theory, and combining them with off-the-shelf estimation techniques like Kalman filtering, we demonstrate a way of knowing when and what the model does not know, and hence how much we can trust its predictions. We showcase our ideas and results in the context of different rod lengths of the classical pendulum control environment.

Yordan Hristov, Ram Ramamoorthy
-
[ Visit Poster at Spot C4 in Virtual World ]

Reinforcement learning has been shown to be an effective strategy for automatically training policies for challenging control problems. Focusing on non-cooperative multi-agent systems, we propose a novel reinforcement learning framework for training joint policies that form a Nash equilibrium. In our approach, rather than providing low-level reward functions, the user provides high-level specifications that encode the goal of each agent. Then, guided by the structure of the specifications, our algorithm searches over policies to identify one that provably forms an $\epsilon$-Nash equilibrium (with high probability). Importantly, it prioritizes policies in a way that maximizes social welfare across all agents. Our empirical evaluation demonstrates that our algorithm computes equilibrium policies with high social welfare, whereas state-of-the-art baselines either fail to compute Nash equilibria or compute ones with comparatively lower social welfare.

Kishor Jothimurugan, Suguman Bansal, Osbert Bastani, Rajeev Alur
-
[ Visit Poster at Spot C3 in Virtual World ]

Under-voltage load shedding has been considered a standard approach to recover the voltage stability of the electric power grid under emergency conditions, yet this scheme usually trips a massive amount of load inefficiently. Reinforcement learning (RL) has been adopted as a promising approach to circumvent these issues; however, RL approaches usually cannot guarantee the safety of the systems under control. In this paper, we discuss two novel safe RL approaches, namely a constrained optimization approach and a barrier function-based approach, that can safely recover voltage under emergency events. These methods are general and can be applied to other safety-critical control problems. Numerical simulations on the 39-bus IEEE benchmark demonstrate the effectiveness of the proposed safe RL emergency control.

Thanh Long Vu, Sayak Mukherjee, Renke Huang, Qiuhua Huang
-
[ Visit Poster at Spot C2 in Virtual World ]

We study distributionally robust chance constrained programs (DRCCP) with maximum mean discrepancy (MMD) ambiguity sets. We provide an exact reformulation of those problems such that the uncertain constraint is satisfied with a probability larger than a desired risk-level for distributions within the MMD ball around the empirical distribution. Additionally, we highlight how the ambiguity set can be connected to known statistical bounds on the MMD to obtain statistical guarantees for the data-driven DRCCP. Lastly, we validate our reformulation on a numerical example and compare it to the robust scenario approach.

Yassine Nemmour, Bernhard Schölkopf, Jia-Jie Zhu
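As a concrete companion to the abstract above (our own illustration, not the paper's reformulation), here is the quantity that defines the ambiguity ball: a biased empirical estimate of the squared maximum mean discrepancy between two samples under an RBF kernel. The DRCCP requires the chance constraint to hold for every distribution whose MMD to the empirical distribution is below a chosen radius.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # RBF (Gaussian) kernel between two points.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def mmd_sq(X, Y, gamma=1.0):
    """Biased empirical estimate of MMD^2 between samples X and Y:
    E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)] under the empirical measures."""
    kxx = np.mean([rbf(a, b, gamma) for a in X for b in X])
    kyy = np.mean([rbf(a, b, gamma) for a in Y for b in Y])
    kxy = np.mean([rbf(a, b, gamma) for a in X for b in Y])
    return kxx + kyy - 2 * kxy
```

MMD^2 is zero when the two samples coincide and grows as they separate; statistical bounds on the MMD between the empirical and true distributions are what give the data-driven DRCCP its guarantees.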
-
[ Visit Poster at Spot C1 in Virtual World ]

Feature counts play a crucial role when computing good reward weights in inverse reinforcement learning. Despite their importance, little work has focused on developing better methods for estimating feature counts. In this work, we propose a new method for estimating feature counts for scenarios with a small number of long demonstrations. Most existing algorithms perform poorly in this scenario. In particular, we propose two new algorithms, E-DLS and E-SLS, which can efficiently use a small number of long demonstrations to estimate feature counts. We show that the E-SLS estimates are unbiased, making it the first such estimation algorithm. Our experimental results on benchmark problems demonstrate better learned reward weights when feature counts are estimated with E-DLS and E-SLS compared to other popular methods.

Gerard Donahue, Brendan Crowe, Marek Petrik, Daniel Brown
-
[ Visit Poster at Spot C0 in Virtual World ]

With the increasing emphasis on safe autonomy for robots, model-based safe control approaches such as Control Barrier Functions have been extensively studied to ensure guaranteed safety during inter-robot interactions. In this paper, we introduce the Parametric Control Barrier Function (Parametric-CBF), a novel variant of the traditional Control Barrier Function that extends its expressivity in describing different safe behaviors among heterogeneous robots. Instead of assuming cooperative and homogeneous robots using the same safe controllers, the ego robot models the neighboring robots' underlying safe controllers through different parametric-CBFs fit to observed data. Given the learned parametric-CBFs and proven forward invariance, the ego robot gains greater flexibility to coordinate with other heterogeneous robots with improved efficiency while enjoying formally provable safety guarantees. We demonstrate the usage of Parametric-CBF in behavior prediction and adaptive safe control in the ramp merging scenario from autonomous driving applications. Compared to the traditional CBF, the Parametric-CBF has the advantage of better capturing drivers' characteristics, which also allows for a richer description of robot behavior in the context of safe control. Numerical simulations are given to validate the effectiveness of the proposed method.

Yiwei Lyu, John Dolan
-
[ Visit Poster at Spot B6 in Virtual World ]

We consider the problem of inferring constraints from demonstrations from a Bayesian perspective. We propose Bayesian Inverse Constraint Reinforcement Learning (BICRL), a novel approach that infers a probability distribution over constraints from demonstrated trajectories. The main advantages of BICRL, compared to prior constraint inference algorithms, are (1) the freedom to infer constraints from partial trajectories and even from disjoint state-action pairs, (2) the ability to learn constraints from suboptimal demonstrations and to learn constraints in stochastic environments, and (3) the opportunity to estimate a posterior distribution over constraints that enables active learning and robust policy optimization.

Dimitris Papadimitriou, Daniel Brown, Usman Anwar
-
[ Visit Poster at Spot B5 in Virtual World ]

Safety-critical applications require controllers/policies that can guarantee safety with high confidence. The control barrier function is a useful tool to guarantee safety if we have access to the ground-truth system dynamics. In practice, we have inaccurate knowledge of the system dynamics, which can lead to unsafe behaviors due to unmodeled residual dynamics. Learning the residual dynamics with deterministic machine learning models can prevent the unsafe behavior but can fail when the predictions are imperfect. In this situation, a probabilistic learning method that reasons about the uncertainty of its predictions can help provide robust safety margins. In this work, we use a Gaussian process to model the projection of the residual dynamics onto a control barrier function. We propose a novel optimization procedure to generate safe controls that can guarantee safety with high probability. The safety filter is provided with the ability to reason about the uncertainty of the predictions from the GP. We provide a proof-of-concept on a Segway platform. The probabilistic approach is able to reduce the number of safety violations by 50% compared to the deterministic approach with a neural network.

Sulin Liu, Athindran Ramesh Kumar, Jaime Fisac, Ryan Adams, Peter J Ramadge
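A hedged toy sketch of the probabilistic safety filter described above (our own scalar-control simplification, not the paper's implementation): the CBF decrease condition is tightened by a margin proportional to the GP's predictive standard deviation on the residual dynamics, and the desired control is minimally modified to satisfy the tightened condition. In one control dimension the quadratic program has a closed-form projection.

```python
def safe_filter(u_des, h, dh_drift, dh_ctrl, gp_mean, gp_std,
                alpha=1.0, beta=2.0):
    """Project u_des onto {u : dh_drift + dh_ctrl*u + gp_mean
    - beta*gp_std + alpha*h >= 0} (scalar control, closed form).
    gp_mean/gp_std are the GP's prediction of the residual dynamics
    projected onto the barrier function; beta sets the confidence level."""
    slack = dh_drift + gp_mean - beta * gp_std + alpha * h
    if dh_ctrl * u_des + slack >= 0:  # desired control already safe w/ margin
        return u_des
    return -slack / dh_ctrl           # boundary of the tightened safe set

# When the GP is uncertain (large gp_std), the filter intervenes earlier
# and pushes the control further into the safe set.
u_certain = safe_filter(u_des=-2.0, h=0.5, dh_drift=0.0, dh_ctrl=1.0,
                        gp_mean=0.0, gp_std=0.0)
u_uncertain = safe_filter(u_des=-2.0, h=0.5, dh_drift=0.0, dh_ctrl=1.0,
                          gp_mean=0.0, gp_std=0.5)
```

The uncertainty-dependent margin is what lets the filter trade conservatism for confidence: with a well-trained GP the margin shrinks and the filter intervenes only when genuinely necessary.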
-
[ Visit Poster at Spot B4 in Virtual World ]

In real-world sequential decision problems, exploration is expensive, and the risk of expert decision policies must be evaluated from limited data. In this setting, Monte Carlo (MC) risk estimators are typically used to estimate the risks associated with decision policies. While these estimators have the desired low bias property, they often suffer from large variance. In this paper, we consider the problem of minimizing the asymptotic mean squared error and hence variance of MC risk estimators. We show that by carefully choosing the data sampling policy (behavior policy), we can obtain low variance estimates of the risk of any given decision policy.

Elita Lobo, Marek Petrik, Shankar Subramanian
-
[ Visit Poster at Spot B3 in Virtual World ]

Safe exploration is critical to using reinforcement learning in complex, hazardous, real-world environments for which offline data aren't available. We propose a nonlinear safety layer that, unlike prior work, requires no restrictions on the policy or environment, and doesn't require offline training. We demonstrate that a nonlinear model has higher prediction accuracy than a similar linear model and that a linear safety layer fails to learn a non-conservative policy in Safety Gym environments where the nonlinear layer does not.

Eleanor Quint, Garrett Wirka, Stephen Scott
-
[ Visit Poster at Spot B2 in Virtual World ]

Recent work in AI safety has highlighted that in sequential decision making, objectives are often underspecified or incomplete. This potentially allows the AI agent to make undesirable changes to the world while achieving its given objective. A number of recent papers have proposed avoiding such negative side effects by giving an auxiliary reward to the agent for preserving its own ability to complete tasks or gain reward. We argue that effects on others need to be explicitly considered and provide a formulation that generalizes prior work. We experimentally investigate our approach with RL agents in gridworlds.

Parand Alizadeh Alamdari, Toryn Klassen, Rodrigo Toro Icarte, Sheila McIlraith
-
[ Visit Poster at Spot B1 in Virtual World ]

While high-return policies can be learned on a wide range of systems through reinforcement learning, actual deployment of the resulting policies is often hindered by their sensitivity to future changes in the environment. Adversarial training has shown some promise in producing policies that retain better performance under environment shifts, but existing approaches only consider robustness to specific kinds of perturbations that must be specified a priori. As possible changes in future dynamics are typically unknown in practice, we instead seek a policy that is robust to a variety of realistic changes only encountered at test-time. Towards this goal, we propose a new adversarial variant of soft actor-critic, which produces policies on MuJoCo continuous control tasks that are simultaneously more robust across various environment shifts, such as changes to friction and body mass.

Samuel Stanton, Rasool Fakoor, Jonas Mueller, Andrew Gordon Wilson, Alexander Smola
-
[ Visit Poster at Spot B0 in Virtual World ]

Modern nonlinear control theory seeks to endow systems with properties of stability and safety, and has been deployed successfully in multiple domains. Despite this success, model uncertainty remains a significant challenge in synthesizing controllers, leading to degraded performance. Reinforcement learning (RL) algorithms, on the other hand, have found success in controlling systems with no model at all, but their use beyond simulated applications is limited, one main reason being the absence of safety and stability guarantees during the learning process. To address this issue, we develop a controller architecture that combines a model-free RL controller with model-based controllers and online learning of the unknown system dynamics, to guarantee stability and safety during learning. This general framework leverages the success of RL algorithms to learn high-performance controllers, while the proposed model-based controllers guarantee safety and guide the learning process by constraining the set of explorable policies. We validate this method in simulation of a robotic Segway platform.

Carlos Montenegro, Carlos Rodríguez
-
[ Visit Poster at Spot A6 in Virtual World ]

Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we can find the optimal attack and how efficiently we can find it. Existing works on adversarial RL either use heuristics-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as a part of the environment, which can find the optimal adversary but may become intractable in a large state space. In this paper, we propose a novel attacking algorithm which has an RL-based "director" searching for the optimal policy perturbation, and an "actor" crafting state perturbations following the directions from the director (i.e. the actor executes targeted attacks). Our proposed algorithm, PA-AD, is theoretically optimal against an RL agent and significantly improves the efficiency compared with prior RL-based works in environments with large or pixel state spaces. Empirical results show that our proposed PA-AD universally outperforms state-of-the-art attacking methods in a wide range of environments. Our method can be easily applied to any RL algorithm to evaluate and improve its robustness.

Yanchao Sun, Ruijie Zheng, Yongyuan Liang, Furong Huang
[ Visit Poster at Spot A5 in Virtual World ]

We develop algorithms for imitation learning from data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch onto, leading to poor policy performance. By utilizing the effect of past states on current states, we are able to break up these spurious correlations, an application of the econometric technique of instrumental variable regression. This insight leads to two novel algorithms, one of a generative-modeling flavor (\texttt{DoubIL}) that can utilize access to a simulator and one of a game-theoretic flavor (\texttt{ResiduIL}) that can be run offline. Both approaches are able to find policies that match the result of a query to an unconfounded expert. We find both algorithms compare favorably to non-causal approaches on simulated control problems.
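The underlying econometric tool, instrumental variable regression, can be illustrated with a generic two-stage least squares sketch: the past state acts as the instrument `Z` that strips the confounded variation out of the current state `X` before regressing the expert action `Y`. This is a textbook 2SLS example, not the paper's DoubIL or ResiduIL algorithms:

```python
import numpy as np

def two_stage_ls(Z, X, Y):
    """Two-stage least squares with instrument Z.

    Stage 1 projects X onto the instrument, removing variation that is
    correlated with the unobserved confounder; stage 2 regresses the
    actions on the deconfounded states.
    """
    B1, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ B1
    B2, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)
    return B2

# Synthetic confounded data: u corrupts both the state and the action
rng = np.random.default_rng(0)
n = 20000
Z = rng.normal(size=(n, 1))        # past state (instrument)
u = rng.normal(size=(n, 1))        # unobserved confounder
X = 2.0 * Z + u                    # current state, confounded by u
Y = 1.5 * X + 3.0 * u              # expert action latches onto u too
beta = two_stage_ls(Z, X, Y)       # recovers the causal effect ~1.5
```

A naive regression of `Y` on `X` here would be biased upward by the spurious correlation through `u`; 2SLS recovers the causal coefficient.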

Gokul Swamy, Sanjiban Choudhury, James Bagnell, Steven Wu
[ Visit Poster at Spot A4 in Virtual World ]

One promising approach to improving robustness and exploration in reinforcement learning is to incorporate prior knowledge of the target environment through human feedback. It is, however, often too expensive to obtain enough feedback of good quality. To mitigate this issue, we rely on a group of multiple experts (and non-experts) with different skill levels to generate enough feedback, which can therefore be inconsistent and infrequent. In this work, we build upon prior work -- Advise, a Bayesian approach that attempts to maximise the information gained from human feedback -- extending the algorithm to accept feedback from this larger group of humans, the trainers, while also estimating each trainer's reliability. We show how aggregating feedback from multiple trainers improves the total feedback's accuracy and makes the collection process easier in two ways. Firstly, the approach handles the case where some of the trainers are adversarial. Secondly, access to each trainer's estimated reliability provides a second layer of robustness and offers valuable information to the people managing the system, improving overall trust in it: it provides tools for refining the feedback collection process or modifying the reward function design if needed. We empirically show that our approach can accurately learn the reliability of each trainer and use it to maximise the information gained from the trainers' feedback, even when some of the sources are adversarial.
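A much-simplified version of reliability-weighted feedback aggregation can be sketched as a weighted vote plus Beta-posterior-style agreement counts; the binary +1/-1 feedback, the consensus rule, and the update below are illustrative assumptions, not the paper's Bayesian extension of Advise:

```python
import numpy as np

def aggregate_feedback(votes, reliability):
    """Weight each trainer's binary feedback (+1 good / -1 bad) by
    estimated reliability in [0, 1]; adversarial trainers (reliability
    below 0.5) receive negative weight."""
    score = np.dot(votes, 2.0 * reliability - 1.0)
    return 1 if score >= 0 else -1

def update_reliability(counts, votes, consensus):
    """Beta-posterior-style counts of (agreements, disagreements)
    per trainer; returns the posterior-mean reliabilities."""
    for i, v in enumerate(votes):
        counts[i][0 if v == consensus else 1] += 1
    return [a / (a + b) for a, b in counts]

# Three trainers; the third consistently votes against the others
counts = [[1, 1] for _ in range(3)]       # Beta(1, 1) prior each
rel = [0.5, 0.5, 0.5]
for _ in range(20):
    votes = [1, 1, -1]
    consensus = aggregate_feedback(np.array(votes), np.array(rel))
    rel = update_reliability(counts, votes, rel and consensus)
```

After a few rounds the dissenting trainer's estimated reliability falls below 0.5, so its feedback is down-weighted (in fact inverted) in subsequent aggregations.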

Taku Yamagata, Ryan McConville, Raul Santos-Rodriguez
[ Visit Poster at Spot A3 in Virtual World ]

It is well known that current deep reinforcement learning (RL) agents are particularly vulnerable to adversarial perturbations. Therefore, it is important to develop a vulnerability-aware algorithm that can improve the performance of an RL agent under any attack with a bounded budget. Existing robust training approaches in deep RL either directly use adversarial training, whose heuristically generated attacks might be non-optimal, or learn an RL-based strong adversary, which doubles the computational and sample complexity of the training process. In this work, we formalize the notion of the lower bound of the policy value under bounded attacks via a proposed worst-case Bellman operator. By directly estimating and improving the worst-case value of an agent under attack, we develop a robust training method that efficiently improves the robustness of RL policies without learning an adversary. Empirical evaluations show that our algorithm universally achieves state-of-the-art performance under strong adversaries, with significantly higher efficiency than other robust training methods.
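The flavor of a worst-case Bellman backup can be shown on a tabular MDP: the adversary may swap the true state for any "neighbor" within the perturbation budget, and the agent's greedy action is evaluated under the worst admissible observation. This tabular toy (with an assumed `neighbors` budget set) is only a sketch of the operator's idea, not the paper's deep-RL estimator:

```python
import numpy as np

def worst_case_backup(V, P, R, neighbors, gamma=0.99):
    """One application of a worst-case Bellman operator.

    P[s, a, n] are transition probabilities, R[s, a] rewards, and
    neighbors[s] lists the states the adversary can make s look like.
    The agent acts greedily on the (possibly perturbed) observation.
    """
    nS, _ = R.shape
    Q = R + gamma * np.einsum('san,n->sa', P, V)
    V_new = np.empty(nS)
    for s in range(nS):
        # adversary picks the observation whose greedy action is worst for s
        V_new[s] = min(Q[s, np.argmax(Q[sp])] for sp in neighbors[s])
    return V_new

# Deterministic 2-state, 2-action MDP: action a moves to state a
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
neighbors = {0: [0, 1], 1: [1]}    # the adversary can disguise state 0 as 1
V = worst_case_backup(np.zeros(2), P, R, neighbors)
```

In state 0 the adversary's disguise forces the agent into the zero-reward action, so the worst-case value there drops to 0 while the unperturbable state 1 keeps its full value.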

Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Furong Huang
[ Visit Poster at Spot A2 in Virtual World ]

We propose a data-driven approach to control policy design from an offline data set. Contraction theory enables constructing a policy-learning framework that makes the closed-loop system trajectories inherently convergent towards a unique trajectory. At the technical level, identifying the contraction metric, i.e., the distance metric with respect to which a robot's trajectories exhibit contraction, is often non-trivial. We therefore propose to jointly learn the control policy and its corresponding contraction metric from offline data. To achieve this, we learn the robot's dynamical model from an offline data set consisting of the robot's state and input trajectories. Using this learned dynamical model, we propose a data augmentation algorithm for learning contraction policies. We evaluate the performance of our proposed framework on simulated robotic goal-reaching tasks and demonstrate that enforcing contraction results in faster convergence.
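The contraction property itself can be checked empirically by tracking the distance between two nearby closed-loop trajectories; the diagnostic below uses the Euclidean metric and a stable linear map as a stand-in for a learned policy, so it is only an illustration of what contraction means, not the paper's joint metric-and-policy learning:

```python
import numpy as np

def contraction_rate(step, x0, y0, T=50):
    """Estimate the per-step contraction rate of a closed-loop map by
    following two nearby trajectories and measuring how fast the
    distance between them shrinks (rate < 1 indicates contraction
    in the Euclidean metric)."""
    x, y = np.array(x0, float), np.array(y0, float)
    d0 = np.linalg.norm(x - y)
    for _ in range(T):
        x, y = step(x), step(y)
    dT = np.linalg.norm(x - y)
    return (dT / d0) ** (1.0 / T)

# Linear closed loop x_{t+1} = A x_t with spectral radius < 1 contracts
A = np.array([[0.8, 0.1],
              [0.0, 0.7]])
rate = contraction_rate(lambda x: A @ x, [1.0, 0.0], [1.2, 0.1])
```

For this map the estimated rate sits near the dominant eigenvalue 0.8, confirming geometric convergence of nearby trajectories toward one another.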

Navid Rezazadeh, Negar Mehr
[ Visit Poster at Spot A0 in Virtual World ]

We propose a safety-guaranteed planning and control framework for unmanned surface vessels (USVs) that uses Gaussian processes (GPs) to learn uncertainties. The uncertainties encountered by USVs, including external disturbances and model mismatches, are potentially state-dependent, time-varying, and hard to capture with constant models. GPs are a powerful learning-based tool that can be integrated with a model-based planning and control framework employing a Hamilton-Jacobi differential game formulation. This combination yields less conservative trajectories and safety-guaranteeing control strategies. We demonstrate the proposed framework in simulations and experiments on a CLEARPATH Heron USV.
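The GP ingredient can be sketched with a plain posterior-mean regression on disturbance residuals; the RBF kernel, its hyperparameters, and the 1-D setting are generic assumptions, not the hyperparameters or formulation used for the Heron USV:

```python
import numpy as np

def gp_posterior_mean(X, y, Xq, ell=0.5, sf=1.0, sn=0.05):
    """GP regression posterior mean with an RBF kernel, as might be used
    to learn a state-dependent disturbance from residuals between a
    nominal model and observed dynamics (1-D inputs, generic sketch).

    ell: length scale, sf: signal std, sn: observation-noise std.
    """
    k = lambda A, B: sf**2 * np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)
    K = k(X, X) + sn**2 * np.eye(len(X))      # noisy kernel matrix
    return k(Xq, X) @ np.linalg.solve(K, y)   # E[f(Xq) | X, y]

# Fit a smooth synthetic disturbance and evaluate back at the data
X = np.linspace(-2.0, 2.0, 40)
y = np.sin(X)
m = gp_posterior_mean(X, y, X)
```

The posterior mean closely reproduces the smooth disturbance at the training inputs; the matching posterior variance (omitted here) is what a Hamilton-Jacobi-style framework would consume as an uncertainty bound.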

Shuhao Zhang, Seth Siriya

Author Information

Ashwin Balakrishna (UC Berkeley)

I am a second year PhD student in Robotics and Artificial Intelligence at UC Berkeley and am advised by Professor Ken Goldberg of the UC Berkeley AUTOLAB. My research interests are in developing algorithms for imitation and reinforcement learning that are reliable and robust enough to safely deploy on robotic systems. I am currently interested in hybrid algorithms between imitation and reinforcement learning to leverage demonstrations to either guide exploration in RL or perform reward inference. I received my Bachelor’s Degree in Electrical Engineering at Caltech in 2018, and enjoy watching/playing tennis, hiking, and eating interesting foods.

Brijen Thananjeyan (UC Berkeley)
Daniel Brown (UC Berkeley)
Marek Petrik (University of New Hampshire)
Melanie Zeilinger (ETH Zurich)
Sylvia Herbert (University of California, San Diego (UCSD))

More from the Same Authors