
Workshop
Reinforcement Learning for Real Life (RL4RealLife) Workshop
Yuxi Li · Emma Brunskill · Minmin Chen · Omer Gottesman · Lihong Li · Yao Liu · Zhiwei Tony Qin · Matthew Taylor

Sat Dec 03 05:30 AM -- 03:00 PM (PST) @ Theater A

Discover how to improve the adoption of RL in practice, by discussing key research problems, SOTA, and success stories / insights / lessons w.r.t. practical RL algorithms, practical issues, and applications with leading experts from both academia and industry @ NeurIPS 2022 RL4RealLife workshop.

 Sat 5:30 a.m. - 6:25 a.m. posters (for early birds, optional) (posters) 🔗 Sat 6:25 a.m. - 6:30 a.m. opening remarks 🔗 Sat 6:31 a.m. - 7:00 a.m. Invited talk: Deep Reinforcement Learning for Real-World Inventory Management (talk) We present a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, we show that several policy learning approaches are competitive with or outperform classical baseline approaches. In order to train these algorithms, we develop novel techniques to convert historical data into a simulator and present a collection of results that motivate this approach. We also present a model-based reinforcement learning procedure (Direct Backprop) to solve the dynamic periodic review inventory control problem by constructing a differentiable simulator. Under a variety of metrics Direct Backprop outperforms model-free RL and newsvendor baselines, in both simulations and real-world deployments. Bio: Dhruv Madeka is a Principal Machine Learning Scientist at Amazon. His current research focuses on applying Deep Reinforcement Learning to supply chain problems. Dhruv has also worked on developing generative and supervised deep learning models for probabilistic time series forecasting. In the past - Dhruv worked in the Quantitative Research team at Bloomberg LP, developing open source tools for the Jupyter ecosystem and conducting advanced mathematical research in derivatives pricing, quantitative finance and election forecasting. Dhruv Madeka 🔗 Sat 7:01 a.m. - 7:30 a.m. Invited talk: Scaling reinforcement learning in the real world, from gaming to finance to manufacturing (talk)    Reinforcement learning is transforming industries from gaming to robotics to manufacturing. 
This talk showcases how a variety of industries are adopting reinforcement learning to overhaul their businesses, from changing the nature of game development to designing the boat that won the America's Cup. These industries leverage Ray, a distributed framework for scaling Python applications and machine learning applications. Ray is used by companies across the board from Uber to OpenAI to Shopify to Amazon to scale their machine learning training, inference, data ingest, and reinforcement learning workloads. Bio: Robert Nishihara is one of the creators of Ray, a distributed framework for scaling Python applications and machine learning applications. Ray is used by companies across the board from Uber to OpenAI to Shopify to Amazon to scale their machine learning training, inference, data ingest, and reinforcement learning workloads. He is one of the co-founders and CEO of Anyscale, which is the company behind Ray. He did his PhD in machine learning and distributed systems in the computer science department at UC Berkeley. Before that, he majored in math at Harvard. Robert Nishihara 🔗 Sat 7:30 a.m. - 7:31 a.m. Intro speaker (In-person Intro) 🔗 Sat 7:31 a.m. - 8:00 a.m. Invited talk: Outracing Champion Gran Turismo Drivers with Deep Reinforcement Learning (talk) Many potential applications of artificial intelligence involve making real-time decisions in physical systems while interacting with humans. Automobile racing represents an extreme example of these conditions; drivers must execute complex tactical manoeuvres to pass or block opponents while operating their vehicles at their traction limits. Racing simulations, such as the PlayStation game Gran Turismo, faithfully reproduce the non-linear control challenges of real race cars while also encapsulating the complex multi-agent interactions. Here we describe how we trained agents for Gran Turismo that can compete with the world's best e-sports drivers.
We combine state-of-the-art, model-free, deep reinforcement learning algorithms with mixed-scenario training to learn an integrated control policy that combines exceptional speed with impressive tactics. In addition, we construct a reward function that enables the agent to be competitive while adhering to racing's important, but under-specified, sportsmanship rules. We demonstrate the capabilities of our agent, Gran Turismo Sophy, by winning a head-to-head competition against four of the world's best Gran Turismo drivers. By describing how we trained championship-level racers, we demonstrate the possibilities and challenges of using these techniques to control complex dynamical systems in domains where agents must respect imprecisely defined human norms. Bio: Dr. Peter Stone holds the Truchard Foundation Chair in Computer Science at the University of Texas at Austin. He is Associate Chair of the Computer Science Department, as well as Director of Texas Robotics. In 2013 he was awarded the University of Texas System Regents' Outstanding Teaching Award and in 2014 he was inducted into the UT Austin Academy of Distinguished Teachers, earning him the title of University Distinguished Teaching Professor. Professor Stone's research interests in Artificial Intelligence include machine learning (especially reinforcement learning), multiagent systems, and robotics. Professor Stone received his Ph.D in Computer Science in 1998 from Carnegie Mellon University. From 1999 to 2002 he was a Senior Technical Staff Member in the Artificial Intelligence Principles Research Department at AT&T Labs - Research. He is an Alfred P. Sloan Research Fellow, Guggenheim Fellow, AAAI Fellow, IEEE Fellow, AAAS Fellow, ACM Fellow, Fulbright Scholar, and 2004 ONR Young Investigator. In 2007 he received the prestigious IJCAI Computers and Thought Award, given biannually to the top AI researcher under the age of 35, and in 2016 he was awarded the ACM/SIGAI Autonomous Agents Research Award. 
Professor Stone co-founded Cogitai, Inc., a startup company focused on continual learning, in 2015, and currently serves as Executive Director of Sony AI America. Peter Stone 🔗 Sat 8:00 a.m. - 8:20 a.m. Coffee break 🔗 Sat 8:20 a.m. - 9:10 a.m. Panel RL Benchmarks (Panel) Minmin Chen · Pablo Samuel Castro · Caglar Gulcehre · Tony Jebara · Peter Stone 🔗 Sat 9:10 a.m. - 10:00 a.m. Panel RL Implementation (Panel) Xiaolin Ge · Alborz Geramifard · Kence Anderson · Craig Buhr · Robert Nishihara · Yuandong Tian 🔗 Sat 10:00 a.m. - 11:30 a.m. Lunch Break / Posters (Poster/Break) [] 🔗 Sat 11:31 a.m. - 12:00 p.m. Invited talk: AlphaTensor: Discovering faster matrix multiplication algorithms with RL (talk) Improving the efficiency of algorithms for fundamental computational tasks such as matrix multiplication can have widespread impact, as it affects the overall speed of a large number of computations. Automatic discovery of algorithms using ML offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. In this talk I’ll present AlphaTensor, our RL agent based on AlphaZero for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time since its discovery 50 years ago. I’ll present our problem formulation as a single-player game, the key ingredients that enable tackling such difficult mathematical problems using RL, and the flexibility of the AlphaTensor framework. Bio: Matej Balog is a Senior Research Scientist at DeepMind, working in the Science team on applications of AI to Maths and Computation.
Prior to joining DeepMind he worked on program synthesis and understanding, and was a PhD student at the University of Cambridge with Zoubin Ghahramani, working on general machine learning methodology, in particular on conversions between fundamental computational tasks such as integration, sampling, optimization, and search. Matej Balog 🔗 Sat 12:00 p.m. - 12:55 p.m. Panel RL Theory-Practice Gap (Panel) Peter Stone · Matej Balog · Jason Gauci · Dhruv Madeka 🔗 Sat 12:55 p.m. - 1:00 p.m. closing remarks 🔗 Sat 1:00 p.m. - 1:30 p.m. Coffee break / Posters 🔗 Sat 1:30 p.m. - 3:00 p.m. Posters 🔗 - An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning (Poster) []  We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. Alternatively to existing algorithms, we propose two simple algorithms that are more efficient statistically, simpler to implement and computationally cheaper. The first algorithm is based on a linear formulation of CMDP, and the second algorithm leverages the saddle-point formulation of CMDP. Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms. Danil Provodin · Pratik Gajane · Mykola Pechenizkiy · Maurits Kaptein 🔗 - An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning (Spotlight)    We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. Alternatively to existing algorithms, we propose two simple algorithms that are more efficient statistically, simpler to implement and computationally cheaper. The first algorithm is based on a linear formulation of CMDP, and the second algorithm leverages the saddle-point formulation of CMDP. 
Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms. Danil Provodin · Pratik Gajane · Mykola Pechenizkiy · Maurits Kaptein 🔗 - MARLIM: Multi-Agent Reinforcement Learning for Inventory Management (Poster) []     Maintaining a balance between the supply and demand of products by optimizing replenishment decisions is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM, to address the inventory management problem for a single-echelon multi-products supply chain with stochastic demands and lead-times. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines. Rémi Leluc · Elie Kadoche · Antoine Bertoncello · Sébastien Gourvénec 🔗 - MARLIM: Multi-Agent Reinforcement Learning for Inventory Management (Spotlight) Maintaining a balance between the supply and demand of products by optimizing replenishment decisions is one of the most important challenges in the supply chain industry. This paper presents a novel reinforcement learning framework called MARLIM, to address the inventory management problem for a single-echelon multi-products supply chain with stochastic demands and lead-times. Within this context, controllers are developed through single or multiple agents in a cooperative setting. Numerical experiments on real data demonstrate the benefits of reinforcement learning methods over traditional baselines. Rémi Leluc · Elie Kadoche · Antoine Bertoncello · Sébastien Gourvénec 🔗 - A Versatile and Efficient Reinforcement Learning Approach for Autonomous Driving (Poster) []  Heated debates continue over the best solution for autonomous driving. 
The classic modular pipeline is widely adopted in the industry owing to its great interpretability and stability, whereas the fully end-to-end paradigm has demonstrated considerable simplicity and learnability along with the rise of deep learning. As a way of marrying the advantages of both approaches, learning a semantically meaningful representation and then using it in the downstream driving policy learning tasks provides a viable and attractive solution. However, several key challenges remain to be addressed, including identifying the most effective representation, alleviating the sim-to-real generalization issue as well as balancing model training cost. In this study, we propose a versatile and efficient reinforcement learning approach and build a fully functional autonomous vehicle for real-world validation. Our method shows great generalizability to various complicated real-world scenarios and superior training efficiency against the competing baselines. Guan Wang · Haoyi Niu · desheng zhu · Jianming HU · Xianyuan Zhan · Guyue Zhou 🔗 - A Versatile and Efficient Reinforcement Learning Approach for Autonomous Driving (Spotlight)    Heated debates continue over the best solution for autonomous driving. The classic modular pipeline is widely adopted in the industry owing to its great interpretability and stability, whereas the fully end-to-end paradigm has demonstrated considerable simplicity and learnability along with the rise of deep learning. As a way of marrying the advantages of both approaches, learning a semantically meaningful representation and then using it in the downstream driving policy learning tasks provides a viable and attractive solution. However, several key challenges remain to be addressed, including identifying the most effective representation, alleviating the sim-to-real generalization issue as well as balancing model training cost. 
In this study, we propose a versatile and efficient reinforcement learning approach and build a fully functional autonomous vehicle for real-world validation. Our method shows great generalizability to various complicated real-world scenarios and superior training efficiency against the competing baselines. Guan Wang · Haoyi Niu · desheng zhu · Jianming HU · Xianyuan Zhan · Guyue Zhou 🔗 - Semi-analytical Industrial Cooling System Model for Reinforcement Learning (Poster) We present a hybrid industrial cooling system model that embeds analytical solutions within a multiphysics simulation. This model is designed for reinforcement learning (RL) applications and balances simplicity with simulation fidelity and interpretability. The model’s fidelity is evaluated against real-world data from a large-scale cooling system. This is followed by a case study illustrating how the model can be used for RL research. For this, we develop an industrial task suite that allows specifying different problem settings and levels of complexity, and use it to evaluate the performance of different RL algorithms. Yuri Chervonyi · Praneet Dutta 🔗 - Semi-analytical Industrial Cooling System Model for Reinforcement Learning (Spotlight) We present a hybrid industrial cooling system model that embeds analytical solutions within a multiphysics simulation. This model is designed for reinforcement learning (RL) applications and balances simplicity with simulation fidelity and interpretability. The model’s fidelity is evaluated against real-world data from a large-scale cooling system. This is followed by a case study illustrating how the model can be used for RL research. For this, we develop an industrial task suite that allows specifying different problem settings and levels of complexity, and use it to evaluate the performance of different RL algorithms.
Yuri Chervonyi · Praneet Dutta 🔗 - Structured Q-learning For Antibody Design (Poster) []     Optimizing combinatorial structures is core to many real-world problems, such as those encountered in life sciences. For example, one of the crucial steps involved in antibody design is to find an arrangement of amino acids in a protein sequence that improves its binding with a pathogen. Combinatorial optimization of antibodies is difficult due to extremely large search spaces and non-linear objectives. Even for modest antibody design problems, where proteins have a sequence length of eleven, we are faced with searching over $2.05 \times 10^{14}$ structures. Applying traditional Reinforcement Learning algorithms such as Q-learning to combinatorial optimization results in poor performance. We propose Structured Q-learning (SQL), an extension of Q-learning that incorporates structural priors for combinatorial optimization. Using a molecular docking simulator, we demonstrate that SQL finds high binding energy sequences and performs favourably against baselines on eight challenging antibody design tasks, including designing antibodies for SARS-COV. Alexander Cowen-Rivers · Philip John Gorinski · aivar sootla · Asif Khan · Jun WANG · Jan Peters · Haitham Bou Ammar 🔗 - Structured Q-learning For Antibody Design (Spotlight) Optimizing combinatorial structures is core to many real-world problems, such as those encountered in life sciences. For example, one of the crucial steps involved in antibody design is to find an arrangement of amino acids in a protein sequence that improves its binding with a pathogen. Combinatorial optimization of antibodies is difficult due to extremely large search spaces and non-linear objectives. Even for modest antibody design problems, where proteins have a sequence length of eleven, we are faced with searching over $2.05 \times 10^{14}$ structures. 
Applying traditional Reinforcement Learning algorithms such as Q-learning to combinatorial optimization results in poor performance. We propose Structured Q-learning (SQL), an extension of Q-learning that incorporates structural priors for combinatorial optimization. Using a molecular docking simulator, we demonstrate that SQL finds high binding energy sequences and performs favourably against baselines on eight challenging antibody design tasks, including designing antibodies for SARS-COV. Alexander Cowen-Rivers · Philip John Gorinski · aivar sootla · Asif Khan · Jun WANG · Jan Peters · Haitham Bou Ammar 🔗 - Hierarchical Reinforcement Learning for Furniture Layout in Virtual Indoor Scenes (Poster) [] In real life, the decoration of 3D indoor scenes through designing furniture layout provides a rich experience for people. In this paper, we explore the furniture layout task as a Markov decision process (MDP) in virtual reality, which is solved by hierarchical reinforcement learning (HRL). The goal is to produce a proper two-furniture layout in the virtual reality of the indoor scenes. In particular, we first design a simulation environment and introduce the HRL formulation for a two-furniture layout. We then apply a hierarchical actor-critic algorithm with curriculum learning to solve the MDP. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts as compared with the state-of-the-art models. Xinhan Di · Pengqian Yu 🔗 - Hierarchical Reinforcement Learning for Furniture Layout in Virtual Indoor Scenes (Spotlight) In real life, the decoration of 3D indoor scenes through designing furniture layout provides a rich experience for people. In this paper, we explore the furniture layout task as a Markov decision process (MDP) in virtual reality, which is solved by hierarchical reinforcement learning (HRL).
The goal is to produce a proper two-furniture layout in the virtual reality of the indoor scenes. In particular, we first design a simulation environment and introduce the HRL formulation for a two-furniture layout. We then apply a hierarchical actor-critic algorithm with curriculum learning to solve the MDP. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts as compared with the state-of-the-art models. Xinhan Di · Pengqian Yu 🔗 - Learning an Adaptive Forwarding Strategy for Mobile Wireless Networks: Resource Usage vs. Latency (Poster) [] Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: i) we use hierarchical RL to design DRL packet agents rather than device agents, to capture the packet forwarding decisions that are made over time and improve training efficiency; ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and iii) we design the DRL reward function to reflect both the packet forwarding goals and the resource considerations of the network; and we incorporate both the forwarding goals and network resource considerations into packet decision-making by designing a weighted DRL reward function. Our results show that our DRL agent often achieves a similar delay per packet delivered as the optimal forwarding strategy and outperforms all other strategies including state-of-the-art strategies, even on scenarios on which the DRL agent was not trained.
Victoria Manfredi · Alicia Wolfe · Xiaolan Zhang · Bing Wang 🔗 - Learning an Adaptive Forwarding Strategy for Mobile Wireless Networks: Resource Usage vs. Latency (Spotlight) Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: i) we use hierarchical RL to design DRL packet agents rather than device agents, to capture the packet forwarding decisions that are made over time and improve training efficiency; ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and iii) we design the DRL reward function to reflect both the packet forwarding goals and the resource considerations of the network; and we incorporate both the forwarding goals and network resource considerations into packet decision-making by designing a weighted DRL reward function. Our results show that our DRL agent often achieves a similar delay per packet delivered as the optimal forwarding strategy and outperforms all other strategies including state-of-the-art strategies, even on scenarios on which the DRL agent was not trained. Victoria Manfredi · Alicia Wolfe · Xiaolan Zhang · Bing Wang 🔗 - Safe Reinforcement Learning for Automatic Insulin Delivery in Type I Diabetes (Poster) Despite promising performances, reinforcement learning (RL) is only rarely applied when a high level of risk is implied. Glycemia control in type I diabetes is one such example: a variety of RL agents have been shown to accurately regulate insulin delivery and yet no real life application can be seen. For such applications, managing risk is the key.
In this paper, we use the evolution strategies algorithm to train a policy network for glycemia control: it has state-of-the-art results, and recovers, without any a priori knowledge, the basics of insulin therapy and blood sugar management. We propose a way to equip the policy network with an epistemic uncertainty measure which requires no further model training. We illustrate how this epistemic uncertainty estimate can be used to improve the safety of the device, paving the way for real life clinical trials. Maxime Louis · Hector Romero Ugalde · Pierre Gauthier · Alice Adenis · Yousra Tourki · Erik Huneker 🔗 - Safe Reinforcement Learning for Automatic Insulin Delivery in Type I Diabetes (Spotlight) Despite promising performances, reinforcement learning (RL) is only rarely applied when a high level of risk is implied. Glycemia control in type I diabetes is one such example: a variety of RL agents have been shown to accurately regulate insulin delivery and yet no real life application can be seen. For such applications, managing risk is the key. In this paper, we use the evolution strategies algorithm to train a policy network for glycemia control: it has state-of-the-art results, and recovers, without any a priori knowledge, the basics of insulin therapy and blood sugar management. We propose a way to equip the policy network with an epistemic uncertainty measure which requires no further model training. We illustrate how this epistemic uncertainty estimate can be used to improve the safety of the device, paving the way for real life clinical trials. Maxime Louis · Hector Romero Ugalde · Pierre Gauthier · Alice Adenis · Yousra Tourki · Erik Huneker 🔗 - Power Grid Congestion Management via Topology Optimization with AlphaZero (Poster) [] The energy sector is facing rapid changes in the transition towards clean renewable sources.
However, the growing share of volatile, fluctuating renewable generation such as wind or solar energy has already led to an increase in power grid congestion and network security concerns. Grid operators mitigate these by modifying either generation or demand (redispatching, curtailment, flexible loads). Unfortunately, redispatching of fossil generators leads to excessive grid operation costs and higher emissions, which is in direct opposition to the decarbonization of the energy sector. In this paper, we propose an AlphaZero-based grid topology optimization agent as a non-costly, carbon-free congestion management alternative. Our experimental evaluation confirms the potential of topology optimization for power grid operation, achieves a reduction of the average amount of required redispatching by 60% and shows the interoperability with traditional congestion management methods. Based on our findings, we identify and discuss open research problems as well as technical challenges for a productive system on a real power grid. Matthias Dorfer · Anton R. Fuxjaeger · Kristián Kozák · Patrick Blies · Marcel Wasserer 🔗 - Power Grid Congestion Management via Topology Optimization with AlphaZero (Spotlight) The energy sector is facing rapid changes in the transition towards clean renewable sources. However, the growing share of volatile, fluctuating renewable generation such as wind or solar energy has already led to an increase in power grid congestion and network security concerns. Grid operators mitigate these by modifying either generation or demand (redispatching, curtailment, flexible loads). Unfortunately, redispatching of fossil generators leads to excessive grid operation costs and higher emissions, which is in direct opposition to the decarbonization of the energy sector. In this paper, we propose an AlphaZero-based grid topology optimization agent as a non-costly, carbon-free congestion management alternative.
Our experimental evaluation confirms the potential of topology optimization for power grid operation, achieves a reduction of the average amount of required redispatching by 60% and shows the interoperability with traditional congestion management methods. Based on our findings, we identify and discuss open research problems as well as technical challenges for a productive system on a real power grid. Matthias Dorfer · Anton R. Fuxjaeger · Kristián Kozák · Patrick Blies · Marcel Wasserer 🔗 - Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management (Poster) [] In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on the shared resources (such as the inventory capacity) couples the otherwise independent control for each SKU. We formulate the problem with this structure as Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms. Yuandong Ding · Mingxiao Feng · Guozi Liu · Wei Jiang · Chuheng Zhang · Li Zhao · Lei Song · Houqiang Li · Yan Jin · Jiang Bian 🔗 - Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management (Spotlight) In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on the shared resources (such as the inventory capacity) couples the otherwise independent control for each SKU. We formulate the problem with this structure as Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO).
Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms. Yuandong Ding · Mingxiao Feng · Guozi Liu · Wei Jiang · Chuheng Zhang · Li Zhao · Lei Song · Houqiang Li · Yan Jin · Jiang Bian 🔗 - LibSignal: An Open Library for Traffic Signal Control (Poster) [] This paper introduces a library for cross-simulator comparison of reinforcement learning models in traffic signal control tasks. This library is developed to implement recent state-of-the-art reinforcement learning models with extensible interfaces and unified cross-simulator evaluation metrics. It supports commonly-used simulators in traffic signal control tasks, including Simulation of Urban Mobility (SUMO) and CityFlow, and multiple benchmark datasets for fair comparisons. We conducted experiments to validate our implementation of the models and to calibrate the simulators so that the experiments from one simulator could be referential to the other. Based on the validated models and calibrated environments, this paper compares and reports the performance of current state-of-the-art RL algorithms across different datasets and simulators. This is the first time that these methods have been compared fairly under the same datasets with different simulators. Hao Mei · Xiaoliang Lei · Longchao Da · Bin Shi · Hua Wei 🔗 - LibSignal: An Open Library for Traffic Signal Control (Spotlight) This paper introduces a library for cross-simulator comparison of reinforcement learning models in traffic signal control tasks. This library is developed to implement recent state-of-the-art reinforcement learning models with extensible interfaces and unified cross-simulator evaluation metrics. It supports commonly-used simulators in traffic signal control tasks, including Simulation of Urban Mobility (SUMO) and CityFlow, and multiple benchmark datasets for fair comparisons.
We conducted experiments to validate our implementation of the models and to calibrate the simulators so that the experiments from one simulator could be referential to the other. Based on the validated models and calibrated environments, this paper compares and reports the performance of current state-of-the-art RL algorithms across different datasets and simulators. This is the first time that these methods have been compared fairly under the same datasets with different simulators. Hao Mei · Xiaoliang Lei · Longchao Da · Bin Shi · Hua Wei 🔗 - Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs (Poster) []     Cloud datacenters are exponentially growing both in numbers and size. This increase results in a network activity surge that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter; but, at the same time (ii) run on low-level hardware with the required low latency of effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the desired $\mu$sec decision latency required for real-time inference. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and drops. Benjamin Fuhrer · Yuval Shpigelman · Chen Tessler · Shie Mannor · Gal Chechik · Eitan Zahavi · Gal Dalal 🔗 - Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs (Spotlight) Cloud datacenters are exponentially growing both in numbers and size. This increase results in a network activity surge that warrants better congestion avoidance. 
The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter; but, at the same time (ii) run on low-level hardware with the required low latency of effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the desired $\mu$sec decision latency required for real-time inference. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and drops. Benjamin Fuhrer · Yuval Shpigelman · Chen Tessler · Shie Mannor · Gal Chechik · Eitan Zahavi · Gal Dalal 🔗 - Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization (Poster) Online influence maximization aims to maximize the influence spread of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm. 
Kaixuan Huang · Yu Wu · Xuezhou Zhang · Shenyinying Tu · Qingyun Wu · Mengdi Wang · Huazheng Wang 🔗 - Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization (Spotlight) Online influence maximization aims to maximize the influence spread of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm. Kaixuan Huang · Yu Wu · Xuezhou Zhang · Shenyinying Tu · Qingyun Wu · Mengdi Wang · Huazheng Wang 🔗 - Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning (Poster) Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the F1 score directly instead of the error rate. However, the F1 score cannot be written as a cumulative sum of rewards, which invalidates standard RL methods. 
To remedy this issue, we develop a reward-shaping approach, leveraging properties of the F1 score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained F1 score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on clinical tasks: ferritin prediction, sepsis prevention, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all three tasks, SM-DDPO is able to achieve state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to 85% reduction in testing cost. Zheng Yu · Yikuan Li · Joseph Kim · Kaixuan Huang · Yuan Luo · Mengdi Wang 🔗 - Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning (Spotlight) Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the F1 score directly instead of the error rate. However, the F1 score cannot be written as a cumulative sum of rewards, which invalidates standard RL methods. To remedy this issue, we develop a reward-shaping approach, leveraging properties of the F1 score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained F1 score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. 
SM-DDPO is tested on clinical tasks: ferritin prediction, sepsis prevention, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all three tasks, SM-DDPO is able to achieve state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to 85% reduction in testing cost. Zheng Yu · Yikuan Li · Joseph Kim · Kaixuan Huang · Yuan Luo · Mengdi Wang 🔗 - tinyMAN: Lightweight Energy Manager using Reinforcement Learning for Energy Harvesting Wearable IoT Devices (Poster) Advances in low-power electronics and machine learning techniques lead to many novel wearable IoT devices. These devices have limited battery capacity and computational power. Thus, energy harvesting from ambient sources is a promising solution to power these low-energy wearable devices. They need to manage the harvested energy optimally to achieve energy-neutral operation, which eliminates recharging requirements. Optimal energy management is a challenging task due to the dynamic nature of the harvested energy and the battery energy constraints of the target device. To address this challenge, we present a reinforcement learning based energy management framework, tinyMAN, for resource-constrained wearable IoT devices. The framework maximizes the utilization of the target device under dynamic energy harvesting patterns and battery constraints. Moreover, tinyMAN does not rely on forecasts of the harvested energy, which makes it a prediction-free approach. We deployed tinyMAN on a wearable device prototype using TensorFlow Lite for Micro thanks to its small memory footprint of less than 100 KB. Our evaluations show that tinyMAN achieves less than 2.36 ms and 27.75 μJ while maintaining up to 45% higher utility compared to prior approaches. 
Toygun Basaklar · Yigit Tuncel · Umit Ogras 🔗 - tinyMAN: Lightweight Energy Manager using Reinforcement Learning for Energy Harvesting Wearable IoT Devices (Spotlight) Advances in low-power electronics and machine learning techniques lead to many novel wearable IoT devices. These devices have limited battery capacity and computational power. Thus, energy harvesting from ambient sources is a promising solution to power these low-energy wearable devices. They need to manage the harvested energy optimally to achieve energy-neutral operation, which eliminates recharging requirements. Optimal energy management is a challenging task due to the dynamic nature of the harvested energy and the battery energy constraints of the target device. To address this challenge, we present a reinforcement learning based energy management framework, tinyMAN, for resource-constrained wearable IoT devices. The framework maximizes the utilization of the target device under dynamic energy harvesting patterns and battery constraints. Moreover, tinyMAN does not rely on forecasts of the harvested energy, which makes it a prediction-free approach. We deployed tinyMAN on a wearable device prototype using TensorFlow Lite for Micro thanks to its small memory footprint of less than 100 KB. Our evaluations show that tinyMAN achieves less than 2.36 ms and 27.75 μJ while maintaining up to 45% higher utility compared to prior approaches. Toygun Basaklar · Yigit Tuncel · Umit Ogras 🔗 - Optimizing Audio Recommendations for the Long-Term (Poster) We study the problem of optimizing recommender systems for outcomes that realize over several weeks or months. Successfully addressing this problem requires overcoming difficult statistical and organizational challenges. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationship with a recommender system. 
We then identify a few key assumptions that lead to simple, testable recommender system prototypes that explicitly optimize for the long-term. We apply our approach to a podcast recommender system at a large online audio streaming service, and we demonstrate that purposefully optimizing for long-term outcomes can lead to substantial performance gains over approaches optimizing for short-term proxies. Lucas Maystre · Daniel Russo · Yu Zhao 🔗 - Optimizing Audio Recommendations for the Long-Term (Spotlight) We study the problem of optimizing recommender systems for outcomes that realize over several weeks or months. Successfully addressing this problem requires overcoming difficult statistical and organizational challenges. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationship with a recommender system. We then identify a few key assumptions that lead to simple, testable recommender system prototypes that explicitly optimize for the long-term. We apply our approach to a podcast recommender system at a large online audio streaming service, and we demonstrate that purposefully optimizing for long-term outcomes can lead to substantial performance gains over approaches optimizing for short-term proxies. Lucas Maystre · Daniel Russo · Yu Zhao 🔗 - Controlling Commercial Cooling Systems Using Reinforcement Learning (Poster) []  This paper is a technical overview of our recent work on reinforcement learning for controlling commercial cooling systems. Building on previous work on cooling data centers more efficiently, we recently conducted two live experiments in partnership with a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. 
We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites. Jerry Luo · Cosmin Paduraru · Octavian Voicu · Yuri Chervonyi · Scott Munns · Jerry Li · Crystal Qian · Praneet Dutta · Daniel Mankowitz · Jared Quincy Davis · Ningjia Wu · Xingwei Yang · Chu-Ming Chang · Ted Li · Rob Rose · Mingyan Fan · Hootan Nakhost · Tinglin Liu · Deeni Fatiha · Neil Satra · Juliet Rothenberg · Molly Carlin · Satish Tallapaka · Sims Witherspoon · David Parish · Peter Dolan · Chenyu Zhao 🔗 - Controlling Commercial Cooling Systems Using Reinforcement Learning (Spotlight)    This paper is a technical overview of our recent work on reinforcement learning for controlling commercial cooling systems. Building on previous work on cooling data centers more efficiently, we recently conducted two live experiments in partnership with a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites. 
Jerry Luo · Cosmin Paduraru · Octavian Voicu · Yuri Chervonyi · Scott Munns · Jerry Li · Crystal Qian · Praneet Dutta · Daniel Mankowitz · Jared Quincy Davis · Ningjia Wu · Xingwei Yang · Chu-Ming Chang · Ted Li · Rob Rose · Mingyan Fan · Hootan Nakhost · Tinglin Liu · Deeni Fatiha · Neil Satra · Juliet Rothenberg · Molly Carlin · Satish Tallapaka · Sims Witherspoon · David Parish · Peter Dolan · Chenyu Zhao 🔗 - Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response (Poster) []     To integrate high amounts of renewable energy resources, power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained by multi-agent proximal policy optimization with localized communication. We show that the resulting policy leads to good and robust performance for frequency regulation and scales seamlessly to arbitrary numbers of houses for constant processing times, where classical methods fail. Vincent Mai · Philippe Maisonneuve · Tianyu Zhang · Jorge Montalvo Arvizu · Liam Paull · Antoine Lesage-Landry 🔗 - Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response (Spotlight) To integrate high amounts of renewable energy resources, power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. 
Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained by multi-agent proximal policy optimization with localized communication. We show that the resulting policy leads to good and robust performance for frequency regulation and scales seamlessly to arbitrary numbers of houses for constant processing times, where classical methods fail. Vincent Mai · Philippe Maisonneuve · Tianyu Zhang · Jorge Montalvo Arvizu · Liam Paull · Antoine Lesage-Landry 🔗 - Identifying Disparities in Sepsis Treatment by Learning the Expert Policy (Poster)    Sepsis is a life-threatening condition defined by end-organ dysfunction due to a dysregulated host response to infection. Sepsis has been the focus of intense research in the field of machine learning with the primary aim being the ability to predict the onset of disease and to identify the optimal treatment policies for this complex condition. Here, we apply a number of reinforcement learning techniques including behavioral cloning, imitation learning, and inverse reinforcement learning, to learn the optimal policy in the management of septic patients using expert demonstrations. Then we estimate the counterfactual optimal policies by applying the model to another subset of unseen medical populations and identify the difference in cure by comparing it to the real policy. Our data comes from the sepsis cohort of MIMIC-IV and the clinical data warehouses of the Mass General Brigham healthcare system. The ultimate objective of this work is to use the optimal reward function to estimate the counterfactual treatment policy and identify deviations across sub-populations of interest. We hope this approach would help us identify any disparities in care and also changes in cure in response to the publication of national sepsis treatment guidelines. 
Hyewon Jeong · Siddharth Nayak · Taylor Killian · Sanjat Kanjilal · Marzyeh Ghassemi 🔗 - Identifying Disparities in Sepsis Treatment by Learning the Expert Policy (Spotlight) Sepsis is a life-threatening condition defined by end-organ dysfunction due to a dysregulated host response to infection. Sepsis has been the focus of intense research in the field of machine learning with the primary aim being the ability to predict the onset of disease and to identify the optimal treatment policies for this complex condition. Here, we apply a number of reinforcement learning techniques including behavioral cloning, imitation learning, and inverse reinforcement learning, to learn the optimal policy in the management of septic patients using expert demonstrations. Then we estimate the counterfactual optimal policies by applying the model to another subset of unseen medical populations and identify the difference in cure by comparing it to the real policy. Our data comes from the sepsis cohort of MIMIC-IV and the clinical data warehouses of the Mass General Brigham healthcare system. The ultimate objective of this work is to use the optimal reward function to estimate the counterfactual treatment policy and identify deviations across sub-populations of interest. We hope this approach would help us identify any disparities in care and also changes in cure in response to the publication of national sepsis treatment guidelines. Hyewon Jeong · Siddharth Nayak · Taylor Killian · Sanjat Kanjilal · Marzyeh Ghassemi 🔗 - Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms (Poster) We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. 
A key challenge is that violation trends change over time, affecting which risk models are most reliable. Our system additionally handles production challenges such as changing risk models and novel risk models. We use a contextual bandit to update the calibration in response to such trends. Our approach increases Meta's top-line metric for measuring the effectiveness of its content moderation strategy by 13%. Vashist Avadhanula · Omar Abdul Baki · Hamsa Bastani · Osbert Bastani · Caner Gocmen · Daniel Haimovich · Darren Hwang · Dmytro Karamshuk · Thomas Leeper · Jiayuan Ma · Gregory macnamara · Jake Mullet · Christopher Palow · Sung Park · Varun S Rajagopal · Kevin Schaeffer · Parikshit Shah · Deeksha Sinha · Nicolas Stier-Moses · Ben Xu 🔗 - Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms (Spotlight)    We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. A key challenge is that violation trends change over time, affecting which risk models are most reliable. Our system additionally handles production challenges such as changing risk models and novel risk models. We use a contextual bandit to update the calibration in response to such trends. Our approach increases Meta's top-line metric for measuring the effectiveness of its content moderation strategy by 13%. 
Vashist Avadhanula · Omar Abdul Baki · Hamsa Bastani · Osbert Bastani · Caner Gocmen · Daniel Haimovich · Darren Hwang · Dmytro Karamshuk · Thomas Leeper · Jiayuan Ma · Gregory macnamara · Jake Mullet · Christopher Palow · Sung Park · Varun S Rajagopal · Kevin Schaeffer · Parikshit Shah · Deeksha Sinha · Nicolas Stier-Moses · Ben Xu 🔗 - Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies (Poster) Advancements in reinforcement learning (RL) have inspired new directions in intelligent automation of network defense. However, many of these advancements have either outpaced their application to network security or have not considered the challenges associated with implementing them in the real-world. To understand these problems, this work evaluates several RL approaches implemented in the CAGE Challenge 2, a public competition to build an autonomous network defender agent in a high-fidelity network simulator. Our approaches all build on the Proximal Policy Optimization (PPO) family of algorithms, and include hierarchical RL, action masking, custom training, and ensemble RL. We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition. To understand applicability to real environments we evaluate each method's ability to generalize to unseen networks and against an unknown attack strategy. In unseen environments, all of our approaches perform worse, with degradation varied based on the type of environmental change. Against an unknown attacker strategy, we found that our models had reduced overall performance even though the new strategy was in fact less efficient than the ones our models trained on. Taken together, these results highlight promising research directions towards autonomous network defense in the real world. 
Melody Wolk · Andy Applebaum · Camron Dennler · Patrick Dwyer · Marina Moskowitz · Harold Nguyen · Nicole Nichols · Nicole Park · Paul Rachwalski · Frank Rau · Adrian Webster 🔗 - Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies (Spotlight)    Advancements in reinforcement learning (RL) have inspired new directions in intelligent automation of network defense. However, many of these advancements have either outpaced their application to network security or have not considered the challenges associated with implementing them in the real-world. To understand these problems, this work evaluates several RL approaches implemented in the CAGE Challenge 2, a public competition to build an autonomous network defender agent in a high-fidelity network simulator. Our approaches all build on the Proximal Policy Optimization (PPO) family of algorithms, and include hierarchical RL, action masking, custom training, and ensemble RL. We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition. To understand applicability to real environments we evaluate each method's ability to generalize to unseen networks and against an unknown attack strategy. In unseen environments, all of our approaches perform worse, with degradation varied based on the type of environmental change. Against an unknown attacker strategy, we found that our models had reduced overall performance even though the new strategy was in fact less efficient than the ones our models trained on. Taken together, these results highlight promising research directions towards autonomous network defense in the real world. 
Melody Wolk · Andy Applebaum · Camron Dennler · Patrick Dwyer · Marina Moskowitz · Harold Nguyen · Nicole Nichols · Nicole Park · Paul Rachwalski · Frank Rau · Adrian Webster 🔗 - Optimizing Industrial HVAC Systems with Hierarchical Reinforcement Learning (Poster)    Reinforcement learning (RL) techniques have been developed to optimize industrial cooling systems, offering substantial energy savings compared to traditional heuristic policies. A major challenge in industrial control involves learning behaviors that are feasible in the real world due to machinery constraints. For example, certain actions can only be executed every few hours while other actions can be taken more frequently. Without extensive reward engineering and experimentation, an RL agent may not learn realistic operation of machinery. To address this, we use hierarchical reinforcement learning with multiple agents that control subsets of actions according to their operation time scales. Our hierarchical approach achieves energy savings over existing baselines while maintaining constraints such as operating chillers within safe bounds in a simulated HVAC control environment. William Wong · Praneet Dutta · Octavian Voicu · Yuri Chervonyi · Cosmin Paduraru · Jerry Luo 🔗 - Optimizing Industrial HVAC Systems with Hierarchical Reinforcement Learning (Spotlight) Reinforcement learning (RL) techniques have been developed to optimize industrial cooling systems, offering substantial energy savings compared to traditional heuristic policies. A major challenge in industrial control involves learning behaviors that are feasible in the real world due to machinery constraints. For example, certain actions can only be executed every few hours while other actions can be taken more frequently. Without extensive reward engineering and experimentation, an RL agent may not learn realistic operation of machinery. 
To address this, we use hierarchical reinforcement learning with multiple agents that control subsets of actions according to their operation time scales. Our hierarchical approach achieves energy savings over existing baselines while maintaining constraints such as operating chillers within safe bounds in a simulated HVAC control environment. William Wong · Praneet Dutta · Octavian Voicu · Yuri Chervonyi · Cosmin Paduraru · Jerry Luo 🔗 - Reinforcement Learning Approaches for Traffic Signal Control under Missing Data (Poster) []     Traffic signal control is critical in improving transportation efficiency and alleviating traffic congestion. In recent years, the emergence of deep reinforcement learning (RL) methods in traffic signal control tasks has achieved better performance than conventional rule-based approaches. Most RL approaches require the observation of the environment for the agent to decide which action is optimal for a long-term reward. However, in real-world urban scenarios, missing observation of traffic states may frequently occur due to the lack of sensors, which makes existing RL methods inapplicable on road networks with missing observation. In this work, we aim to control the traffic signals under the real-world setting, where some of the intersections in the road network are not installed with sensors and thus with no direct observations around them. Specifically, we propose and investigate two types of approaches: the first approach imputes the traffic states to enable adaptive control, while the second approach imputes both states and rewards to enable not only adaptive control but also the training of RL agents as well. Through extensive experiments on both synthetic and real-world road network traffic, we reveal that imputation can help the application of RL methods on intersections without observations, while the position of intersections without observation can largely influence the performance of RL agents. 
Hao Mei · Junxian Li · Bin Shi · Hua Wei 🔗 - Reinforcement Learning Approaches for Traffic Signal Control under Missing Data (Spotlight) Traffic signal control is critical in improving transportation efficiency and alleviating traffic congestion. In recent years, the emergence of deep reinforcement learning (RL) methods in traffic signal control tasks has achieved better performance than conventional rule-based approaches. Most RL approaches require the observation of the environment for the agent to decide which action is optimal for a long-term reward. However, in real-world urban scenarios, missing observation of traffic states may frequently occur due to the lack of sensors, which makes existing RL methods inapplicable on road networks with missing observation. In this work, we aim to control the traffic signals under the real-world setting, where some of the intersections in the road network are not installed with sensors and thus with no direct observations around them. Specifically, we propose and investigate two types of approaches: the first approach imputes the traffic states to enable adaptive control, while the second approach imputes both states and rewards to enable not only adaptive control but also the training of RL agents. Through extensive experiments on both synthetic and real-world road network traffic, we reveal that imputation can help the application of RL methods on intersections without observations, while the position of intersections without observation can largely influence the performance of RL agents. Hao Mei · Junxian Li · Bin Shi · Hua Wei 🔗 - Reinforcement Learning-Based Air Traffic Deconfliction (Poster) Remain Well Clear, keeping the aircraft away from hazards by the appropriate separation distance, is an essential technology for the safe operation of uncrewed aerial vehicles in congested airspace. 
This work focuses on automating the horizontal separation of two aircraft and presents the obstacle avoidance problem as a 2D surrogate optimization task. By our design, the surrogate task is made more conservative to guarantee the execution of the solution in the primary domain. Using Reinforcement Learning (RL), we optimize the avoidance policy and model the dynamics, interactions, and decision-making. By recursively sampling the resulting policy and the surrogate transitions, the system translates the avoidance policy into a complete avoidance trajectory. Then, the solver publishes the trajectory as a set of waypoints for the airplane to follow using the Robot Operating System (ROS) interface. The proposed system generates a quick and achievable avoidance trajectory that satisfies the safety requirements. Evaluation of our system is completed in a high-fidelity simulation and full-scale airplane demonstration. Moreover, the paper concludes an enormous integration effort that has enabled a real-life demonstration of the RL-based system. Denis Osipychev · Dragos Margineantu 🔗 - Reinforcement Learning-Based Air Traffic Deconfliction (Spotlight) Remain Well Clear, keeping the aircraft away from hazards by the appropriate separation distance, is an essential technology for the safe operation of uncrewed aerial vehicles in congested airspace. This work focuses on automating the horizontal separation of two aircraft and presents the obstacle avoidance problem as a 2D surrogate optimization task. By our design, the surrogate task is made more conservative to guarantee the execution of the solution in the primary domain. Using Reinforcement Learning (RL), we optimize the avoidance policy and model the dynamics, interactions, and decision-making. By recursively sampling the resulting policy and the surrogate transitions, the system translates the avoidance policy into a complete avoidance trajectory. 
Then, the solver publishes the trajectory as a set of waypoints for the airplane to follow using the Robot Operating System (ROS) interface. The proposed system generates a quick and achievable avoidance trajectory that satisfies the safety requirements. Evaluation of our system is completed in a high-fidelity simulation and full-scale airplane demonstration. Moreover, the paper concludes an enormous integration effort that has enabled a real-life demonstration of the RL-based system. Denis Osipychev · Dragos Margineantu 🔗 - Automatic Evaluation of Excavator Operators using Learned Reward Functions (Poster) Training novice users to operate an excavator for learning different skills requires the presence of expert teachers. Considering the complexity of the problem, it is comparatively expensive to find skilled experts as the process is time-consuming and requires precise focus. Moreover, since humans tend to be biased, the evaluation process is noisy and will lead to high variance in the final score of different operators with similar skills. In this work, we address these issues and propose a novel strategy for automatic evaluation of excavator operators. We take into account the internal dynamics of the excavator and the safety criterion at every time step to evaluate the performance. To further validate our approach, we use this score prediction model as a source of reward for a reinforcement learning agent to learn the task of maneuvering an excavator in a simulated environment that closely replicates the real-world dynamics. For a policy learned using these external reward prediction models, our results demonstrate safer solutions following the required dynamic constraints when compared to a policy trained with goal-based reward functions only, making it one step closer to real-life adoption. 
Pranav Agarwal · Marek Teichmann · Sheldon Andrews · Samira Ebrahimi Kahou 🔗 - Automatic Evaluation of Excavator Operators using Learned Reward Functions (Spotlight) Training novice users to operate an excavator for learning different skills requires the presence of expert teachers. Considering the complexity of the problem, it is comparatively expensive to find skilled experts as the process is time-consuming and requires precise focus. Moreover, since humans tend to be biased, the evaluation process is noisy and will lead to high variance in the final score of different operators with similar skills. In this work, we address these issues and propose a novel strategy for automatic evaluation of excavator operators. We take into account the internal dynamics of the excavator and the safety criterion at every time step to evaluate the performance. To further validate our approach, we use this score prediction model as a source of reward for a reinforcement learning agent to learn the task of maneuvering an excavator in a simulated environment that closely replicates the real-world dynamics. For a policy learned using these external reward prediction models, our results demonstrate safer solutions following the required dynamic constraints when compared to a policy trained with goal-based reward functions only, making it one step closer to real-life adoption. Pranav Agarwal · Marek Teichmann · Sheldon Andrews · Samira Ebrahimi Kahou 🔗 - Function Approximations for Reinforcement Learning Controller for Wave Energy Converters (Poster) The industrial Wave Energy Converters (WEC) have evolved into complex multi-generator designs, but a lack of effective control has limited their potential for higher energy capture efficiency. The Multi-Agent Reinforcement Learning (MARL) controller can handle these complexities and support multiple objectives of energy capture efficiency, reduction of structural stress, and proactive protection against high waves. 
However, even with well-trained agent algorithms like Proximal Policy Optimization (PPO), MARL is limited in performance. In this paper, we explore different function approximations for the policy and critic networks in modeling the sequential nature of the system dynamics and find that they are key to better performance. We investigated the performance of fully connected neural networks (FCN), LSTMs, and the Transformer model variants with varying depths. We propose a novel transformer architecture, Skip Transformer-XL (STrXL), with gated residual connections around the multi-head attention, multi-layer perceptron, and transformer block. Our results suggest that STrXL performed best and beat the state-of-the-art GTrXL with faster training convergence. STrXL boosts energy efficiency by an average of 25% to 28% for the entire wave spectrum over the existing spring damper (SD) controller for waves at different angles. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion. Soumyendu Sarkar · Vineet Gundecha · Alexander Shmakov · Sahand Ghorbanpour · Ashwin Ramesh Babu · Alexandre Pichard · Mathieu Cocho 🔗 - Function Approximations for Reinforcement Learning Controller for Wave Energy Converters (Spotlight)    The industrial Wave Energy Converters (WEC) have evolved into complex multi-generator designs, but a lack of effective control has limited their potential for higher energy capture efficiency. The Multi-Agent Reinforcement Learning (MARL) controller can handle these complexities and support multiple objectives of energy capture efficiency, reduction of structural stress, and proactive protection against high waves. However, even with well-trained agent algorithms like Proximal Policy Optimization (PPO), MARL is limited in performance. 
In this paper, we explore different function approximations for the policy and critic networks in modeling the sequential nature of the system dynamics and find that they are key to better performance. We investigated the performance of fully connected neural networks (FCN), LSTMs, and the Transformer model variants with varying depths. We propose a novel transformer architecture, Skip Transformer-XL (STrXL), with gated residual connections around the multi-head attention, multi-layer perceptron, and transformer block. Our results suggest that STrXL performed best and beat the state-of-the-art GTrXL with faster training convergence. STrXL boosts energy efficiency by an average of 25% to 28% for the entire wave spectrum over the existing spring damper (SD) controller for waves at different angles. Furthermore, unlike the default SD controller, the transformer controller almost eliminated the mechanical stress from the rotational yaw motion. Soumyendu Sarkar · Vineet Gundecha · Alexander Shmakov · Sahand Ghorbanpour · Ashwin Ramesh Babu · Alexandre Pichard · Mathieu Cocho 🔗