Timezone: »
When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong, to behave morally. By contrast, artificial agents may behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, mitigating inherited biases towards immoral behavior will become necessary. However, prior work on aligning agents with human values and morals focuses on small-scale settings lacking in semantic complexity. To enable research in larger, more realistic settings, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of semantically rich, morally salient scenarios. Via dense annotations for every possible action, Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. To improve moral behavior, we leverage language models with commonsense moral knowledge and develop strategies to mediate this knowledge into actions. In extensive experiments, we find that our artificial conscience approach can steer agents towards moral behavior without sacrificing performance.
Author Information
Dan Hendrycks (UC Berkeley)
Mantas Mazeika (University of Illinois Urbana-Champaign)
Andy Zou (UC Berkeley)
Sahil Patel (University of California, Berkeley)
Christine Zhu
Jesus Navarro
Dawn Song (UC Berkeley)
Bo Li (UIUC)
Jacob Steinhardt (UC Berkeley)
More from the Same Authors
-
2021 : CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review »
Dan Hendrycks · Collin Burns · Anya Chen · Spencer Ball -
2021 Spotlight: Learning Equilibria in Matching Markets from Bandit Feedback »
Meena Jagadeesan · Alexander Wei · Yixin Wang · Michael Jordan · Jacob Steinhardt -
2021 : Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li -
2021 : Measuring Coding Challenge Competence With APPS »
Dan Hendrycks · Steven Basart · Saurav Kadavath · Mantas Mazeika · Akul Arora · Ethan Guo · Collin Burns · Samir Puranik · Horace He · Dawn Song · Jacob Steinhardt -
2021 : Certified Robustness for Free in Differentially Private Federated Learning »
Chulin Xie · Yunhui Long · Pin-Yu Chen · Krishnaram Kenthapadi · Bo Li -
2021 : RVFR: Robust Vertical Federated Learning via Feature Subspace Recovery »
Jing Liu · Chulin Xie · Krishnaram Kenthapadi · Sanmi Koyejo · Bo Li -
2021 : PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures »
Dan Hendrycks · Andy Zou · Mantas Mazeika · Leonard Tang · Dawn Song · Jacob Steinhardt -
2021 : Effect of Model Size on Worst-group Generalization »
Alan Pham · Eunice Chan · Vikranth Srivatsa · Dhruba Ghosh · Yaoqing Yang · Yaodong Yu · Ruiqi Zhong · Joseph Gonzalez · Jacob Steinhardt -
2021 : The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models »
Alexander Pan · Kush Bhatia · Jacob Steinhardt -
2021 : Measuring Mathematical Problem Solving With the MATH Dataset »
Dan Hendrycks · Collin Burns · Saurav Kadavath · Akul Arora · Steven Basart · Eric Tang · Dawn Song · Jacob Steinhardt -
2021 : Live panel: Perspectives on ImageNet. »
Dawn Song · Ross Wightman · Dan Hendrycks -
2021 : Using ImageNet to Measure Robustness and Uncertainty »
Dawn Song · Dan Hendrycks -
2021 : Career and Life: Panel Discussion - Bo Li, Adriana Romero-Soriano, Devi Parikh, and Emily Denton »
Emily Denton · Devi Parikh · Bo Li · Adriana Romero -
2021 : Live Q&A with Bo Li »
Bo Li -
2021 : Invited talk – Trustworthy Machine Learning via Logic Inference, Bo Li »
Bo Li -
2021 : Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models »
Boxin Wang · Chejian Xu · Shuohang Wang · Zhe Gan · Yu Cheng · Jianfeng Gao · Ahmed Awadallah · Bo Li -
2021 Poster: Grounding Representation Similarity Through Statistical Testing »
Frances Ding · Jean-Stanislas Denain · Jacob Steinhardt -
2021 Poster: Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages »
Xinyun Chen · Dawn Song · Yuandong Tian -
2021 Poster: G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators »
Yunhui Long · Boxin Wang · Zhuolin Yang · Bhavya Kailkhura · Aston Zhang · Carl Gunter · Bo Li -
2021 Poster: Anti-Backdoor Learning: Training Clean Models on Poisoned Data »
Yige Li · Xixiang Lyu · Nodens Koren · Lingjuan Lyu · Bo Li · Xingjun Ma -
2021 Poster: Adversarial Attack Generation Empowered by Min-Max Optimization »
Jingkang Wang · Tianyun Zhang · Sijia Liu · Pin-Yu Chen · Jiacen Xu · Makan Fardad · Bo Li -
2021 : Reconnaissance Blind Chess + Q&A »
Ryan Gardner · Gino Perrotta · Corey Lowman · Casey Richardson · Andrew Newman · Jared Markowitz · Nathan Drenkow · Bart Paulhamus · Ashley J Llorens · Todd Neller · Raman Arora · Bo Li · Mykel J Kochenderfer -
2021 : VisDA21: Visual Domain Adaptation + Q&A »
Kate Saenko · Kuniaki Saito · Donghyun Kim · Samarth Mishra · Ben Usman · Piotr Teterwak · Dina Bashkirova · Dan Hendrycks -
2021 Poster: Learning Equilibria in Matching Markets from Bandit Feedback »
Meena Jagadeesan · Alexander Wei · Yixin Wang · Michael Jordan · Jacob Steinhardt -
2021 Poster: TRS: Transferability Reduced Ensemble via Promoting Gradient Diversity and Model Smoothness »
Zhuolin Yang · Linyi Li · Xiaojun Xu · Shiliang Zuo · Qian Chen · Pan Zhou · Benjamin Rubinstein · Ce Zhang · Bo Li -
2021 Poster: Adversarial Examples for k-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams »
Chawin Sitawarin · Evgenios Kornaropoulos · Dawn Song · David Wagner -
2020 Workshop: Workshop on Dataset Curation and Security »
Nathalie Baracaldo Angel · Yonatan Bisk · Avrim Blum · Michael Curry · John Dickerson · Micah Goldblum · Tom Goldstein · Bo Li · Avi Schwarzschild -
2020 Poster: Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis »
Kavi Gupta · Peter Ebert Christensen · Xinyun Chen · Dawn Song -
2020 Poster: Compositional Generalization via Neural-Symbolic Stack Machines »
Xinyun Chen · Chen Liang · Adams Wei Yu · Dawn Song · Denny Zhou -
2020 Poster: Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations »
Huan Zhang · Hongge Chen · Chaowei Xiao · Bo Li · Mingyan Liu · Duane Boning · Cho-Jui Hsieh -
2020 Spotlight: Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations »
Huan Zhang · Hongge Chen · Chaowei Xiao · Bo Li · Mingyan Liu · Duane Boning · Cho-Jui Hsieh -
2020 Poster: On Convergence of Nearest Neighbor Classifiers over Feature Transformations »
Luka Rimanic · Cedric Renggli · Bo Li · Ce Zhang -
2019 : TBD »
Dawn Song -
2019 Poster: Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty »
Dan Hendrycks · Mantas Mazeika · Saurav Kadavath · Dawn Song -
2018 Workshop: Workshop on Security in Machine Learning »
Nicolas Papernot · Jacob Steinhardt · Matt Fredrikson · Kamalika Chaudhuri · Florian Tramer -
2018 Poster: Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise »
Dan Hendrycks · Mantas Mazeika · Duncan Wilson · Kevin Gimpel -
2018 Poster: Semidefinite relaxations for certifying robustness to adversarial examples »
Aditi Raghunathan · Jacob Steinhardt · Percy Liang -
2018 Poster: Tree-to-tree Neural Networks for Program Translation »
Xinyun Chen · Chang Liu · Dawn Song -
2017 Workshop: Aligned Artificial Intelligence »
Dylan Hadfield-Menell · Jacob Steinhardt · David Duvenaud · David Krueger · Anca Dragan -
2017 : Panel »
Garth Gibson · Joseph Gonzalez · John Langford · Dawn Song -
2017 Workshop: Machine Learning and Computer Security »
Jacob Steinhardt · Nicolas Papernot · Bo Li · Chang Liu · Percy Liang · Dawn Song -
2017 Poster: Certified Defenses for Data Poisoning Attacks »
Jacob Steinhardt · Pang Wei Koh · Percy Liang -
2016 : Opening Remarks »
Jacob Steinhardt -
2016 Workshop: Reliable Machine Learning in the Wild »
Dylan Hadfield-Menell · Adrian Weller · David Duvenaud · Jacob Steinhardt · Percy Liang -
2015 Poster: Learning with Relaxed Supervision »
Jacob Steinhardt · Percy Liang -
2009 Poster: Tracking Dynamic Sources of Malicious Activity at Internet Scale »
Shobha Venkataraman · Avrim Blum · Dawn Song · Subhabrata Sen · Oliver Spatscheck -
2009 Spotlight: Tracking Dynamic Sources of Malicious Activity at Internet Scale »
Shobha Venkataraman · Avrim Blum · Dawn Song · Subhabrata Sen · Oliver Spatscheck