

Tutorial

Cross-disciplinary insights into alignment in humans and machines

Gillian Hadfield · Dylan Hadfield-Menell · Joel Leibo · Rakshit Trivedi

West Exhibition Hall C, B3
Tue 10 Dec 1:30 p.m. PST — 4 p.m. PST

Abstract:

Aligning the behavior of AI systems and agents with human goals and values continues to be a major challenge. But the problem is not novel: disciplines such as economics, political science, legal theory, and cultural evolutionary theory have grappled for decades if not centuries with the question of how to align the behaviors of individuals with the well-being of other individuals and entire societies. Markets, legal institutions and rules, and political processes are mechanisms on which human societies rely to achieve goals such as well-being, fair treatment, and economic innovation and growth. In this tutorial, we will provide an introduction to these mechanisms: how they work and how they can inform a more robust approach to AI alignment. For example, a key misunderstanding in the current alignment literature is the idea that AI alignment can be achieved by fine-tuning AI agents and systems with a pre-defined set of human preferences; this is the principle underlying reinforcement learning from human feedback (RLHF) for large language models. Regulated market systems take a different approach to alignment: they use a variety of mechanisms to encourage self-interested firms and individuals to take actions that generate wealth without imposing excessive costs (externalities) on others. They focus on the alignment of the system, not the individual agent per se. In this tutorial we’ll introduce participants to core ideas from economics, law, political science, and cultural evolutionary theory to inform the next generation of thinking in AI safety and alignment.
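
As a toy illustration of this system-level perspective (not drawn from the tutorial materials), the sketch below uses a classic Pigouvian tax: a self-interested agent picks an output level to maximize its own profit, and rather than rewriting the agent's preferences, the mechanism adds a tax equal to the marginal external cost so that the agent's private optimum coincides with the social optimum. The prices, cost coefficients, and externality value are made-up numbers chosen only to make the arithmetic clean.

```python
# Toy sketch (assumed numbers): alignment via a mechanism rather than via
# fine-tuning the agent's objective. A Pigouvian tax equal to the marginal
# external cost makes the agent's self-interested choice socially optimal.

import numpy as np

PRICE = 10.0        # revenue per unit of output (assumed)
PRIVATE_COST = 2.0  # quadratic private cost coefficient (assumed)
EXTERNALITY = 4.0   # harm imposed on others per unit of output (assumed)

def private_profit(q: float, tax: float) -> float:
    """Profit as seen by the agent: revenue minus private cost minus tax."""
    return PRICE * q - PRIVATE_COST * q**2 - tax * q

def social_welfare(q: float) -> float:
    """Total welfare: the agent's profit minus the externality it imposes."""
    return PRICE * q - PRIVATE_COST * q**2 - EXTERNALITY * q

grid = np.linspace(0, 5, 501)  # candidate output levels

q_untaxed = grid[np.argmax([private_profit(q, tax=0.0) for q in grid])]
q_taxed = grid[np.argmax([private_profit(q, tax=EXTERNALITY) for q in grid])]
q_social = grid[np.argmax([social_welfare(q) for q in grid])]

print(f"agent's choice, no tax:        q = {q_untaxed:.2f}")  # 2.50
print(f"agent's choice, Pigouvian tax: q = {q_taxed:.2f}")    # 1.50
print(f"socially optimal output:       q = {q_social:.2f}")   # 1.50
# With the tax set to the marginal external cost, the self-interested choice
# matches the social optimum: the system is aligned without changing the
# agent's own objective function.
```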
