65 Results

Workshop | HarmAnalyst: Interpretable, transparent, and steerable LLM safety moderation | Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine

Workshop | Declarative characterizations of direct preference alignment algorithms | Kyle Richardson · Vivek Srikumar · Ashish Sabharwal

Poster | Wed 11:00 | The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models | Hannah Rose Kirk · Alexander Whitefield · Paul Rottger · Andrew M. Bean · Katerina Margatina · Rafael Mosquera-Gomez · Juan Ciro · Max Bartolo · Adina Williams · He He · Bertie Vidgen · Scott Hale

Oral | Wed 10:40 | The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models | Hannah Rose Kirk · Alexander Whitefield · Paul Rottger · Andrew M. Bean · Katerina Margatina · Rafael Mosquera-Gomez · Juan Ciro · Max Bartolo · Adina Williams · He He · Bertie Vidgen · Scott Hale

Workshop | Fine-Tuning Language Models for Ethical Ambiguity: A Comparative Study of Alignment with Human Responses | Pranav Senthilkumar · Visshwa Balasubramanian · Aneesa Maity · Prisha Jain · Kevin Zhu · Jonathan Lu