65 Results

Workshop | HarmAnalyst: Interpretable, transparent, and steerable LLM safety moderation | Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine

Workshop | Declarative characterizations of direct preference alignment algorithms | Kyle Richardson · Vivek Srikumar · Ashish Sabharwal

Poster | Wed 11:00 | The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models | Hannah Rose Kirk · Alexander Whitefield · Paul Rottger · Andrew M. Bean · Katerina Margatina · Rafael Mosquera-Gomez · Juan Ciro · Max Bartolo · Adina Williams · He He · Bertie Vidgen · Scott Hale

Oral | Wed 10:40 | The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models | Hannah Rose Kirk · Alexander Whitefield · Paul Rottger · Andrew M. Bean · Katerina Margatina · Rafael Mosquera-Gomez · Juan Ciro · Max Bartolo · Adina Williams · He He · Bertie Vidgen · Scott Hale

Workshop | Fine-Tuning Language Models for Ethical Ambiguity: A Comparative Study of Alignment with Human Responses | Pranav Senthilkumar · Visshwa Balasubramanian · Aneesa Maity · Prisha Jain · Kevin Zhu · Jonathan Lu