MELISSA: Multi-level Evaluation with LLM-based Integrated Self-Scrutiny and Auditing
Abstract
As AI systems increasingly conduct complex multi-turn interactions, reliable evaluation becomes critical yet challenging. Current LLM-as-judge approaches suffer from severe biases and struggle with lengthy conversations, while monolithic evaluation misses quality variations across dialogue segments. We present MELISSA, a framework that hierarchically decomposes conversations and learns bias corrections from human judgments, requiring no model fine-tuning. Evaluation of 100 AI-conducted technical interviews against expert annotations reveals surprising insights: bias correction alone reduces error by over 50\%, indicating that LLMs struggle with scale calibration rather than quality discrimination; a properly aligned GPT-4o-mini outperforms unaligned Claude 3.5, enabling order-of-magnitude cost reductions; and optional audit mechanisms show mixed results, potentially providing confidence signals but often introducing unnecessary edits that degrade performance when models already produce well-calibrated evaluations, which highlights the importance of empirical validation over intuitive design. These findings demonstrate that simple calibration transforms weak models into reliable judges, while reinforcing that high-quality human data remains essential for automated evaluation systems.
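The abstract does not specify how MELISSA learns its bias corrections; as a hedged illustration of what "simple calibration" against human judgments could look like, the sketch below fits an affine (slope and intercept) map from raw LLM-judge scores to human scores by least squares and applies it to new judge outputs. The function names (`fit_linear_calibration`, `apply_calibration`) and the example scores are hypothetical and not taken from the paper.

```python
import numpy as np

def fit_linear_calibration(llm_scores, human_scores):
    """Fit a least-squares affine map from raw LLM-judge scores to human scores.

    Hypothetical helper: illustrates one minimal form a learned bias
    correction could take, using only a small annotated sample and no
    fine-tuning of the judge model itself.
    """
    A = np.vstack([llm_scores, np.ones_like(llm_scores)]).T
    slope, intercept = np.linalg.lstsq(A, human_scores, rcond=None)[0]
    return slope, intercept

def apply_calibration(llm_scores, slope, intercept, lo=1.0, hi=5.0):
    """Map raw judge scores onto the human scale and clip to the rating range."""
    return np.clip(slope * np.asarray(llm_scores) + intercept, lo, hi)

# Illustrative (made-up) data: the judge's raw scores cluster high,
# reflecting scale miscalibration even when its ranking roughly matches humans.
raw = np.array([4.5, 4.7, 4.2, 4.8, 4.4])
human = np.array([3.0, 3.5, 2.5, 4.0, 3.0])

slope, intercept = fit_linear_calibration(raw, human)
calibrated = apply_calibration(raw, slope, intercept)
print("MAE before calibration:", np.mean(np.abs(raw - human)))
print("MAE after calibration: ", np.mean(np.abs(calibrated - human)))
```

Because only two parameters are estimated, a handful of human-annotated conversations would suffice, and the judge model itself is never fine-tuned, consistent with the abstract's framing of calibration as a lightweight correction learned from human judgments.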