Poster
TaskBench: Benchmarking Large Language Models for Task Automation
Yongliang Shen · Kaitao Song · Xu Tan · Wenqi Zhang · Kan Ren · Siyu Yuan · Weiming Lu · Dongsheng Li · Yueting Zhuang
West Ballroom A-D #5403
In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which decomposes complex tasks described by user instructions into sub-tasks and invokes external tools to execute them, and which plays a central role in autonomous agents. However, there is a lack of systematic and standardized benchmarks to promote the development of LLMs in task automation. To address this, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation more challenging than for common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of a Tool Graph to represent the decomposed tasks behind a user intent, and adopt a back-instruct method to simulate user instructions and their annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool selection, and parameter prediction. Experimental results demonstrate that TaskBench effectively reflects the capability of LLMs in task automation. Benefiting from a combination of automated data construction and human verification, TaskBench achieves high consistency with human evaluation, making it a comprehensive and reliable benchmark for LLM-based autonomous agents. The code and datasets of TaskBench will be made publicly available.
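To make the Tool Graph abstraction and the three-stage evaluation concrete, the Python sketch below shows one way a predicted tool graph could be scored against an annotated one at the node level (tool selection) and the edge level (dependencies between tools). This is a minimal sketch under stated assumptions, not the authors' released code: the ToolGraph class, the f1 helper, the evaluate function, and the example tools are hypothetical illustrations.

```python
# Hypothetical sketch of a Tool Graph and a TaskEval-style scoring step
# (not TaskBench's actual implementation): compare a model's predicted
# tool graph with the annotation at the node and edge level.
from dataclasses import dataclass, field

@dataclass
class ToolGraph:
    nodes: set = field(default_factory=set)   # names of invoked tools
    edges: set = field(default_factory=set)   # (source_tool, target_tool) dependencies

def f1(pred: set, gold: set) -> float:
    """Set-level F1 between predicted and annotated items."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate(pred: ToolGraph, gold: ToolGraph) -> dict:
    """Score tool selection (nodes) and task structure (edges) separately."""
    return {"node_f1": f1(pred.nodes, gold.nodes),
            "edge_f1": f1(pred.edges, gold.edges)}

# Example: "caption this image, then translate the caption"
gold = ToolGraph(nodes={"image_captioning", "translation"},
                 edges={("image_captioning", "translation")})
pred = ToolGraph(nodes={"image_captioning", "translation", "ocr"},  # one spurious tool
                 edges={("image_captioning", "translation")})
print(evaluate(pred, gold))   # {'node_f1': 0.8, 'edge_f1': 1.0}
```

In the same spirit, the back-instruct construction described in the abstract runs in the opposite direction: a subgraph of the full tool graph is sampled first, and an LLM is prompted to write a user instruction that would require exactly those tools and dependencies, which then serves as the annotation (with human verification).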