

Poster

Enhancing Data Quality via Training Dynamics from Private Domains for Collaborative Fine-Tuning of Large Language Models

Wanru Zhao · Hongxiang Fan · Shell Xu Hu · Wangchunshu Zhou · Nicholas Lane


Abstract:

Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where data cannot be shared directly across silos. To address this, we propose a novel data quality control technique based on training dynamics that enhances the quality of data from different private domains in collaborative training settings such as model merging. The method first scores each training sample by tracing gradients through low-rank adapters and filters out low-quality data locally, and then adaptively merges the model parameters obtained from each silo's individual training. Furthermore, we develop a quality-control evaluation tailored to collaborative settings with heterogeneous medical-domain data. Experiments show that training on the high-quality data selected by our method often outperforms other data selection methods across diverse private-domain datasets, in both medical and multilingual settings. Our qualitative analysis demonstrates that our method offers a general, efficient, and effective way to identify the data that contributes most to large language models.
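
The abstract describes two steps: locally scoring and filtering each silo's samples via gradients of low-rank adapters, then adaptively merging the resulting adapter parameters. The snippet below is a minimal, self-contained sketch of what such a pipeline could look like; it is not the authors' implementation. The toy LoRA-style model, the cosine-similarity scoring against a small reference batch, and the fixed merge weights are all illustrative assumptions.

```python
# Sketch only: per-sample scoring via adapter gradients, local filtering,
# and a weighted merge of adapter parameters. All modeling details are
# placeholders, not the method from the paper.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank (A @ B) adapter."""

    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))

    def forward(self, x):
        return self.base(x) + x @ self.lora_A @ self.lora_B


def adapter_grad(model: nn.Module, x, y) -> torch.Tensor:
    """Flattened gradient of the loss w.r.t. the low-rank adapter only."""
    loss = F.cross_entropy(model(x), y)
    params = [p for n, p in model.named_parameters() if "lora" in n]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.flatten() for g in grads])


def score_samples(model, train_xs, train_ys, ref_x, ref_y):
    """Score each sample by gradient alignment with a reference batch (assumed rule)."""
    ref_g = adapter_grad(model, ref_x, ref_y)
    scores = [
        F.cosine_similarity(
            adapter_grad(model, x.unsqueeze(0), y.unsqueeze(0)), ref_g, dim=0
        ).item()
        for x, y in zip(train_xs, train_ys)
    ]
    return torch.tensor(scores)


def merge_adapters(silo_models, weights):
    """Weighted average of adapter parameters from individually trained silos."""
    merged = {n: torch.zeros_like(p)
              for n, p in silo_models[0].named_parameters() if "lora" in n}
    for w, m in zip(weights, silo_models):
        for n, p in m.named_parameters():
            if "lora" in n:
                merged[n] += w * p.detach()
    return merged


if __name__ == "__main__":
    torch.manual_seed(0)
    model = LoRALinear(16, 3)
    xs, ys = torch.randn(32, 16), torch.randint(0, 3, (32,))
    ref_x, ref_y = torch.randn(8, 16), torch.randint(0, 3, (8,))

    scores = score_samples(model, xs, ys, ref_x, ref_y)
    keep = scores >= scores.quantile(0.3)  # drop the lowest-scoring 30% locally
    print(f"kept {int(keep.sum())}/{len(scores)} samples")

    # Illustrative merge of two silos; real weights would be chosen adaptively.
    silos = [model, copy.deepcopy(model)]
    merged = merge_adapters(silos, weights=[0.6, 0.4])
```

In practice each silo would run the scoring and filtering on its own private data and only share adapter parameters (and possibly merge weights), which is what keeps the raw data local.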
