
Lightning Talk
Workshop: Data Centric AI

Automatic Data Quality Evaluation for Text Classification


Data quality is critical for machine learning, but it is usually evaluated through the performance of the models trained on the data. A model-independent data quality metric is needed. This paper proposes DQTC, a convenient metric based on information theory that quantifies data quality for text classification. An experiment is conducted to verify the relationship between DQTC and model performance. Finally, we describe the linguistic improvements that should be considered. The code is available online.