Study on Short Text Classification with Imperfect Labels

by: LIANG Haowei, WANG Shi, CAO Cungen

Format:	Article
Published:	Editorial office of Computer Science 2023-01-01

Description

Short text classification techniques have been widely studied. When these techniques are applied to domain-specific short texts in production, two kinds of problems typically emerge as textual data accumulates: an imperfect label set and a mislabeled training dataset. First, the class label set is generally dynamic in nature. Second, when domain annotators label textual data, it is hard to distinguish some fine-grained class labels from others. To address these problems, this paper analyzes in depth the shortcomings of a real, complex telecom-domain label set with numerous classes and proposes a conceptual model for the imperfect multi-class label system. Based on this conceptual model, we introduce a semi-automatic method that iteratively detects conflicts and omissions in a labeled dataset with the help of a seed dataset, so that they can be repaired. After about six months of iteration, with the conflicts and omissions caused by the dynamic label set and annotator mistakes repaired, the F1-score of the BERT-based classification model is above 0.9 after filtering out the 10% of tickets with low classification confidence.
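
The final evaluation step in the abstract, scoring the classifier only on tickets that remain after the lowest-confidence 10% are set aside, can be illustrated with a minimal sketch. The function names (`keep_confident`, `macro_f1`), the choice of macro averaging, and the toy labels and confidences below are all assumptions made for illustration; the abstract does not state how the authors computed confidence or averaged F1.

```python
def keep_confident(confidences, drop_fraction=0.10):
    """Return indices kept after dropping the lowest-confidence fraction.

    Hypothetical helper: the paper does not specify its filtering rule.
    """
    n_drop = int(len(confidences) * drop_fraction)
    # Indices sorted by ascending confidence; the first n_drop are discarded.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(order[n_drop:])


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (not from the paper): ten tickets, one low-confidence mistake.
y_true = ["billing", "billing", "network", "network", "billing",
          "network", "billing", "network", "billing", "network"]
y_pred = ["billing", "billing", "network", "billing", "billing",
          "network", "billing", "network", "billing", "network"]
confidences = [0.98, 0.95, 0.91, 0.41, 0.88, 0.93, 0.97, 0.90, 0.86, 0.99]

kept = keep_confident(confidences, drop_fraction=0.10)
f1_all = macro_f1(y_true, y_pred)
f1_kept = macro_f1([y_true[i] for i in kept], [y_pred[i] for i in kept])
print(f"F1 on all tickets:  {f1_all:.3f}")
print(f"F1 on kept tickets: {f1_kept:.3f}")
```

In this toy case the single misclassified ticket is also the least confident one, so discarding it raises the score; on real data the gain depends on how well confidence tracks correctness.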
