Find in Library
Search millions of books, articles, and more
Indexed Open Access Databases
Study on Short Text Classification with Imperfect Labels
oleh: LIANG Haowei, WANG Shi, CAO Cungen
Format: | Article |
---|---|
Diterbitkan: | Editorial office of Computer Science 2023-01-01 |
Deskripsi
Short text classification techniques have been widely studied.When these techniques are applied to domain short text forproduction,as textual data accumulates,people often encounter problems mainly in two aspects:the imperfect labels and mistakenly-labeled training dataset.First,the class label set is generally dynamic in nature.Second,when domain annotators label textual data,it is hard to distinguish some fine-grained class label from others.For the above problems,this paper analyzes the shortcomings of an actual and complex telecom domain label set with numerous classes in depth and proposes a conceptual model for the imperfect multi-classification label system.Based on the conceptual model,for repairing the conflicts and omissions in a labeled dataset,we introduce a semi-automatic method for detecting these problems iteratively with the help of a seed dataset.After repairing the conflicts and omissions caused by a dynamic label set and mistakes of annotators,after about six months of iteration,the F1-score of the BERT-based classification model is above 0.9 after filtering out 10% tickets with low classification confidence.